microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Update schema to include additional analysis output files #207

Closed by emileyfadrosh 2 years ago

emileyfadrosh commented 2 years ago

@wdduncan Based on discussion during the API tutorial with @dwinston earlier today, we thought it would be a good idea to test a quick update to the schema to ensure we could ingest additional files. Could you please add the below under enums:

TIGRFam Annotation GFF:
  description: GFF3 format file with TIGRfam
  file: [GOLD-AP]_tigrfam.gff

Clusters of Orthologous Groups (COG) Annotation GFF:
  description: GFF3 format file with COGs
  file: [GOLD-AP]_cog.gff

CATH FunFams (Functional Families) Annotation GFF:
  description: GFF3 format file with CATH FunFams
  file: [GOLD-AP]_cath_funfam.gff

SUPERFam Annotation GFF:
  description: GFF3 format file with SUPERFam
  file: [GOLD-AP]_supfam.gff

SMART Annotation GFF:
  description: GFF3 format file with SMART
  file: [GOLD-AP]_smart.gff

Pfam Annotation GFF:
  description: GFF3 format file with Pfam
  file: [GOLD-AP]_pfam.gff

The [GOLD-AP] designation in each file name is just there to identify which file is being pulled from the workflow output. Please let me know if you have any questions, thanks!

@ssarrafan @scanon @hubin-keio

wdduncan commented 2 years ago

@emileyfadrosh Is Kitware planning to use the file information? I can add the file info as a comment in the schema, but comments don't get translated into the JSON Schema.

emileyfadrosh commented 2 years ago

Sorry, I should have clarified: Donny will need the file information to make sure it is propagated to Kitware. It probably doesn't need to be added as a comment in the schema, but I will let @dwinston comment -- thoughts? Thanks!

dwinston commented 2 years ago

@emileyfadrosh are you expecting, for example, a file with a name ending in “_tigrfam.gff” to automatically be assigned a file type of “TIGRFam Annotation GFF” on submission? If so, then this will need to be added to the schema so that it can be used formally by any code that processes metadata during submission.
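The automatic assignment asked about here could look roughly like the sketch below. It assumes that `[GOLD-AP]` stands for an arbitrary prefix and that the rest of each requested pattern is a literal suffix. `SUFFIX_TO_FILE_TYPE` and `guess_file_type` are hypothetical names used for illustration, not part of nmdc-schema, and `Ga0123` is a made-up prefix.

```python
from typing import Optional

# Hypothetical lookup built from the file patterns requested in this issue.
# Each key is the literal part of the pattern that follows [GOLD-AP].
SUFFIX_TO_FILE_TYPE = {
    "_tigrfam.gff": "TIGRFam Annotation GFF",
    "_cog.gff": "Clusters of Orthologous Groups (COG) Annotation GFF",
    "_cath_funfam.gff": "CATH FunFams (Functional Families) Annotation GFF",
    "_supfam.gff": "SUPERFam Annotation GFF",
    "_smart.gff": "SMART Annotation GFF",
    "_pfam.gff": "Pfam Annotation GFF",
}


def guess_file_type(file_name: str) -> Optional[str]:
    """Return the file type whose suffix matches file_name, or None."""
    for suffix, file_type in SUFFIX_TO_FILE_TYPE.items():
        if file_name.endswith(suffix):
            return file_type
    return None


# "Ga0123" is a made-up prefix standing in for [GOLD-AP].
print(guess_file_type("Ga0123_tigrfam.gff"))
print(guess_file_type("Ga0123_readme.txt"))
```

A hard-coded table like this would drift from the schema over time, which is presumably why the patterns need to live in the schema itself.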

wdduncan commented 2 years ago

@dwinston I've added the enums, but I need to add some utility methods for you to pull out the file name patterns.

wdduncan commented 2 years ago

@dwinston The new file enums are in the latest release of nmdc-schema. I've also added an nmdc-data command-line utility:

nmdc-data -h
Usage: nmdc-data [OPTIONS]

Options:
  -f, --fetch TEXT  Fetches the specified data file from the nmdc-schema library.
                    Only one argument is permitted.

                    fetch arguments:
                    yaml            returns the nmdc.yaml file as a string
                    jsonschema      returns the NMDC jsonschema as json
                    dict            returns the NMDC jsonschema as a dict
                    schemadef       returns the SchemaDefintion created from the nmdc.yaml file
                    filetypeenums   returns informaton about the NMDC file type enums as json
                    goldsssom       returns the gold-to-mixs.sssom.tsv file contents

  -h, --help        Show this message and exit.

If you execute nmdc-data --fetch filetypeenums, you will get JSON output that looks like this:

[
  {
    "name": "FT ICR-MS Analysis Results",
    "description": "FT ICR-MS-based metabolite assignment results table",
    "file_name_pattern": null
  },
  {
    "name": "GC-MS Metabolomics Results",
    "description": "GC-MS-based metabolite assignment results table",
    "file_name_pattern": null
  },
  ...
  {
    "name": "SMART Annotation GFF",
    "description": "GFF3 format file with SMART",
    "file_name_pattern": "[GOLD-AP]_smart.gff"
  },
  {
    "name": "Pfam Annotation GFF",
    "description": "GFF3 format file with Pfam",
    "file_name_pattern": "[GOLD-AP]_pfam.gff"
  }
]

Can you work with this structure? If not, I can easily change it.
You can also access the underlying functions in nmdc_data.py.
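One way a consumer might work with this structure is sketched below. It assumes the JSON has exactly the shape shown above, that `[GOLD-AP]` is a placeholder for one or more arbitrary characters, and that entries with a null `file_name_pattern` should simply be skipped. `build_matchers` and `file_types_for` are hypothetical helpers, not functions from nmdc_data.py, and the sample records are copied from the output above.

```python
import json
import re
from typing import Dict, List

PLACEHOLDER = "[GOLD-AP]"

# Sample records copied from the output shown above.
SAMPLE_JSON = """
[
  {"name": "GC-MS Metabolomics Results",
   "description": "GC-MS-based metabolite assignment results table",
   "file_name_pattern": null},
  {"name": "SMART Annotation GFF",
   "description": "GFF3 format file with SMART",
   "file_name_pattern": "[GOLD-AP]_smart.gff"},
  {"name": "Pfam Annotation GFF",
   "description": "GFF3 format file with Pfam",
   "file_name_pattern": "[GOLD-AP]_pfam.gff"}
]
"""


def build_matchers(enum_json: str) -> Dict[str, re.Pattern]:
    """Map each enum name to a compiled regex for its file name pattern."""
    matchers = {}
    for record in json.loads(enum_json):
        pattern = record.get("file_name_pattern")
        if not pattern:
            continue  # no pattern defined for this file type
        # Escape the literal pieces; the placeholder matches 1+ characters.
        pieces = [re.escape(piece) for piece in pattern.split(PLACEHOLDER)]
        matchers[record["name"]] = re.compile(".+".join(pieces))
    return matchers


def file_types_for(file_name: str, matchers: Dict[str, re.Pattern]) -> List[str]:
    """Return the names of all file types whose pattern matches file_name."""
    return [name for name, rx in matchers.items() if rx.fullmatch(file_name)]


matchers = build_matchers(SAMPLE_JSON)
# "Ga0123" is a made-up prefix standing in for [GOLD-AP].
print(file_types_for("Ga0123_smart.gff", matchers))
```

In practice the JSON string would come from running nmdc-data --fetch filetypeenums instead of from an inline sample. Returning a list leaves room for a file name that matches more than one pattern, or none.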