Closed: emileyfadrosh closed this 2 years ago
@emileyfadrosh Is Kitware planning to use the file information? I can add file info as a comment in the schema, but it doesn't get translated into the jsonschema.
Sorry, I should have clarified: Donny will need the file information to make sure it is propagated to Kitware. It probably doesn't need to be added as a comment in the schema, but I will let @dwinston comment -- thoughts? Thanks!
@emileyfadrosh are you expecting, for example, a file with a name ending in “_tigrfam.gff” to automatically be assigned a file type of “TIGRFam Annotation GFF” on submission? If so, then this will need to be added to the schema so that it can be used formally by any code that processes metadata during submission.
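To make the question concrete, here is a minimal sketch of what suffix-based assignment on submission could look like. The table `SUFFIX_TO_FILE_TYPE` and the function `infer_file_type` are hypothetical names made up for illustration; they are not part of nmdc-schema or of any submission code mentioned in this thread.

```python
from typing import Optional

# Hypothetical suffix-to-type table; the two entries come from the
# file names and file types requested in this issue.
SUFFIX_TO_FILE_TYPE = {
    "_tigrfam.gff": "TIGRFam Annotation GFF",
    "_pfam.gff": "Pfam Annotation GFF",
}


def infer_file_type(file_name: str) -> Optional[str]:
    """Return the file type whose suffix matches file_name, or None if nothing matches."""
    for suffix, file_type in SUFFIX_TO_FILE_TYPE.items():
        if file_name.endswith(suffix):
            return file_type
    return None
```

For this to be used formally, the suffixes would have to come from the schema rather than from a hand-written table like the one above.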
@dwinston I've added the enums, but I need to add some utility methods for you to pull out the file name patterns.
@dwinston The new file enums are in the latest release of nmdc-schema. I've added an nmdc-data command line util:
nmdc-data -h
Usage: nmdc-data [OPTIONS]

Options:
  -f, --fetch TEXT  Fetches the specified data file from the nmdc-schema library.
                    Only one argument is permitted.
                    fetch arguments:
                      yaml           returns the nmdc.yaml file as a string
                      jsonschema     returns the NMDC jsonschema as json
                      dict           returns the NMDC jsonschema as a dict
                      schemadef      returns the SchemaDefintion created from the nmdc.yaml file
                      filetypeenums  returns informaton about the NMDC file type enums as json
                      goldsssom      returns the gold-to-mixs.sssom.tsv file contents
  -h, --help        Show this message and exit.
If you execute nmdc-data --fetch filetypeenums, you will get JSON output that looks like this:
[
  {
    "name": "FT ICR-MS Analysis Results",
    "description": "FT ICR-MS-based metabolite assignment results table",
    "file_name_pattern": null
  },
  {
    "name": "GC-MS Metabolomics Results",
    "description": "GC-MS-based metabolite assignment results table",
    "file_name_pattern": null
  },
  ...
  {
    "name": "SMART Annotation GFF",
    "description": "GFF3 format file with SMART",
    "file_name_pattern": "[GOLD-AP]_smart.gff"
  },
  {
    "name": "Pfam Annotation GFF",
    "description": "GFF3 format file with Pfam",
    "file_name_pattern": "[GOLD-AP]_pfam.gff"
  }
]
Can you work with this structure? If not, I can easily change it.
You can also access the underlying functions in nmdc_data.py
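As a rough illustration of consuming this structure, the sketch below parses JSON shaped like the output above and keeps only the entries that have a file name pattern. The helper `index_patterns` and the `sample_output` string are assumptions made for this example. In practice the text would presumably come from running nmdc-data --fetch filetypeenums. The functions inside nmdc_data.py are not used here because their names are not given in this thread.

```python
import json

# Two entries copied from the output shown above, standing in for the full list.
sample_output = """
[
  {"name": "FT ICR-MS Analysis Results",
   "description": "FT ICR-MS-based metabolite assignment results table",
   "file_name_pattern": null},
  {"name": "Pfam Annotation GFF",
   "description": "GFF3 format file with Pfam",
   "file_name_pattern": "[GOLD-AP]_pfam.gff"}
]
"""


def index_patterns(json_text):
    """Map each non-null file_name_pattern to the name of its file type."""
    entries = json.loads(json_text)
    return {
        entry["file_name_pattern"]: entry["name"]
        for entry in entries
        if entry["file_name_pattern"] is not None
    }


patterns = index_patterns(sample_output)
```

Entries whose pattern is null (JSON null becomes None in Python) are skipped, so file types without a known naming convention never take part in matching.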
@wdduncan Based on discussion during the API tutorial with @dwinston earlier today, we thought it would be a good idea to test a quick update to the schema to ensure we could ingest additional files. Could you please add the below under enums:
TIGRFam Annotation GFF:
  description: GFF3 format file with TIGRfam
  file: [GOLD-AP]_tigrfam.gff
Clusters of Orthologous Groups (COG) Annotation GFF:
  description: GFF3 format file with COGs
  file: [GOLD-AP]_cog.gff
CATH FunFams (Functional Families) Annotation GFF:
  description: GFF3 format file with CATH FunFams
  file: [GOLD-AP]_cath_funfam.gff
SUPERFam Annotation GFF:
  description: GFF3 format file with SUPERFam
  file: [GOLD-AP]_supfam.gff
SMART Annotation GFF:
  description: GFF3 format file with SMART
  file: [GOLD-AP]_smart.gff
Pfam Annotation GFF:
  description: GFF3 format file with Pfam
  file: [GOLD-AP]_pfam.gff
The [GOLD-AP] designation in each file name is just a placeholder to identify which file is being pulled from the workflow output. Please let me know if you have any questions, thanks!
@ssarrafan @scanon @hubin-keio
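One way to read the [GOLD-AP] placeholder described above is as a slot that gets filled with an identifier when a concrete file is written. The sketch below assumes exactly that. The function `expected_file_name` and the identifier "EXAMPLE-ID" are made up for illustration and do not come from the workflow or from nmdc-schema.

```python
PLACEHOLDER = "[GOLD-AP]"


def expected_file_name(pattern: str, identifier: str) -> str:
    """Fill the [GOLD-AP] slot of a file name pattern with a concrete identifier."""
    return pattern.replace(PLACEHOLDER, identifier)


# "EXAMPLE-ID" is a made-up identifier, not a real GOLD record.
cog_file = expected_file_name("[GOLD-AP]_cog.gff", "EXAMPLE-ID")
```

Under this reading, the part of each pattern after the placeholder (for example _cog.gff) is what distinguishes one annotation file from another.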