Open StevenCannon-USDA opened 1 month ago
Tagging especially @adf-ncgr, @ctcncgr, @nathanweeks for your review & consideration
Seems like a good idea to me, though I think perhaps some more thought is needed around the "additional fields with programmatic use" aspect. For example, "display: true" could be interpreted in a lot of ways (e.g. it could mean that any file not tagged with that shouldn't appear in the h5ai view). Maybe it would be swinging too far in the direction of specificity, but I could imagine using attributes in such a file to specify exactly which of our various systems a given data file has been (or should or should not be) included in (e.g. glycinemine: true, sequenceserver.legumeinfo.org: false). But I definitely like having a programmatic location for attributes like description that could be consumed by things like the autocontent scripts.
As far as file-naming conventions are concerned, I see your point but I do also think that we ought to maintain the established file naming conventions (e.g. genome_main.fna) where they suffice (you probably weren't suggesting that we overturn them, just wanted to be clear about it...)
"we ought to maintain the established file naming conventions (e.g. genome_main.fna)"
Yes, for sure.
This conversation arose regarding the diversity collections, which have been minimally specified to date. It is possible that the extra field(s) would only be used for that file type, but I could imagine them being used for others such as synteny tracks, expression data, etc.
OK, I have generated (provisional) MANIFEST files for all of the Glycine/max/diversity collections. The intent is for those to be tracked as metadata, so I have modified the .gitignore to ignore the two-file MANIFESTS elsewhere through the Data Store. Specifically, these are ignored:
MANIFEST.*.correspondence.yml
MANIFEST.*.descriptions.yml
... while this one is tracked:
MANIFEST.Wm82.gnm1.div.Hu_Zhang_2020.yml
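As a quick sanity check of the two ignore patterns, here is a minimal Python sketch. It assumes that shell-style matching with `fnmatchcase` behaves the same as .gitignore matching for these simple patterns (true for a lone `*` within a file name, but not for gitignore rules in general). The two-file MANIFEST names in the loop are illustrative, built from the collection name above.

```python
from fnmatch import fnmatchcase

# The two ignore patterns quoted above.
IGNORED_PATTERNS = [
    "MANIFEST.*.correspondence.yml",
    "MANIFEST.*.descriptions.yml",
]


def is_ignored(filename: str) -> bool:
    """Return True if the bare file name matches one of the ignore patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in IGNORED_PATTERNS)


# Illustrative file names: two old-style MANIFESTs and the merged one.
for name in [
    "MANIFEST.Wm82.gnm1.div.Hu_Zhang_2020.correspondence.yml",
    "MANIFEST.Wm82.gnm1.div.Hu_Zhang_2020.descriptions.yml",
    "MANIFEST.Wm82.gnm1.div.Hu_Zhang_2020.yml",
]:
    print(name, "-> ignored" if is_ignored(name) else "-> tracked")
```

The merged file ends in the collection name followed directly by `.yml`, so neither pattern matches it and it stays tracked.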
After discussion at LIS/PB/SB meeting today, I have changed the display field to applications, and implemented it for the Glycine/max/diversity collections. @nathanweeks @adf-ncgr
Here is an approximate specification -- which I'll put in place at the datastore-specifications repository pending discussion here:
For each Data Store collection, a single file MANIFEST.collection_name.yml will be used to provide basic information about data files in the collection.
The MANIFEST file must include, for each data file (bgzipped or in rare special cases gzipped), the name of the file and a description of the file. The MANIFEST file should NOT include index files (e.g. .gz.tbi or .gz.fai) and should NOT include other metadata files.
Optional additional fields: applications, a YAML array of one or more applications that should use the indicated file; and prior_names, a YAML array of one or more previous names for the indicated file.
If no application in LegumeInfo/PeanutBase/SoyBase directly consumes a file, the "applications" field should be omitted. In some cases, an application discovers certain files by other means (e.g. genome_main.fna and gene_models_main.gff3); in those cases, the "applications" field should also be omitted.
The file must be valid yaml, as tested with yamllint or equivalent.
Example:
cat Wm82.gnm1.div.Hu_Zhang_2020/MANIFEST.Wm82.gnm1.div.Hu_Zhang_2020.yml
---
- name: glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.hmp.gz
  description: Genotype information for 96 wild soybean accessions in hapmap format
  prior_names:
    - glyma.Wm82.gnm1.div.WKJG.SNPdata.hmp.gz
    - 96w_used_tassel.hmp.txt
- name: glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.vcf.gz
  description: Genotype information for 96 wild soybean accessions in VCF format
  applications:
    - jbrowse
  prior_names:
    - glyma.Wm82.gnm1.div.WKJG.SNPdata.vcf.gz
    - 96w_used.vcf.gz
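Beyond yamllint, the field rules in the specification could be checked with something like the following Python sketch. It is only one reading of the spec, not an agreed validator: the function name, the error messages, and the index suffix list (`.tbi`, `.fai`) are assumptions. It works on already-parsed data (a list of dicts), such as a YAML loader like PyYAML's `yaml.safe_load` would return for the example above.

```python
# Assumed suffixes for index files, which the spec says should not be listed.
INDEX_SUFFIXES = (".tbi", ".fai")
REQUIRED_FIELDS = ("name", "description")
OPTIONAL_ARRAY_FIELDS = ("applications", "prior_names")


def validate_manifest(entries):
    """Return a list of problems found; an empty list means none were found."""
    if not isinstance(entries, list):
        return ["top level must be a sequence of file records"]
    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"record {i}: expected a mapping")
            continue
        label = entry.get("name", f"record {i}")
        # name and description are mandatory, non-empty strings.
        for key in REQUIRED_FIELDS:
            value = entry.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{label}: missing or empty '{key}'")
        # applications and prior_names, when present, are non-empty string lists.
        for key in OPTIONAL_ARRAY_FIELDS:
            if key in entry:
                value = entry[key]
                ok = (isinstance(value, list) and value
                      and all(isinstance(v, str) for v in value))
                if not ok:
                    problems.append(f"{label}: '{key}' must be a non-empty list of strings")
        name = entry.get("name")
        if isinstance(name, str) and name.endswith(INDEX_SUFFIXES):
            problems.append(f"{label}: index files should not be listed")
    return problems


# The example MANIFEST above, as parsed data.
manifest = [
    {
        "name": "glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.hmp.gz",
        "description": "Genotype information for 96 wild soybean accessions in hapmap format",
        "prior_names": ["glyma.Wm82.gnm1.div.WKJG.SNPdata.hmp.gz", "96w_used_tassel.hmp.txt"],
    },
    {
        "name": "glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.vcf.gz",
        "description": "Genotype information for 96 wild soybean accessions in VCF format",
        "applications": ["jbrowse"],
        "prior_names": ["glyma.Wm82.gnm1.div.WKJG.SNPdata.vcf.gz", "96w_used.vcf.gz"],
    },
]

print(validate_manifest(manifest))
```

Returning a list of problems rather than raising on the first one would let a check across the whole Data Store report everything in a single pass.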
We've had MANIFEST files nearly since the inception of the Data Store (actually, the content was originally included in the README file in each collection, but we decided early-on to move the per-file metadata into two MANIFEST files). However, these files aren't validated and aren't (to my knowledge) being used programmatically.
There are potential uses for per-file metadata though. For example, in the diversity collections, there are often multiple VCFs. Which of these should be displayed in tools such as JBrowse and GCViT? We could use a file-naming convention, as we've done with e.g. "genome_main.gff3" in the annotations collections, but if multiple VCFs in a collection should be displayed, a label such as "main" isn't appropriate.
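To make the "which VCFs should be displayed" question concrete, here is a hedged Python sketch of how a tool might consume per-file metadata. It assumes the applications and prior_names fields as specified elsewhere in this thread; the function names are hypothetical, and the records are trimmed copies of the Hu_Zhang_2020 example (descriptions left out for brevity).

```python
def files_for_application(entries, application):
    """Names of files whose 'applications' list includes the given application."""
    return [e["name"] for e in entries if application in e.get("applications", [])]


def current_name(entries, old_name):
    """Map a current or prior file name to the current name, or None if unknown."""
    for e in entries:
        if old_name == e["name"] or old_name in e.get("prior_names", []):
            return e["name"]
    return None


# Trimmed records from the Hu_Zhang_2020 example in this thread.
manifest = [
    {
        "name": "glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.hmp.gz",
        "prior_names": ["glyma.Wm82.gnm1.div.WKJG.SNPdata.hmp.gz", "96w_used_tassel.hmp.txt"],
    },
    {
        "name": "glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.vcf.gz",
        "applications": ["jbrowse"],
        "prior_names": ["glyma.Wm82.gnm1.div.WKJG.SNPdata.vcf.gz", "96w_used.vcf.gz"],
    },
]

print(files_for_application(manifest, "jbrowse"))
print(current_name(manifest, "96w_used.vcf.gz"))
```

Because any number of records can list the same application, this handles the multiple-VCF case that a single "main" label cannot.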
So, the proposal:
An example of the proposed merged, restructured file, from collection
Glycine/max/diversity/Wm82.gnm2.div.Wickland_Battu_2017