legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

RFO: Restructure, formalize, and validate MANIFEST files #50

Open StevenCannon-USDA opened 1 month ago

StevenCannon-USDA commented 1 month ago

We've had MANIFEST files nearly since the inception of the Data Store (the content was originally included in the README file in each collection, but we decided early on to move the per-file metadata into two MANIFEST files). However, these files aren't validated and, to my knowledge, aren't being used programmatically.

There are potential uses for per-file metadata, though. For example, the diversity collections often contain multiple VCFs. Which of these should be displayed in tools such as JBrowse and GCViT? We could use a file-naming convention, as we've done with e.g. "genome_main.gff3" in the annotations collections, but if multiple VCFs in a collection should be displayed, a label such as "main" isn't appropriate.

So, the proposal:

  1. Formalize the MANIFEST file as a (validated) yml document
  2. Merge the current "descriptions" and "correspondence" files into a single file MANIFEST.metadata_file_prefix.yml
  3. Allow additional fields with programmatic use, e.g. "display: true"
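The merge in step 2 could be sketched roughly as below. This is only an illustration: the layout of the two legacy files is an assumption here (a mapping of file name to description, and a mapping of file name to one or more prior names), and `merge_manifests` is a hypothetical helper, not existing Data Store tooling. The description string is abbreviated.

```python
def merge_manifests(descriptions, correspondence, display=()):
    """Merge two legacy mappings into one list of per-file entries.

    descriptions:   {file name: description}            (assumed legacy layout)
    correspondence: {file name: prior name or names}    (assumed legacy layout)
    display:        file names to mark with "display: true"
    """
    merged = []
    for name in sorted(descriptions):
        entry = {"name": name, "description": descriptions[name]}
        if name in display:
            entry["display"] = True
        prior = correspondence.get(name)
        if prior:
            # Accept either a single string or a list of strings.
            entry["prior_names"] = [prior] if isinstance(prior, str) else list(prior)
        merged.append(entry)
    return merged


# Hypothetical legacy content for one file in the collection.
descriptions = {
    "glyma.Wm82.gnm2.div.Wickland_Battu_2017.SNPdata1.vcf.gz":
        "genotype information from Population 1",
}
correspondence = {
    "glyma.Wm82.gnm2.div.Wickland_Battu_2017.SNPdata1.vcf.gz":
        "Pop1_SNPs_minDP2.vcf.gz",
}
merged = merge_manifests(descriptions, correspondence, display=set(descriptions))
```

The resulting list of dictionaries has the same shape as the merged YAML document, so it could be written out with any YAML serializer.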

An example of the proposed merged, restructured file, from collection Glycine/max/diversity/Wm82.gnm2.div.Wickland_Battu_2017

cat MANIFEST.Wm82.gnm2.div.Wickland_Battu_2017.yml
---
- name: glyma.Wm82.gnm2.div.Wickland_Battu_2017.SNPdata1.vcf.gz
  description: genotype information from Population 1; 378 F2 lines resulting from
    a cross between Prize and an NMU-mutagenized individual of Williams 82.
  display: true
  prior_names:
    - glyma.Wm82.gnm2.div.RW0X.SNPdata1.vcf.gz
    - Pop1_SNPs_minDP2.vcf.gz
- name: glyma.Wm82.gnm2.div.Wickland_Battu_2017.SNPdata2.vcf.gz
  description: genotype information from Population 2; 391 F2 individuals from a
    cross between two breeding lines.
  display: true
  prior_names:
    - glyma.Wm82.gnm2.div.RW0X.SNPdata2.vcf.gz
    - Pop2_SNPs_minDP2.vcf.gz
- name: glyma.Wm82.gnm2.div.Wickland_Battu_2017.SNPdata3.vcf.gz
  description: genotype information from Population 3; 81 unrelated accessions
    that form an association panel.
  display: true
  prior_names:
    - glyma.Wm82.gnm2.div.RW0X.SNPdata3.vcf.gz
    - Pop3_SNPs_minDP2.vcf.gz

StevenCannon-USDA commented 1 month ago

Tagging especially @adf-ncgr, @ctcncgr, @nathanweeks for your review & consideration

adf-ncgr commented 1 month ago

seems like a good idea to me, though I think perhaps some more thought is needed around the "additional fields with programmatic use" aspect. For example, "display: true" could be interpreted in a lot of ways (e.g. it could mean that any file not tagged with that shouldn't appear in the h5ai view). Maybe it would be swinging too far in the direction of specificity but I could imagine using attributes in such a file to specify exactly into which of our various systems a given data file has (or should/should not be) included (e.g. glycinemine: true, sequenceserver.legumeinfo.org: false). But I definitely like having a programmatic location for attributes like description that could be consumed by things like the autocontent scripts.

As far as file-naming conventions are concerned, I see your point but I do also think that we ought to maintain the established file naming conventions (e.g. genome_main.fna) where they suffice (you probably weren't suggesting that we overturn them, just wanted to be clear about it...)

StevenCannon-USDA commented 1 month ago

"we ought to maintain the established file naming conventions (e.g. genome_main.fna)"

Yes, for sure.

This conversation arose regarding the diversity collections, which have been minimally specified to date. It is possible that the extra field(s) would only be used for that file type, but I could imagine them being used for others such as synteny tracks, expression data, etc.

StevenCannon-USDA commented 2 weeks ago

OK, I have generated (provisional) MANIFEST files for all of the Glycine/max/diversity collections. The intent is for those to be tracked as metadata, so I have modified the .gitignore to ignore the two-file MANIFEST files elsewhere throughout the Data Store. Specifically, these are ignored:

  MANIFEST.*.correspondence.yml
  MANIFEST.*.descriptions.yml

... while this one is tracked:

  MANIFEST.Wm82.gnm1.div.Hu_Zhang_2020.yml
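
As a quick sanity check of which names the two ignore patterns catch, here is a small sketch using Python's `fnmatch`. It only approximates gitignore matching (the two agree on `*` for bare file names without slashes), and the `.correspondence.yml` and `.descriptions.yml` file names below are hypothetical examples built from the patterns.

```python
from fnmatch import fnmatchcase

# The two patterns listed above as ignored.
IGNORED_PATTERNS = [
    "MANIFEST.*.correspondence.yml",
    "MANIFEST.*.descriptions.yml",
]


def is_ignored(filename):
    """Return True if a bare file name matches one of the ignore patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in IGNORED_PATTERNS)


examples = [
    "MANIFEST.Wm82.gnm1.div.Hu_Zhang_2020.correspondence.yml",  # hypothetical
    "MANIFEST.Wm82.gnm1.div.Hu_Zhang_2020.descriptions.yml",    # hypothetical
    "MANIFEST.Wm82.gnm1.div.Hu_Zhang_2020.yml",
]
for name in examples:
    print(name, "ignored" if is_ignored(name) else "tracked")
```

The single-file manifest stays tracked because its name does not end in `.correspondence.yml` or `.descriptions.yml`.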

StevenCannon-USDA commented 2 weeks ago

After discussion at the LIS/PB/SB meeting today, I have renamed the "display" field to "applications", and implemented it for the Glycine/max/diversity collections. @nathanweeks @adf-ncgr

Here is an approximate specification -- which I'll put in place at the datastore-specifications repository pending discussion here:


For each Data Store collection, a single file MANIFEST.collection_name.yml will be used to provide basic information about data files in the collection.

The MANIFEST file must include, for each data file (bgzipped or, in rare special cases, gzipped), the name of the file and a description of the file. The MANIFEST file should NOT include index files (e.g. .gz.tbi or .gz.fai) and should NOT include other metadata files.

Optional additional fields: applications, with yaml array of one or more applications that should use the indicated file; and prior_names, with yaml array of one or more previous names for the indicated file.

If no application in LegumeInfo/PeanutBase/SoyBase directly consumes a file, the "applications" field should be omitted for that file. In some cases, an application discovers certain files by other means (e.g. genome_main.fna and gene_models_main.gff3); in those cases, the "applications" field should also be omitted.
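
A consumer could use the "applications" field along these lines. This is a minimal sketch that assumes the manifest has already been parsed into a list of dictionaries; `files_for_application` is a hypothetical helper name, and the descriptions are shortened.

```python
def files_for_application(entries, application):
    """Return the names of files whose "applications" list includes `application`.

    Entries without an "applications" field (or with an empty one) are skipped,
    matching the rule that the field is omitted when nothing consumes the file.
    """
    return [
        entry["name"]
        for entry in entries
        if application in (entry.get("applications") or [])
    ]


# Parsed form of a two-file manifest (descriptions shortened).
entries = [
    {
        "name": "glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.hmp.gz",
        "description": "hapmap format",
    },
    {
        "name": "glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.vcf.gz",
        "description": "VCF format",
        "applications": ["jbrowse"],
    },
]
jbrowse_files = files_for_application(entries, "jbrowse")
```

Here only the VCF would be selected for "jbrowse"; the hapmap file has no "applications" field and is ignored.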

The file must be valid yaml, as tested with yamllint or equivalent.

Example:

cat Wm82.gnm1.div.Hu_Zhang_2020/MANIFEST.Wm82.gnm1.div.Hu_Zhang_2020.yml 
---
- name: glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.hmp.gz
  description: Genotype information for 96 wild soybean accessions in hapmap format
  prior_names:
    - glyma.Wm82.gnm1.div.WKJG.SNPdata.hmp.gz
    - 96w_used_tassel.hmp.txt
- name: glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.vcf.gz
  description: Genotype information for 96 wild soybean accessions in VCF format
  applications:
    - jbrowse
  prior_names:
    - glyma.Wm82.gnm1.div.WKJG.SNPdata.vcf.gz
    - 96w_used.vcf.gz
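
Beyond a yamllint syntax check, the field rules above could be checked with something like the following. This is a sketch under stated assumptions, not an agreed validator: it takes the already-parsed document (for instance the result of a YAML loader such as PyYAML's `yaml.safe_load`, which is not part of this thread), `validate_manifest` is a hypothetical name, and only the two index suffixes mentioned in the specification are rejected.

```python
REQUIRED_FIELDS = ("name", "description")
LIST_FIELDS = ("applications", "prior_names")
INDEX_SUFFIXES = (".gz.tbi", ".gz.fai")  # only the suffixes named in the spec


def validate_manifest(entries):
    """Return a list of error strings; an empty list means the manifest passes."""
    if not isinstance(entries, list):
        return ["top level must be a list of file entries"]
    errors = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            errors.append(f"entry {i}: must be a mapping")
            continue
        name = entry.get("name")
        label = name if isinstance(name, str) and name else f"entry {i}"
        # Required fields must be non-empty strings.
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            if not isinstance(value, str) or not value.strip():
                errors.append(f"{label}: missing or empty '{field}'")
        # Index files should not be listed in the manifest.
        if isinstance(name, str) and name.endswith(INDEX_SUFFIXES):
            errors.append(f"{label}: index files should not be listed")
        # Optional fields, when present, must be non-empty lists of strings.
        for field in LIST_FIELDS:
            if field in entry:
                value = entry[field]
                if (not isinstance(value, list) or not value
                        or not all(isinstance(v, str) for v in value)):
                    errors.append(f"{label}: '{field}' must be a non-empty list of strings")
    return errors


# Parsed form of the second entry in the example above.
good = [
    {
        "name": "glyma.Wm82.gnm1.div.Hu_Zhang_2020.SNPdata.vcf.gz",
        "description": "Genotype information for 96 wild soybean accessions in VCF format",
        "applications": ["jbrowse"],
        "prior_names": ["glyma.Wm82.gnm1.div.WKJG.SNPdata.vcf.gz", "96w_used.vcf.gz"],
    },
]
print(validate_manifest(good) or "manifest OK")
```

Returning a list of messages rather than raising on the first problem lets a curator see every issue in a collection's manifest in one pass.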