microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

post berkeley-schema-fy24 merge issue: review `FileTypeEnum` composition and correlation with other `DataObject` slots/relationships #2186

Open turbomam opened 1 week ago

turbomam commented 1 week ago

As one example: what are the advantages and disadvantages of generality or specificity in

The same question might apply to other PVs in this enumeration.

low priority for now (in my opinion)

cc @mslarae13 @brynnz22

see also the following label (although we might want to remove it at some point)

for example, we could use a link like this instead of a label (berkeley-schema-fy24 in this case)

turbomam commented 1 week ago

Claude finds these different axes of differentiation (or concerns) in `FileTypeEnum`:

  1. Data Type / Analysis Method:

    • Metagenome data
    • Metabolomics data (FT ICR-MS, GC-MS)
    • Metaproteomics data
    • Assembly data
    • Annotation data (various types)
    • Read-based analysis
    • Taxonomic classification (GOTTCHA2, Kraken2, Centrifuge)
  2. Processing Stage:

    • Raw data
    • Filtered data
    • Error-corrected data
    • Assembled data
    • Annotated data
  3. File Format:

    • FASTQ
    • BAM
    • FASTA
    • GFF
    • JSON
    • TSV
    • PDF
    • HTML
  4. Sequencing Read Type:

    • Raw Read 1 (forward)
    • Raw Read 2 (reverse)
    • Interleaved paired-end
  5. Quality Control Stage:

    • QC Statistics
    • QC non-rRNA reads
  6. Biological Entity Focus:

    • Protein-related
    • Peptide-related
    • RNA-related (rRNA, tRNA, etc.)
    • Gene-related
  7. Output Type:

    • Report files
    • Statistical files
    • Plot files (heatmap, barplot, Krona plot)
    • Binning results
  8. Annotation Type:

    • Structural annotation
    • Functional annotation
    • Various specific annotation types (e.g., TIGRFam, CRT, Genemark, etc.)
  9. Compression Status:

    • Compressed files (e.g., zip files for bins)
    • Uncompressed files
  10. Workflow Stage:

    • Intermediate files
    • Final output files
    • Workflow statistics
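
To make the "mixed axes" concern concrete, here is a minimal Python sketch that tags a permissible-value label with the axes it appears to encode. Everything in it is an assumption for illustration: the `AXIS_KEYWORDS` table, the `axes_for` helper, and the sample labels are hypothetical and are not taken from the schema, and the keyword lists only cover a few of the axes listed above.

```python
# Hypothetical keyword tables, one per axis (illustrative only, not from nmdc-schema).
AXIS_KEYWORDS = {
    "file_format": ["FASTQ", "BAM", "FASTA", "GFF", "JSON", "TSV", "PDF", "HTML"],
    "processing_stage": ["Raw", "Filtered", "Assembled", "Annotated"],
    "tool": ["GOTTCHA2", "Kraken2", "Centrifuge"],
    "output_type": ["Report", "Statistics", "Plot", "Bins"],
}


def axes_for(label: str) -> dict[str, list[str]]:
    """Return, per axis, the keywords that occur as whole words in the label."""
    tokens = {token.lower() for token in label.split()}
    found: dict[str, list[str]] = {}
    for axis, keywords in AXIS_KEYWORDS.items():
        hits = [kw for kw in keywords if kw.lower() in tokens]
        if hits:
            found[axis] = hits
    return found


if __name__ == "__main__":
    # Sample labels are made up for this sketch; they may not match real PVs.
    for label in ["Kraken2 Krona Plot", "Filtered Sequencing Reads", "Assembly Contigs"]:
        print(label, "->", axes_for(label))
```

A label that hits two or more axes (a tool plus an output type, say) is a candidate for splitting across separate `DataObject` slots, while a label that hits none probably depends on an axis this sketch does not model.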