Open nsuvarnaiari opened 3 years ago
Arthur Brady's response:
Format questions: I can’t find a format for Apache Parquet files. Parquet is a columnar storage format (sort of similar to HDF5). I can't find anything at all on columnar storage. Recommendation: leave null; propose new term
Data Type questions: What data type should I use for our gene model file (gencode.v26.GRCh38.genes.gtf)? This file describes our collapsed gene model (genes, exons, transcripts, and their coordinates).
Which data type should I use for our gene expression data? The term “gene expression matrix” (data:3112) is described as “final processed data for a set of hybridizations in a microarray experiment”. We have gene expression matrices (at gene, exon, transcript, and junction levels), but these data are from RNA-seq, not microarray experiments.
What data type should I use for our data dictionaries? We have a sample annotation data dictionary and a subject phenotype data dictionary. These data dictionaries describe the columns in our annotation files.
What data type should I use for our subject phenotype file? I found a data type for sample annotations, but not for subject annotations.
I could not find a data type for our quantitative trait loci files (eQTLs, sQTLs, ieQTLs, etc.). These data are associations between a quantitative trait (e.g., expression of a gene) and a genotype.
What data type should I use for documentation files? I found a format for documentation files (format:2330) but not a data type.
Assay Type questions: I couldn’t find an assay type appropriate for quantitative trait loci files. I’m not sure it is appropriate to put an assay type on these files, since these are derived from an association between genotypes and gene expression, which are the underlying assays.
general flow is:
1. find and use perfect term. on success: stop. on failure:
2. find and use unsatisfyingly general ancestor term that (a) doesn't clash with the desired concept and (b) isn't a wholly uselessly general root node like "data" or "thing"; use your discretion as to where to give up, but try to be generous. on success: (1) use unsatisfying ancestor term in your submission; (2) submit proposed new (sufficiently specific) term to @Michelle Giglio and @Philippe Rocca-Serra*; (3) stop. on failure:
3. leave field null and (unless the field is straight-up inapplicable to the record, as with assay_type just above) submit proposed new term (including a lineage, if you have the cycles and think it's a good idea to do that) to @Michelle Giglio and @Philippe Rocca-Serra*.
* (or whoever the current ontology WG leads happen to be)
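For readers who prefer code, the three-step flow above can be sketched roughly as follows. This is only an illustration: `choose_term`, `TermDecision`, and `TOO_GENERAL` are hypothetical names, the list of "too general" roots is an assumption, and the candidate ancestors are assumed to have already been screened by a human for clashes with the desired concept.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Root-level terms considered too general to be useful.
# Illustrative only; the thread names "data" and "thing" as examples.
TOO_GENERAL = {"data", "thing"}


@dataclass
class TermDecision:
    term: Optional[str]      # value for the submission field; None means leave null
    propose_new_term: bool   # whether to send a new-term proposal to the ontology WG leads


def choose_term(
    perfect_term: Optional[str],
    ancestor_candidates: Sequence[str],
    field_applicable: bool = True,
) -> TermDecision:
    """Sketch of the flow: perfect term, else general ancestor, else null."""
    # Step 1: a perfect term exists, so use it and stop.
    if perfect_term is not None:
        return TermDecision(perfect_term, propose_new_term=False)

    # Step 2: fall back to the first acceptable ancestor (most specific first)
    # and propose a more specific term alongside it.
    for candidate in ancestor_candidates:
        if candidate.lower() not in TOO_GENERAL:
            return TermDecision(candidate, propose_new_term=True)

    # Step 3: leave the field null; propose a new term unless the field
    # simply does not apply to this record.
    return TermDecision(None, propose_new_term=field_applicable)
```

The `field_applicable=False` case corresponds to the assay_type situation discussed above, where the field is left null and no new term is proposed.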
final notes:
all of my specific answers above end in question marks because your own comfort with the proposed approximations should be the final determining factor in deciding on usage.
this is all surely relevant to the ontology WG, so NBD, but just FYI there's a dedicated C2M2 help channel at #c2m2helpdesk -- i expect it'll be fine for us to do any further needed iteration on term selection in this channel, but if we end up drifting into other subject areas, i'll shift downstream threads over there.
colleagues who know EDAM better than i do, or who have some experience in making similar decisions for your own DCCs' C2M2 submissions, please feel welcome to chime in and edit or rearrange my term suggestions at will.
From Jared @Jared Nedzel :
Folks, I’m working on the GTEx C2M2 submission and have some followup ontology questions from our meeting last week.
First, I wanted to confirm how we deal with compressed and tar/zip files. If the file is a gz containing a single file, then the format entry that I use is the format of the underlying file. For example, when I have a TSV file that is gzipped, I use format:3475 (TSV). If the file is a tar (or tar.gz) containing multiple different file types, then I use format:3981 (TAR).
Format questions:
Data Type questions:
Assay Type questions: I couldn’t find an assay type appropriate for quantitative trait loci files. I’m not sure it is appropriate to put an assay type on these files, since these are derived from an association between genotypes and gene expression, which are the underlying assays.
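The gz/tar convention described in Jared's message could be sketched as below. This is a minimal illustration, not part of any C2M2 tooling: `format_term` and `FORMAT_BY_EXTENSION` are hypothetical names, only the two format IDs quoted in the thread are included, and treating `.tgz` like `.tar.gz` is an assumption. The sketch also decides by file name alone, whereas the message applies TAR to archives containing multiple different file types.

```python
from typing import Optional

# Format IDs quoted in the thread; extend this mapping for other file types.
FORMAT_BY_EXTENSION = {".tsv": "format:3475"}  # TSV
TAR_FORMAT = "format:3981"                     # TAR


def format_term(filename: str) -> Optional[str]:
    """Return a format term for a file name, or None if no term is known."""
    name = filename.lower()

    # Tar archives (compressed or not) are labeled as TAR.
    if name.endswith((".tar", ".tar.gz", ".tgz")):
        return TAR_FORMAT

    # A gz wrapping a single file takes the format of the underlying file.
    if name.endswith(".gz"):
        name = name[: -len(".gz")]

    for extension, term in FORMAT_BY_EXTENSION.items():
        if name.endswith(extension):
            return term

    # No known term: leave the field null.
    return None
```

For example, `format_term("expression.tsv.gz")` yields the TSV term, while a name with no known mapping returns `None`, matching the "leave null; propose new term" recommendation.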