nih-cfde / ontologyWG


Ontology questions from Jared from GTEx #4

Open nsuvarnaiari opened 3 years ago

nsuvarnaiari commented 3 years ago

From Jared Nedzel:

Folks, I’m working on the GTEx C2M2 submission and have some follow-up ontology questions from our meeting last week.

First, I wanted to confirm how we deal with compressed and tar/zip files. If the file is a gz containing a single file, then the format entry that I use is the format of the underlying file. For example, when I have a TSV file that is gzipped, I use format:3475 (TSV). If the file is a tar (or tar.gz) containing multiple different file types, then I use format:3981 (TAR).

Format questions:

  1. I can’t find a format for Apache Parquet files. Parquet is a columnar storage format (sort of similar to HDF5).

Data Type questions:

  1. What data type should I use for our gene model file (gencode.v26.GRCh38.genes.gtf)? This file describes our collapsed gene model (genes, exons, transcripts, and their coordinates).
  2. Which data type should I use for our gene expression data? The term “gene expression matrix” (data:3112) is described as “final processed data for a set of hybridizations in a microarray experiment”. We have gene expression matrices (at gene, exon, transcript, and junction levels), but these data are from RNA-seq, not microarray experiments.
  3. What data type should I use for our data dictionaries? We have a sample annotation data dictionary and a subject phenotype data dictionary. These data dictionaries describe the columns in our annotation files.
  4. What data type should I use for our subject phenotype file? I found a data type for sample annotations, but not for subject annotations.
  5. I could not find a data type for our quantitative trait loci files (eQTLs, sQTLs, ieQTLs, etc.). These data are associations between a quantitative trait (e.g., expression of a gene) and a genotype.
  6. What data type should I use for documentation files? I found a format for documentation files (format:2330) but not a data type.

Assay Type questions: I couldn’t find an assay type appropriate for quantitative trait loci files. I’m not sure it is appropriate to put an assay type on these files, since these are derived from an association between genotypes and gene expression, which are the underlying assays.

nsuvarnaiari commented 3 years ago

Arthur Brady's response:

Format questions:

I can’t find a format for Apache Parquet files. Parquet is a columnar storage format (sort of similar to HDF5).

I can't find anything at all on columnar storage. Recommendation: leave null; propose a new term.
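
For reference, the format conventions discussed so far (Jared's handling of gzipped and tar files, and the Parquet recommendation just given) could be sketched in Python roughly as follows. This is only an illustration, not part of any C2M2 tooling: the function name `file_format_term`, the extension table, and the choice to treat every tar as format:3981 are assumptions, and only the two format terms cited in this thread are included.

```python
from pathlib import PurePosixPath
from typing import Optional

# Only the two terms cited in this thread; every other extension is left null.
EXTENSION_TO_FORMAT = {
    ".tsv": "format:3475",  # TSV
    ".tar": "format:3981",  # TAR
}

# Assumption: single-file compression wrappers that we "look through".
COMPRESSION_SUFFIXES = {".gz"}


def file_format_term(filename: str) -> Optional[str]:
    """Return a format term for a file name, or None to leave the field null."""
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]

    # A gzipped single file takes the format of the underlying file,
    # so drop any trailing compression suffix first.
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()

    if not suffixes:
        return None

    # Assumption: any tar (or tar.gz) is treated as a multi-type archive.
    # Unknown formats such as Parquet fall through to None (leave null).
    return EXTENSION_TO_FORMAT.get(suffixes[-1])


if __name__ == "__main__":
    for name in ("expression.tsv.gz", "bundle.tar.gz", "data.parquet"):
        print(name, "->", file_format_term(name))
```

Returning `None` for Parquet mirrors the "leave null" recommendation; once a term is proposed and accepted, it would simply be added to the table.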

Data Type questions: What data type should I use for our gene model file (gencode.v26.GRCh38.genes.gtf)? This file describes our collapsed gene model (genes, exons, transcripts, and their coordinates).

Which data type should I use for our gene expression data? The term “gene expression matrix” (data:3112) is described as “final processed data for a set of hybridizations in a microarray experiment”. We have gene expression matrices (at gene, exon, transcript, and junction levels), but these data are from RNA-seq, not microarray experiments.

What data type should I use for our data dictionaries? We have a sample annotation data dictionary and a subject phenotype data dictionary. These data dictionaries describe the columns in our annotation files.

What data type should I use for our subject phenotype file? I found a data type for sample annotations, but not for subject annotations.

I could not find a data type for our quantitative trait loci files (eQTLs, sQTLs, ieQTLs, etc.). These data are associations between a quantitative trait (e.g., expression of a gene) and a genotype.

What data type should I use for documentation files? I found a format for documentation files (format:2330) but not a data type.

Assay Type questions: I couldn’t find an assay type appropriate for quantitative trait loci files. I’m not sure it is appropriate to put an assay type on these files, since these are derived from an association between genotypes and gene expression, which are the underlying assays.

general flow is:

(or whoever the current ontology WG leads happen to be)

final notes: