microbiomedata / nmdc-metadata

Managing metadata and policy around metadata in NMDC
https://microbiomedata.github.io/nmdc-schema/

document and map schema elements for workflows and outputs #244

Open cmungall opened 3 years ago

cmungall commented 3 years ago

In the run-up to GSP we rapidly added many classes and fields to accommodate all omics workflows and outputs. Post-GSP we should return to these, document them fully, and map them to existing ontologies where possible.

In some cases we have good documentation but still need to map to existing standards/ontologies, e.g.:

  min_q_value:
    description: >-
      smallest Q-Value associated with the peptide sequence as provided by MSGFPlus tool
    range: float
  peptide_spectral_count:
    description: >-
      sum of filter passing MS2 spectra associated with the peptide sequence within a given LC-MS/MS data file
    range: integer
  peptide_sum_masic_abundance:
    description: >-
      combined MS1 extracted ion chromatograms derived from MS2 spectra associated with the peptide sequence from a given LC-MS/MS data file using the MASIC tool
    range: integer

These terms could be mapped to metaP standards from PSI, or potentially to OBI.
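A rough sketch of how such mappings might be recorded in the schema YAML, assuming LinkML-style `exact_mappings` / `close_mappings` slot metadata. The `MS:XXXXXXX` and `OBI:XXXXXXX` identifiers are placeholders, not real term IDs, and the choice of exact versus close mapping is only illustrative; the actual terms still need to be looked up.

```yaml
prefixes:
  MS: http://purl.obolibrary.org/obo/MS_
  OBI: http://purl.obolibrary.org/obo/OBI_

slots:
  min_q_value:
    description: >-
      smallest Q-Value associated with the peptide sequence as provided by MSGFPlus tool
    range: float
    exact_mappings:
      - MS:XXXXXXX    # placeholder: PSI-MS term for a q-value, to be looked up
  peptide_spectral_count:
    description: >-
      sum of filter passing MS2 spectra associated with the peptide sequence within a given LC-MS/MS data file
    range: integer
    close_mappings:
      - OBI:XXXXXXX   # placeholder: only if PSI has no suitable term
```

Keeping the mapping next to the field definition means the documentation and the ontology link are reviewed together.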

We should create an enum for data object type and have the values mapped to the appropriate ontology. We need to determine if OBI, EDAM, or SWO is appropriate here.
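As a starting point for discussion, such an enum might look like the sketch below, assuming LinkML-style `enums` with `permissible_values` and `meaning`. The enum name, the two values, and the `EDAM:XXXXXXX` CURIEs are all hypothetical placeholders; the prefix would change if OBI or SWO turns out to be the better fit.

```yaml
enums:
  data_object_type_enum:            # hypothetical name
    description: >-
      The kind of file or output that a data object represents
    permissible_values:
      metagenome_raw_reads:         # hypothetical value
        description: Raw sequencing reads from a metagenome
        meaning: EDAM:XXXXXXX       # placeholder, term to be chosen
      assembly_contigs:             # hypothetical value
        description: Contigs produced by a metagenome assembly workflow
        meaning: EDAM:XXXXXXX       # placeholder, term to be chosen

slots:
  data_object_type:                 # hypothetical slot using the enum
    description: The type of a data object
    range: data_object_type_enum
```

Using `meaning` on each permissible value keeps the ontology decision per value, so a mix of ontologies remains possible if no single one covers every output type.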

There are a lot of unmapped fields for metagenome assembly and MAGs, e.g.:

  scaf_logsum:
    is_a: metagenome assembly parameter
    description: >-
      The sum of the (length*log(length)) of all scaffolds, times some constant.  Increase the contiguity, the score will increase
    range: float

  scaf_powsum:
    is_a: metagenome assembly parameter
    description: >-
      Powersum of all scaffolds is the same as logsum except that it uses the sum of (length*(length^P)) for some power P (default P=0.25).
    range: float

  scaf_max:
    is_a: metagenome assembly parameter
    description: >-
      Maximum scaffold length.
    range: float

  scaf_bp:
    is_a: metagenome assembly parameter
    description: >-
      Total size in bp of all scaffolds.
    range: float

  scaf_N50:
    is_a: metagenome assembly parameter
    description: >-
      Given a set of scaffolds, each with its own length, the L50 count is defined as the smallest number of scaffolds whose length sum makes up half of genome size.
    range: float

It is still to be determined whether these are in scope for an existing ontology, or whether the nmdc schema fields could be proposed as a new standard.
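For whoever ends up documenting or mapping these, a literal reading of the `scaf_logsum` and `scaf_powsum` descriptions above can be written as follows, with $\ell_i$ the length of scaffold $i$. The constants $c$ and $c'$ and the base of the logarithm are not stated in the field descriptions, and that powsum is scaled by a constant at all is an assumption drawn from "the same as logsum".

```latex
\mathrm{scaf\_logsum} = c \sum_{i} \ell_i \,\log \ell_i ,
\qquad
\mathrm{scaf\_powsum} = c' \sum_{i} \ell_i \cdot \ell_i^{P}
\quad (\text{default } P = 0.25)
```

Both sums weight each scaffold by more than its length, so merging two scaffolds into one always raises the score, which matches the note that higher contiguity gives a higher score.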

SamuelPurvine commented 3 years ago

I would also very much like to make sure we address peptide quantification versus (or in addition to) protein quantification, and whether these can/should be separate JSON entities or be collapsed/grouped as you were suggesting. Once we have actual results to play with, it ought to be easier to suss out the best path.