microbiomedata / issues

public repo for issues related to NMDC work
2 stars 1 forks source link

Slot descriptions and fields standardization #705

Closed mslarae13 closed 4 months ago

mslarae13 commented 4 months ago

As part of metabolink2, @anastasiyaprymolenna , @brynnz22 , & @kheal make some new classes and slots or did some re-organizing. This issue is looking at these changes and determining if there's a standardized term that should be used

mslarae13 commented 4 months ago

https://github.com/microbiomedata/berkeley-schema-fy24/pull/169

mslarae13 commented 4 months ago

@kheal

I'm looking at.

ionization_source:
    range: IonizationSourceEnum
    description: which kind of ionization source is used to introduce analytes into a mass spectrometer

I'm looking at https://www.ebi.ac.uk/ols4/ontologies/ms/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMS_1000008 & all the childran.

Please confirm if this is what you're looking for!

Also, if the children are correct for the enum. Also, should athmospheric_pressure_chemical_ionization be `atmospheric_pressure_chemical_ionization

kheal commented 4 months ago

Yes, that is a direct mapping. Yes, please fix that typo! Good catch

mslarae13 commented 4 months ago

@kheal another one!

mass_analyzers:
    range: MassAnalyzerEnum
    description: which kind of mass analyzer(s) used during the spectra collection.
    multivalued: true

Looks like https://www.ebi.ac.uk/ols4/ontologies/ms/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMS_1000443

Confirm?

kheal commented 4 months ago

Yes, another direct mapping. You are a much better ontology detective than I am.

mslarae13 commented 4 months ago
resolution_categories:
    range: ResolutionCategoryEnum
    description: whether mass spectra are collected at higher than unit resolution
    multivalued: true

This description reads weird to me... the enum has high & low where you can enter values.

Whether mass spectra sounds like a yes or no?

Is this meant to be

"detector resolution: The resolving power of the detector to detect the smallest difference between two ions so that the valley between them is a specified fraction of the peak height. "

Or

"mass resolution: Smallest mass difference between two equal magnitude peaks so that the valley between them is a specified fraction of the peak height."

Or neither?

@kheal :)

mslarae13 commented 4 months ago
    range: AcquisitionStrategyEnum
    description: whether spectra are dependent on or independent of any other spectra during the experiment

Need to chat with Mark about the ontology structure for https://www.ebi.ac.uk/ols4/ontologies/ms/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMS_1003213?lang=en

@kheal this ontology term has many independent children types

data-independent acquisition (7) data independent acquisition from dissociation of full mass range data independent acquisition from dissociation of full mass range after ion mobility separation data independent acquisition from dissociation of scanning quadrupole across mass range data independent acquisition from dissociation of sequential mass ranges data independent acquisition from dissociation of sequential mass ranges after ion mobility separation

Does NMDC care about these? I assume just independent or dependent is sufficient?

kheal commented 4 months ago

re: resolution_categories slot. I agree re: the word whether, that indicates a bool range. Since there are only two, we could swap to a bool and get rid of that Enum or we should modify the description. The slot and Enum were designed by Sam and Yuri so I didn't attempt to change it as part of the MassSpecConfig work, but I actually think it would be better as a bool. But I don't know if bool slots can be multivalued (both True and False, which we would need here)!

Neither of those mappings are direct, I think the first is an indirect mapping. Those mappings would require a float range, we are just trying to capture high or low within that range.

kheal commented 4 months ago

@kheal this ontology term has many independent children types

data-independent acquisition (7) data independent acquisition from dissociation of full mass range data independent acquisition from dissociation of full mass range after ion mobility separation data independent acquisition from dissociation of scanning quadrupole across mass range data independent acquisition from dissociation of sequential mass ranges data independent acquisition from dissociation of sequential mass ranges after ion mobility separation

Does NMDC care about these? I assume just independent or dependent is sufficient?

Those are important, but as far as I know unused with the current data in NMDC. The description data independent acquisition from dissociation of full mass range is the "classic" DIA and would capture the most common type of DIA data we would get for proteomics, so I suggest that one as our description for the exisiting permissible value. The others we could add later as different permissible values if needed.

mslarae13 commented 4 months ago

Update

eluent_introduction_category: description: A high-level description for how the processed sample is introduced into a mass spectrometer. to "high level categorization" from Mark

mslarae13 commented 4 months ago

Update

has_mass_spectrometry_configuration: description: An identifier for the associated MassSpectrometryConfiguration.

to "The identifier of..."

mslarae13 commented 4 months ago

Update

has_chromatography_configuration: description: An identifier for the associated ChromatographyConfiguration, providing information about how a sample was introduced into the mass spectrometer.

To I suggest "The identifier of..."

mslarae13 commented 4 months ago

@turbomam how do I know what files are generated vs what I should edit for changing a slot?

Katherine has a typo, athomispheric should be atmospheric.

Do I need to update all files? Like enum_pv_results.tsv? Or just .yaml files?

kheal commented 4 months ago

Just the yaml, where the Enum is defined and the values are listed.

mslarae13 commented 4 months ago

@turbomam

See https://github.com/microbiomedata/issues/issues/705#issuecomment-2125316993

Katherine has confirmed that our term mass_analyzers has an exact mapping to https://www.ebi.ac.uk/ols4/ontologies/ms/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMS_1000443

The mass analyzers listed under this term match what we have in MassAnalyzerEnum and has more. Should we provide a way that says "this enum is a subset from MS_1000443 ? Or should we provide mappings for each of these enum permissible values as they match the ontology?

mslarae13 commented 4 months ago

@kheal where is has_calibration defined in the schema? https://github.com/microbiomedata/berkeley-schema-fy24/blame/705-metabolink-standard-definitions/src/schema/nmdc.yaml#:~:text=has_calibration-,%2D,has_calibration,-last%20month

mslarae13 commented 4 months ago

@corilo @kheal

For acquisition_category we have "whether only survey or tandem mass spectra are part of the experiment"

For it's range: AcquisitionCategoryEnum we have

      full_scan:
        aliases:
          - MS      
      tandem_mass_spectrum:
        aliases:
          -MSn
mslarae13 commented 4 months ago

Suggested resolution for https://github.com/microbiomedata/issues/issues/705#issuecomment-2125365745

" AcquisitionStrategyEnum in nmdc berk schema for metabolink. I found this term in bioportal Which seems to fit for acquisition_strategy And it has children data-dependent and data-independent acquisition. I asked Katherine, and she said that the data dependent child fits, and for the children of data-independent acquisition "data independent acquisition from dissociation of full mass range is the "classic" DIA and would capture the most common type of DIA data we would get for proteomics, so I suggest that one as our description for the exisiting permissible value" I want to be sure I'm referencing the ontology correctly in our schema... 8:41 So... 8:44 I'd add MS:1003213 as an exact_mapping for acquisition_strategy .. I'd also like to update the description... "Mode of running a mass spectrometer method by which mass ranges are selected and possibly dissociated." seems better than what we currently have "whether spectra are dependent on or independent of any other spectra during the experiment" ... I hate whether in the description, makes me feel like it's T//F and not a value And for the enum AcquisitionStrategyEnum I can add the descriptions and a mapping for data_independent_acquisition == MS:1003227 & data_dependent_acquisition == MS:1003221 "

From @turbomam

" Is it possible that we would want to use the acquisition_strategy slot in other classes in the future? Do we have other classes that might acquire things? Would we need to document they strategy they used? I'm a little concerned about over-specializing slots with such generic names. One solution would be to make a ms_acquisition_strategy child slot and specialize it. Or just rename acquisition_strategy to ms_acquisition_strategy . "

Decision: Change acquisition_strategy to acquisition_strategy_ms & add the ontology mappings

@kheal @corilo @anastasiyaprymolenna @SamuelPurvine

kheal commented 4 months ago

@kheal where is has_calibration defined in the schema? https://github.com/microbiomedata/berkeley-schema-fy24/blame/705-metabolink-standard-definitions/src/schema/nmdc.yaml#:~:text=has_calibration-,%2D,has_calibration,-last%20month

has_calibration is defined here: https://github.com/microbiomedata/berkeley-schema-fy24/blob/8a3f51b16c725000b06b5a5bc4b21824c2d48144/src/schema/workflow_execution_activity.yaml#L527 . We are aware this is an odd placement, but we need it in this yaml until ingest - see this PR for follow ups: https://github.com/microbiomedata/berkeley-schema-fy24/pull/133

kheal commented 4 months ago

Decision: Change acquisition_strategy to acquisition_strategy_ms

I have no problem changing acquisition_strategy to be more specific and I see @turbomam 's point! I have a slight preference to spell out mass_spectrometry bc I don't think ms is a well known acronym.

If we move forward with this, we should update all these names: acquisition_strategy -> mass_spectrometry_acquisition_strategy AcquisitionCategoryEnum -> MassSpectrometryAcquisitionCategoryEnum acquisition_category -> mass_spectrometry_acquisition_category AcquisitionStrategyEnum -> MassSpectrometryAcquisitionStrategyEnum

Now is the time to do this bc we don't need a migration since these classes are not ingested yet.

kheal commented 4 months ago

@corilo, @SamuelPurvine et al agreed to collapse these slots for simplicity:

  1. Delete acquisition_category slot and AcquisitionCategoryEnum
  2. Add full scan only as a permissible value to AcquisitionStrategyEnum

Resolves https://github.com/microbiomedata/issues/issues/705#issuecomment-2140081596 See https://github.com/microbiomedata/nmdc-schema/issues/2018