microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Create declarative mapping file for NMDC-to-NCBI push #1940

Closed sujaypatil96 closed 2 months ago

sujaypatil96 commented 3 months ago

nmdc-schema already has a mapping file from "NCBI Postgres database column to NMDC slot names": https://github.com/microbiomedata/nmdc-schema/blob/main/assets/ncbi_mappings/ncbi_pg_db_field_mappings_filled.tsv. We need a similar declarative mapping file that maps "NMDC slot names to all NCBI BioSample harmonized names".

The NCBI BioSample Attributes XML file can be found here: https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/

The declarative mapping file, a TSV file similar to ncbi_pg_db_field_mappings_filled.tsv, will map NMDC schema slots (from the classes Biosample, Extraction, LibraryPreparation, OmicsProcessing, and DataObject) to values in the harmonized_name/synonym/title column of the attributes table linked above. Because the range of some measurement slots in the NMDC schema follows an object pattern (as opposed to the scalar values used in NCBI and MIxS), we need to compose the slots using dot notation.

Example:

| nmdc_schema_class | nmdc_schema_slot | nmdc_schema_slot_range | ncbi_biosample_attribute_name | static_value | ignore |
| --- | --- | --- | --- | --- | --- |
| Biosample | depth | QuantityValue | depth | | |
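
As a rough illustration of how such a file might be consumed, here is a minimal Python sketch that reads the example row into a lookup keyed by class and slot. The column names come from the example above; the `load_mapping` helper and the inline `MAPPING_TSV` content are hypothetical, and the empty `static_value` and `ignore` cells are simply carried through because their semantics are not defined here.

```python
import csv
import io

# Hypothetical file content: the header and single row from the example above.
MAPPING_TSV = (
    "nmdc_schema_class\tnmdc_schema_slot\tnmdc_schema_slot_range\t"
    "ncbi_biosample_attribute_name\tstatic_value\tignore\n"
    "Biosample\tdepth\tQuantityValue\tdepth\t\t\n"
)


def load_mapping(tsv_text):
    """Return {(nmdc_schema_class, nmdc_schema_slot): row} for each TSV row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        (row["nmdc_schema_class"], row["nmdc_schema_slot"]): row
        for row in reader
    }


mapping = load_mapping(MAPPING_TSV)
# Look up the NCBI BioSample attribute name for Biosample.depth.
print(mapping[("Biosample", "depth")]["ncbi_biosample_attribute_name"])
```

Keying on the (class, slot) pair rather than the slot alone is an assumption; it keeps rows apart if the same slot name were to appear on more than one class.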

Generic handling for nested data types in NMDC schema (to compose flat NCBI BioSample field values):

{slot_name} is the name of a slot in the NMDC schema.
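
One possible reading of the dot notation, sketched in Python: a dotted path such as `depth.has_raw_value` is walked key by key through the nested record to produce a flat value. The `resolve_path` helper and the sample record are hypothetical, and the sub-slot names used (`has_raw_value`, `has_numeric_value`, `has_unit`) are assumed to be the ones available on a QuantityValue.

```python
def resolve_path(record, dotted_path):
    """Follow a dot-separated slot path through nested dicts.

    Returns None when any step of the path is missing.
    """
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical Biosample record with an object-valued measurement slot.
biosample = {
    "depth": {
        "has_raw_value": "5 m",
        "has_numeric_value": 5.0,
        "has_unit": "m",
    }
}

flat_depth = resolve_path(biosample, "depth.has_raw_value")
missing = resolve_path(biosample, "depth.no_such_slot")
```

Returning None for a missing step, instead of raising, is a design choice for the sketch; a real exporter might prefer to report which records lack a mapped value.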

Note: This is the first of a series of tasks that need to be completed in order to address https://github.com/microbiomedata/nmdc-runtime/issues/503

sujaypatil96 commented 3 months ago

We will need special handling for host_taxid in NMDC.

Example:

host_tax_id: {
  "has_raw_value": "Homo sapiens [NCBITaxon:9606]",
  "term": {
    "id": "NCBITaxon:9606",
    "name": "Homo sapiens"
  }
}

For MIxS and NCBI, we will need to map host_taxid to host_tax_id.term.id and strip the CURIE prefix so that only the integer part remains, i.e., 9606. The other term, host (a MIxS 6.0 term), will need to be retrieved from host_tax_id.term.name.
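
The special case described above could be sketched in Python as follows. The `host_fields` helper is hypothetical, and it assumes the term id is always a CURIE of the form `prefix:integer`; anything else yields None for the taxid rather than a guess.

```python
def host_fields(host_tax_id):
    """Derive flat host_taxid and host values from a nested host_tax_id object."""
    term = host_tax_id.get("term", {})
    curie = term.get("id", "")
    # Split "NCBITaxon:9606" into its prefix and local id.
    _prefix, _sep, local_id = curie.partition(":")
    taxid = int(local_id) if local_id.isdigit() else None
    return {"host_taxid": taxid, "host": term.get("name")}


# The example object from this comment.
example = {
    "has_raw_value": "Homo sapiens [NCBITaxon:9606]",
    "term": {"id": "NCBITaxon:9606", "name": "Homo sapiens"},
}

fields = host_fields(example)
```

Reading the id from `term.id` rather than parsing `has_raw_value` follows the mapping proposed in this comment and avoids depending on the free-text format of the raw value.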

turbomam commented 3 months ago

> We will need special handling for host_tax_id in NMDC.

I was expecting that this would be a common pattern. Is host_tax_id the only place you've seen it so far?

sujaypatil96 commented 3 months ago

Yup, host_taxid is the only slot I've noticed this inconsistency on so far. Perhaps you can help me look through the results of automated exact matching and see if I'm missing something? I will have a PR out soon.