Closed sujaypatil96 closed 2 months ago
We will need special handling for host_taxid in NMDC.
Example:
host_tax_id: {
  "has_raw_value": "Homo sapiens [NCBITaxon:9606]",
  "term": {
    "id": "NCBITaxon:9606",
    "name": "Homo sapiens"
  }
}
In MIxS and NCBI, we will need to map host_taxid to host_tax_id.term.id and split it to keep only the integer part, i.e., 9606. The other term, host (a MIxS 6.0 term), will need to be retrieved from host_tax_id.term.name.
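A minimal sketch of the split described above, assuming the term id is always a CURIE of the form NCBITaxon:<integer>. The function name split_host_tax_id and the output keys are hypothetical and only for illustration; they are not part of nmdc-schema or nmdc-runtime.

```python
def split_host_tax_id(host_tax_id: dict) -> dict:
    """Derive flat host_taxid and host values from a nested NMDC-style value.

    Assumption: host_tax_id["term"]["id"] looks like "NCBITaxon:9606".
    """
    term = host_tax_id["term"]
    # Split the CURIE into its prefix and local identifier.
    prefix, _, local_id = term["id"].partition(":")
    if prefix != "NCBITaxon" or not local_id.isdigit():
        raise ValueError(f"unexpected taxon id: {term['id']!r}")
    return {"host_taxid": int(local_id), "host": term["name"]}


example = {
    "has_raw_value": "Homo sapiens [NCBITaxon:9606]",
    "term": {"id": "NCBITaxon:9606", "name": "Homo sapiens"},
}
flat = split_host_tax_id(example)
print(flat)
```

Raising on an unexpected prefix is one possible choice; a real exporter might prefer to log and skip such records instead.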
We will need special handling for host_tax_id in NMDC.
I was expecting that this would be a common pattern. Is host_tax_id the only place you've seen it so far?
Yup, host_taxid is the only slot I've noticed this inconsistency on so far. Perhaps you can help me look through the results of automated exact matching and see if I'm missing something? I will have a PR out soon.
Similar to the "NCBI Postgres database column to NMDC slot names" mapping file that we already have in nmdc-schema (https://github.com/microbiomedata/nmdc-schema/blob/main/assets/ncbi_mappings/ncbi_pg_db_field_mappings_filled.tsv), we need another declarative mapping file that maps "NMDC slot names to all NCBI BioSample harmonized names". The NCBI BioSample Attributes XML file can be found here: https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/
The declarative mapping file, which will also be a TSV file similar to ncbi_pg_db_field_mappings_filled.tsv, will map NMDC schema slots (from the classes Biosample, Extraction, LibraryPreparation, OmicsProcessing, DataObject) to values in the harmonized_name/synonym/title column from the above table. Since the range of some of the measurement slots in the NMDC schema follows an object pattern (as opposed to scalar values in NCBI and MIxS), we need to compose the slots using dot notation.
Example:
Generic handling for nested data types in NMDC schema (to compose flat NCBI BioSample field values):
{slot_name}.has_raw_value
where {slot_name} is the name of a slot in the NMDC schema.
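One way the dot notation could be resolved against a nested record is sketched below. The helper names resolve_path and flatten_record, the in-memory mapping, and the sample record are all hypothetical; the real mapping would come from the TSV file, whose column layout is not specified here.

```python
from typing import Any, Dict, Optional


def resolve_path(record: Dict[str, Any], path: str) -> Optional[Any]:
    """Follow a dotted path such as 'depth.has_raw_value' through nested dicts.

    Returns None if any key along the path is missing.
    """
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def flatten_record(record: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Build flat {NCBI name: value} pairs from {NMDC dotted path: NCBI name}."""
    flat = {}
    for nmdc_path, ncbi_name in mapping.items():
        value = resolve_path(record, nmdc_path)
        if value is not None:
            flat[ncbi_name] = value
    return flat


# Illustrative mapping rows (assumed, not taken from the actual mapping file).
mapping = {
    "depth.has_raw_value": "depth",
    "host_tax_id.term.name": "host",
}
biosample = {
    "depth": {"has_raw_value": "10 m"},
    "host_tax_id": {
        "has_raw_value": "Homo sapiens [NCBITaxon:9606]",
        "term": {"id": "NCBITaxon:9606", "name": "Homo sapiens"},
    },
}
flat = flatten_record(biosample, mapping)
print(flat)
```

Keeping the path resolution generic means that slots needing extra processing, such as the integer split for host_taxid, can be handled as separate special cases on top of it.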
Note: This is the first of a series of tasks that need to be completed in order to address https://github.com/microbiomedata/nmdc-runtime/issues/503