microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

Develop NMDC to NCBI export code #503

Closed aclum closed 1 week ago

aclum commented 2 months ago

The goal is to develop an ETL script to convert NMDC submissions to NCBI submissions using version 6.0 MIxS packages. We will start by developing support for the following packages:

Tasks the code should accomplish:

cc @sujaypatil96 @chienchi

cmungall commented 2 months ago

This code should be written in as generic a way as possible. Our metadata should be a superset of the NCBI fields. There should not even be any mapping involved since we use the mixs slots. There is standard generic flattening of object fields (e.g. {value} {unit}).

One decision needs to be made: MIxS loosely recommends units like "centimeters" which is highly non-standard. I proposed using UCUM in 2021 https://github.com/GenomicsStandardsConsortium/mixs/issues/154 but as far as I know this has never been discussed by the CIG or the board.

In NMDC we are moving towards unit symbols/UCUM.

Note that most data that is in NCBI biosample uses unit symbols. I propose that we do not do some kind of awkward expansion, and that we simply submit "5 m" and hope that MIxS catches up.
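For illustration, the generic `{value} {unit}` flattening described in this comment, with units passed through as-is, might look roughly like the sketch below. This is not the code from the linked PR. The function names `flatten_value` and `flatten_record` are hypothetical, and the dictionary keys `has_numeric_value`, `has_unit`, and `has_raw_value` are assumptions about how a quantity is stored.

```python
from typing import Any, Dict


def _format_number(number: Any) -> str:
    """Render 5.0 as "5" while leaving values such as 0.15 untouched."""
    if isinstance(number, float) and number.is_integer():
        return str(int(number))
    return str(number)


def flatten_value(value: Any) -> str:
    """Flatten one slot value into a plain string (hypothetical helper).

    Assumed layout: a quantity is a dict holding `has_numeric_value` and
    `has_unit`, optionally with `has_raw_value`. The unit is emitted exactly
    as stored, with no expansion (e.g. "m" is not turned into "meters").
    """
    if isinstance(value, dict):
        numeric = value.get("has_numeric_value")
        unit = value.get("has_unit")
        if numeric is not None and unit:
            return f"{_format_number(numeric)} {unit}"
        if value.get("has_raw_value") is not None:
            return str(value["has_raw_value"])
    if isinstance(value, list):
        # Multivalued slots: flatten each item and join them.
        return "; ".join(flatten_value(item) for item in value)
    return str(value)


def flatten_record(record: Dict[str, Any]) -> Dict[str, str]:
    """Flatten every slot of a record, keeping the slot names unchanged."""
    return {slot: flatten_value(value) for slot, value in record.items()}


if __name__ == "__main__":
    sample = {
        "depth": {"has_numeric_value": 5.0, "has_unit": "m"},
        "env_medium": "soil",
    }
    print(flatten_record(sample))
```

Because slot names are reused unchanged as output keys, no per-field mapping table is needed, which is the point made above about NMDC metadata being a superset of the NCBI fields.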

aclum commented 2 months ago

@cmungall the data in mongo is heterogeneous currently wrt units. I don't want this work to be blocked on UCUM adoption since that is months away. Is your proposal that the export code handle converting to UCUM or that we submit units as they are currently in the schema or something else?

cmungall commented 2 months ago

Correct, I propose that we do not do any kind of awkward expansion, and submit as-is. This means that strictly speaking we are going against MIxS guidelines but hopefully this is temporary.

sujaypatil96 commented 2 months ago

Of the classes mentioned in the issue description (Biosample, Extraction, LibraryPreparation, OmicsProcessing, DataObject), attributes from Biosample map to XML attributes in <BioSample>, and attributes from DataObject map to XML attributes in <AddFiles>. In addition, attributes from Study map to XML attributes in <BioProject> (in submission.xml).

What would attributes from the lab processing classes/slots map to?
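As a rough sketch of the class-to-block routing described above (not the code from the linked PR), the mapping could be held in a small lookup table. The names `BLOCK_FOR_CLASS` and `build_block` are hypothetical, and the inner `Attribute` element layout is an illustrative assumption rather than a statement of NCBI's submission schema. Lab processing classes are deliberately absent from the table, since their target is the open question here.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical routing table: NMDC class name -> block in submission.xml.
# Only the three mappings identified in this comment are listed.
BLOCK_FOR_CLASS: Dict[str, str] = {
    "Study": "BioProject",
    "Biosample": "BioSample",
    "DataObject": "AddFiles",
}


def build_block(nmdc_class: str, attributes: Dict[str, str]) -> ET.Element:
    """Build one XML block for an NMDC object (illustrative layout only)."""
    tag = BLOCK_FOR_CLASS.get(nmdc_class)
    if tag is None:
        # e.g. Extraction or LibraryPreparation: no target decided yet.
        raise KeyError(f"no submission.xml block known for {nmdc_class}")
    block = ET.Element(tag)
    for name, value in attributes.items():
        # Assumed child layout; the real schema may differ.
        child = ET.SubElement(block, "Attribute", {"attribute_name": name})
        child.text = value
    return block


if __name__ == "__main__":
    element = build_block("Biosample", {"depth": "5 m"})
    print(ET.tostring(element, encoding="unicode"))
```

Raising on an unmapped class, rather than silently skipping it, keeps the unresolved lab processing question visible during development.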

aclum commented 1 month ago

In progress, moving to the next sprint.

sujaypatil96 commented 1 month ago

Checkpoint for squad meeting on 5/7: NMDC object/NCBI submission.xml mappings identified, and start of dagster harness set up. We have mappings ready to produce block and block in NCBI submission.xml.

aclum commented 1 month ago

Actively in progress, moving to the next sprint.

ssarrafan commented 1 month ago

@chienchi @sujaypatil96 @aclum is this still actively being worked on? Any updates?

ssarrafan commented 1 month ago

Removing from sprint, no updates in 2 weeks, no response

sujaypatil96 commented 1 month ago

@ssarrafan I am actively working on developing the code for this issue, could we add this to the next sprint board please? There were a couple of blockers which needed some conversations, but work is being pushed up very actively on the linked PR.

sujaypatil96 commented 3 weeks ago

Related issues:

aclum commented 2 weeks ago

Active, moving to the next sprint.