microbiomedata / issues

public repo for issues related to NMDC work

NEON - soil metagenome individual samples sequenced #432

Closed aclum closed 6 months ago

aclum commented 1 year ago

Deliverable this task is associated with

See Deliverables tab here:

RACI

Tag people in their roles

Describe the task

Criteria for completion

Estimate people time

Completion Date (Goal)

Target Sprint Start & End Dates

Tag Blocker/Contingent upon issues

aclum commented 8 months ago

Adding this to the sprint b/c Sujay finished the surface water work.

ssarrafan commented 8 months ago

Moving to next sprint. @aclum @sujaypatil96

sujaypatil96 commented 8 months ago

@aclum how do we write a query to get a list of all individual samples?

aclum commented 8 months ago

Check for records where NEON's mms_metagenomeDnaExtraction.dnaSampleID is not the name of an NMDC processed_sample_set document and mms_metagenomeDnaExtraction.sequenceAnalysisType equals either "marker gene and metagenomics" or "metagenomics".
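A rough sketch of that check, in case it helps. It assumes the NEON extraction table has already been loaded as a list of dicts (for example with csv.DictReader) and that the names of the NMDC processed_sample_set documents are available as a plain list. The function name and the sample values are made up for illustration; only the two NEON field names come from the description above.

```python
# Hypothetical helper: the field names follow the NEON table, everything else is assumed.
METAGENOMIC_TYPES = {"metagenomics", "marker gene and metagenomics"}


def candidate_individual_samples(extraction_rows, processed_sample_names):
    """Return dnaSampleIDs sequenced for metagenomics that are not processed sample names."""
    known = set(processed_sample_names)
    seen = set()
    result = []
    for row in extraction_rows:
        dna_id = row.get("dnaSampleID")
        if not dna_id or dna_id in seen:
            continue  # skip blank IDs and IDs already reported
        if row.get("sequenceAnalysisType") not in METAGENOMIC_TYPES:
            continue  # e.g. marker-gene-only extractions
        if dna_id in known:
            continue  # already the name of a processed_sample_set document
        seen.add(dna_id)
        result.append(dna_id)
    return result


# Made-up rows, only to show the shape of the input.
extraction_rows = [
    {"dnaSampleID": "EXAMPLE-DNA1", "sequenceAnalysisType": "metagenomics"},
    {"dnaSampleID": "EXAMPLE-DNA2", "sequenceAnalysisType": "marker gene"},
    {"dnaSampleID": "EXAMPLE-DNA3", "sequenceAnalysisType": "marker gene and metagenomics"},
    {"dnaSampleID": "EXAMPLE-DNA4", "sequenceAnalysisType": "metagenomics"},
]
processed_sample_names = ["EXAMPLE-DNA4"]

missing = candidate_individual_samples(extraction_rows, processed_sample_names)
print(missing)
```

With these made-up rows, only the two metagenomics IDs that are not processed sample names are reported.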

sujaypatil96 commented 8 months ago

This issue needs to be moved to the next sprint.

sujaypatil96 commented 7 months ago

Check for records where NEON's mms_metagenomeDnaExtraction.dnaSampleID is not the name of an NMDC processed_sample_set document and mms_metagenomeDnaExtraction.sequenceAnalysisType equals either "marker gene and metagenomics" or "metagenomics".

I still don't think I fully follow this @aclum. I generated a list by getting a list of all the dnaSampleID values from a combined mms_metagenomeDnaExtraction table (from NEON) and checking which dnaSampleID from that list is/is not in processed_sample_set. Here is that list: neon_soil_individual_sample_dnaSampleID.txt

Is that correct?

sujaypatil96 commented 7 months ago

According to the documentation (NEON_metagenomes_userGuide_vE.pdf):

The individual samples used to generate the pooled metagenomics samples are found as a pipe-delimited string in the field genomicsPooledIDList located in the data table sls_metagenomicsPooling, which is part of the Soil Physical Properties (distributed periodic) data product (DP1.10086).
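For reference, a minimal sketch of unpacking that pipe-delimited field. It assumes the pooling table is loaded as a list of dicts and that the composite sample name sits in a genomicsSampleID column; the function name is hypothetical. The example row reuses the ABBY_004 composite and its three member samples mentioned in this thread.

```python
def pooled_members(pooling_rows):
    """Map each pooled sample ID to the list of individual sample IDs in its pool."""
    members = {}
    for row in pooling_rows:
        raw = row.get("genomicsPooledIDList", "")
        # Split on the pipe delimiter and drop empty fragments.
        ids = [part.strip() for part in raw.split("|") if part.strip()]
        members[row["genomicsSampleID"]] = ids
    return members


pooling_rows = [
    {
        # Assumption: the composite name is stored in genomicsSampleID.
        "genomicsSampleID": "ABBY_004-M-20170605-COMP",
        "genomicsPooledIDList": (
            "ABBY_004-M-12-34-20170605|ABBY_004-M-32-14.5-20170605|ABBY_004-M-0.5-8-20170605"
        ),
    },
]

members = pooled_members(pooling_rows)
print(members["ABBY_004-M-20170605-COMP"])
```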

So are there no extraction, library preparation, or omics processing records associated with any of the "individual" biosamples that go into a pooling process?

aclum commented 7 months ago

Discussed over Slack as well: this list pulls in samples which have gone through pooling. Updated requirements are below:

Join mms_metagenomeSequencing.dnaSampleID = mms_metagenomeDnaExtraction.dnaSampleID, then join mms_metagenomeDnaExtraction.genomicsSampleID = sls_metagenomicsPooling.genomicsSampleID. If a dnaSampleID in mms_metagenomeSequencing does not track back to a record in the pooling table, we need to generate the missing records.

To do this, join mms_metagenomeSequencing.dnaSampleID = mms_metagenomeDnaExtraction.dnaSampleID to get mms_metagenomeDnaExtraction.genomicsSampleID, then join mms_metagenomeDnaExtraction.genomicsSampleID = sls_soilCoreCollection.geneticSampleID to get sls_soilCoreCollection.sampleID. Use sls_soilCoreCollection.sampleID = nmdc.biosample_set.name to get the NMDC biosample ID.

This NMDC ID is has_input to an extraction_set record, which generates a processed sample, which is has_input to library_preparation, which generates a processed sample, which is has_input to the omics_processing_set record, which has_output data objects.
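The join chain above could be sketched roughly like this. It assumes each NEON table is loaded as a list of dicts and that a name-to-ID lookup for biosample_set has been built beforehand; the function name, the lookup argument, and all IDs in the example are hypothetical. It does not build the extraction, library preparation, or omics processing records, it only finds the sequenced samples that never went through pooling.

```python
def trace_unpooled(sequencing, extraction, pooling, soil_cores, biosample_id_by_name):
    """Trace sequenced dnaSampleIDs with no pooling record back to an NMDC biosample."""
    dna_to_genomics = {r["dnaSampleID"]: r["genomicsSampleID"] for r in extraction}
    pooled_ids = {r["genomicsSampleID"] for r in pooling}
    sample_by_genetic_id = {r["geneticSampleID"]: r["sampleID"] for r in soil_cores}

    traced = []
    for row in sequencing:
        dna_id = row["dnaSampleID"]
        genomics_id = dna_to_genomics.get(dna_id)
        if genomics_id is None:
            continue  # no extraction record, so nothing to trace (assumed behaviour)
        if genomics_id in pooled_ids:
            continue  # tracks back to the pooling table, so it is a composite
        sample_id = sample_by_genetic_id.get(genomics_id)
        traced.append({
            "dnaSampleID": dna_id,
            "genomicsSampleID": genomics_id,
            "sampleID": sample_id,
            # None here means no biosample with that name was found.
            "nmdc_biosample_id": biosample_id_by_name.get(sample_id),
        })
    return traced


# Made-up rows and IDs, only to show the expected shapes.
sequencing = [{"dnaSampleID": "DNA-A"}, {"dnaSampleID": "DNA-B"}, {"dnaSampleID": "DNA-C"}]
extraction = [
    {"dnaSampleID": "DNA-A", "genomicsSampleID": "GEN-A-COMP"},
    {"dnaSampleID": "DNA-B", "genomicsSampleID": "GEN-B"},
    {"dnaSampleID": "DNA-C", "genomicsSampleID": "GEN-C"},
]
pooling = [{"genomicsSampleID": "GEN-A-COMP"}]
soil_cores = [
    {"geneticSampleID": "GEN-B", "sampleID": "SAMPLE-B"},
    {"geneticSampleID": "GEN-C", "sampleID": "SAMPLE-C"},
]
biosample_id_by_name = {"SAMPLE-B": "nmdc:example-biosample-b"}

traced = trace_unpooled(sequencing, extraction, pooling, soil_cores, biosample_id_by_name)
for record in traced:
    print(record)
```

In this sketch, a traced row with an nmdc_biosample_id only needs the downstream records, while a row where it is None would also need a new biosample.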

sujaypatil96 commented 7 months ago

@ssarrafan needs to be moved to next sprint, it will not be complete by 2/23.

ssarrafan commented 7 months ago

Moving another one to the next sprint @aclum

sujaypatil96 commented 7 months ago

The records that need to be fixed in the NEON soil data product:

Individual samples that have already been ingested into NMDC mongo (create Extraction, LibraryPreparation, OmicsProcessing) - individual_samples_in_db.csv

Individual samples that haven't been ingested into NMDC mongo yet (create Biosample and downstream records for these) - individual_samples_not_in_db.csv

aclum commented 7 months ago

@sujaypatil96 Can you check the logic? ABBY_004-M-20170605-COMP should not be in either of these lists, but it is listed in individual_samples_not_in_db.csv. That is a composite sample. You can find that name in NEON.D16.ABBY.DP1.10086.001.sls_metagenomicsPooling.2017-06.*.csv; it is a pool of these three biosamples: ABBY_004-M-12-34-20170605|ABBY_004-M-32-14.5-20170605|ABBY_004-M-0.5-8-20170605

sujaypatil96 commented 7 months ago

@aclum ABBY_004-M-20170605-COMP is a weird one that stayed in because we didn't apply any post-query record removal/filtering logic (e.g. removing full duplicates). If you look at that row, it doesn't even have a sampleID (Biosample ID) associated with it, so we should be good to remove it.
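Something along these lines is what the post-query filtering could look like. This is only a sketch, assuming the query result is a list of dicts with string values; the function name and every value other than the ABBY_004 composite name are hypothetical.

```python
def clean_rows(rows):
    """Drop rows with no sampleID and remove full duplicates, keeping first occurrences."""
    seen = set()
    kept = []
    for row in rows:
        if not row.get("sampleID"):
            continue  # no Biosample name to link to, e.g. a composite that slipped through
        key = tuple(sorted(row.items()))  # whole-row key, so only full duplicates match
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


rows = [
    {"dnaSampleID": "DNA-B", "genomicsSampleID": "GEN-B", "sampleID": "SAMPLE-B"},
    {"dnaSampleID": "DNA-B", "genomicsSampleID": "GEN-B", "sampleID": "SAMPLE-B"},
    {"dnaSampleID": "DNA-X", "genomicsSampleID": "ABBY_004-M-20170605-COMP", "sampleID": ""},
]

cleaned = clean_rows(rows)
print(cleaned)
```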

sujaypatil96 commented 7 months ago

Here is the first JSON file, with Extraction, LibraryPreparation, OmicsProcessing and DataObject records for biosamples that are already in the database: neon_individual_samples_in_db.json

sujaypatil96 commented 7 months ago

I'm still working on making the JSON file for the second set of records, i.e., individual samples which do not have Biosample (and downstream) records in the database.

This issue will be ready to be closed once those two JSON files are ready, have been reviewed and submitted to Mongo.

sujaypatil96 commented 7 months ago

This issue will need to be moved to the next sprint @ssarrafan. I see it getting done in the first couple of days of the next sprint.

sujaypatil96 commented 6 months ago

Here is the JSON file that has been created for individual samples that are not in the database. We have created biosample, extraction, library preparation, omics processing and data object instances for these records. neon_individual_samples_not_in_db.json

sujaypatil96 commented 6 months ago

Both JSON files linked in this issue have been submitted and ingested into Mongo (verified by looking for these records in the database). I think this issue is ready to be closed @aclum.