aclum closed this issue 6 months ago
Adding this to the sprint b/c Sujay finished the surface water work.
Moving to next sprint. @aclum @sujaypatil96
@aclum how do we write a query to get a list of all individual samples?
Check for records where NEON's mms_metagenomeDnaExtraction.dnaSampleID is not the name of an NMDC processed_sample_set document and mms_metagenomeDnaExtraction.sequenceAnalysisType equals (marker gene and metagenomics OR metagenomics).
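A minimal sketch of that check, assuming the extraction table and the processed_sample_set names have already been loaded into memory. The row values, the `missing_dna_samples` helper and the sample IDs are hypothetical, only the field names come from the comment above.

```python
# Hypothetical rows standing in for NEON's mms_metagenomeDnaExtraction table.
extraction_rows = [
    {"dnaSampleID": "SITE_001-DNA1", "sequenceAnalysisType": "metagenomics"},
    {"dnaSampleID": "SITE_002-DNA1", "sequenceAnalysisType": "marker gene and metagenomics"},
    {"dnaSampleID": "SITE_003-DNA1", "sequenceAnalysisType": "marker gene"},
    {"dnaSampleID": "SITE_004-DNA1", "sequenceAnalysisType": "metagenomics"},
]

# Hypothetical names of documents already present in NMDC's processed_sample_set.
processed_sample_names = {"SITE_001-DNA1"}

# The two sequenceAnalysisType values named in the requirement.
METAGENOME_TYPES = {"marker gene and metagenomics", "metagenomics"}


def missing_dna_samples(rows, known_names):
    """Return sorted dnaSampleIDs that are metagenome records but are not
    the name of any processed_sample_set document."""
    return sorted(
        {
            row["dnaSampleID"]
            for row in rows
            if row["sequenceAnalysisType"] in METAGENOME_TYPES
            and row["dnaSampleID"] not in known_names
        }
    )


missing = missing_dna_samples(extraction_rows, processed_sample_names)
print(missing)
```

With the made-up rows above, only the two metagenome records that have no matching processed sample name are reported.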
This issue needs to be moved to the next sprint.
I still don't think I fully follow this @aclum. I generated a list by getting all the dnaSampleID values from a combined mms_metagenomeDnaExtraction table (from NEON) and checking which dnaSampleID values from that list are/are not in processed_sample_set. Here is that list: neon_soil_individual_sample_dnaSampleID.txt
Is that correct?
According to the documentation (NEON_metagenomes_userGuide_vE.pdf):
The individual samples used to generate the pooled metagenomics samples are found as a pipe-delimited string in the field genomicsPooledIDList located in the data table sls_metagenomicsPooling, which is part of the Soil Physical Properties (distributed periodic) data product (DP1.10086).
Are there no extraction, library preparation, or omics processing records associated with any "individual" biosamples that are part of pooling processes?
Discussed over Slack as well: this list pulls in samples which have gone through pooling. Updated requirements are below:
Join mms_metagenomeSequencing.dnaSampleID = mms_metagenomeDnaExtraction.dnaSampleID, then join mms_metagenomeDnaExtraction.genomicsSampleID = sls_metagenomicsPooling.genomicsSampleID. If a dnaSampleID in mms_metagenomeSequencing does not track back to a record in the pooling table, we need to generate the missing records.
To do this, join mms_metagenomeSequencing.dnaSampleID = mms_metagenomeDnaExtraction.dnaSampleID to get mms_metagenomeDnaExtraction.genomicsSampleID, then join mms_metagenomeDnaExtraction.genomicsSampleID = sls_soilCoreCollection.geneticSampleID to get sls_soilCoreCollection.sampleID.
Use sls_soilCoreCollection.sampleID = nmdc.biosample_set.name to get the NMDC biosample ID. This NMDC ID is has_input to an extraction_set record, which generates a processed sample, which is has_input to a library_preparation record, which generates a processed sample, which is has_input to the omics_processing_set record, which has_output data objects.
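A rough sketch of the join logic described above, using plain Python dicts in place of the real NEON tables. The table and field names come from the comment, but every row value and the `find_individual_samples` helper are hypothetical, and this is not necessarily how the actual query was written.

```python
# Hypothetical miniature versions of the four NEON tables.
mms_metagenomeSequencing = [
    {"dnaSampleID": "DNA-A"},
    {"dnaSampleID": "DNA-B"},
]
mms_metagenomeDnaExtraction = [
    {"dnaSampleID": "DNA-A", "genomicsSampleID": "GEN-A-COMP"},
    {"dnaSampleID": "DNA-B", "genomicsSampleID": "GEN-B"},
]
sls_metagenomicsPooling = [
    {"genomicsSampleID": "GEN-A-COMP", "genomicsPooledIDList": "CORE-1|CORE-2"},
]
sls_soilCoreCollection = [
    {"geneticSampleID": "GEN-B", "sampleID": "CORE-3"},
]


def find_individual_samples(sequencing, extraction, pooling, cores):
    """Return (dnaSampleID, sampleID) pairs for sequenced samples that do
    not track back to a record in the pooling table."""
    # Join key 1: sequencing.dnaSampleID = extraction.dnaSampleID
    genomics_by_dna = {r["dnaSampleID"]: r["genomicsSampleID"] for r in extraction}
    # Join key 2: extraction.genomicsSampleID = pooling.genomicsSampleID
    pooled_ids = {r["genomicsSampleID"] for r in pooling}
    # Join key 3: extraction.genomicsSampleID = soilCoreCollection.geneticSampleID
    sample_by_genetic = {r["geneticSampleID"]: r["sampleID"] for r in cores}

    result = []
    for row in sequencing:
        genomics_id = genomics_by_dna.get(row["dnaSampleID"])
        if genomics_id is None:
            continue  # no extraction record at all; out of scope for this sketch
        if genomics_id in pooled_ids:
            continue  # pooled sample, already handled
        # sampleID is None when the soil core table has no matching row.
        result.append((row["dnaSampleID"], sample_by_genetic.get(genomics_id)))
    return result


individual = find_individual_samples(
    mms_metagenomeSequencing,
    mms_metagenomeDnaExtraction,
    sls_metagenomicsPooling,
    sls_soilCoreCollection,
)
print(individual)
```

Each returned sampleID would then be matched against nmdc.biosample_set.name to find (or create) the biosample that starts the extraction, library preparation and omics processing chain.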
@ssarrafan this needs to be moved to the next sprint; it will not be complete by 2/23.
Moving another one to the next sprint @aclum
The records that need to be fixed in the NEON soil data product:
Individual samples that have already been ingested into NMDC mongo (create Extraction, LibraryPreparation, OmicsProcessing) - individual_samples_in_db.csv
Individual samples that haven't been ingested into NMDC mongo yet (create Biosample and downstream records for these) - individual_samples_not_in_db.csv
@sujaypatil96 Can you check the logic? ABBY_004-M-20170605-COMP should not be in either of these lists, but it is listed in individual_samples_not_in_db.csv. That is a composite sample. You can find that name in NEON.D16.ABBY.DP1.10086.001.sls_metagenomicsPooling.2017-06.*.csv, and it is a pool of these three biosamples: ABBY_004-M-12-34-20170605|ABBY_004-M-32-14.5-20170605|ABBY_004-M-0.5-8-20170605
@aclum ABBY_004-M-20170605-COMP is a weird one that stayed in because we didn't apply post-query record removal/filtering logic (e.g. removal of full duplicates). If you look at that row, it doesn't even have a sampleID (Biosample id) associated with it, so we should be good to remove it.
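For illustration, the post-query filtering could look something like the sketch below: drop rows with no sampleID and drop full duplicates. The `clean_rows` helper, the column layout and the DNA-B/DNA-C row values are hypothetical; only the ABBY_004-M-20170605-COMP name and the sampleID field come from this thread.

```python
# Hypothetical query output: one duplicated row and one row without a sampleID.
query_rows = [
    {"dnaSampleID": "DNA-B", "genomicsSampleID": "GEN-B", "sampleID": "CORE-3"},
    {"dnaSampleID": "DNA-B", "genomicsSampleID": "GEN-B", "sampleID": "CORE-3"},
    {"dnaSampleID": "DNA-C", "genomicsSampleID": "ABBY_004-M-20170605-COMP", "sampleID": None},
]


def clean_rows(rows):
    """Drop rows lacking a sampleID, then drop exact (full-row) duplicates,
    keeping the first occurrence and preserving order."""
    seen = set()
    cleaned = []
    for row in rows:
        if not row.get("sampleID"):
            continue  # no Biosample id to attach downstream records to
        key = tuple(sorted(row.items()))  # hashable fingerprint of the whole row
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


cleaned = clean_rows(query_rows)
print(cleaned)
```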
Here is the first JSON file with Extraction, LibraryPreparation, OmicsProcessing and DataObject records for biosamples that are already in the database.
neon_individual_samples_in_db.json
I'm still working on making the JSON file for the second set of records, i.e., individual samples which do not have Biosample (and downstream) records in the database.
This issue will be ready to be closed once those two JSON files are ready, have been reviewed and submitted to Mongo.
This issue will need to be moved to the next sprint @ssarrafan. I see it getting done in the first couple of days of the next sprint.
Here is the JSON file that has been created for individual samples that are not in the database. We have created biosample, extraction, library preparation, omics processing and data object instances for these records. neon_individual_samples_not_in_db.json
Both the JSON files linked in this issue have been submitted and ingested into Mongo (confirmed/verified by looking for these records in the database). I think this issue is ready to be closed @aclum.