microbiomedata / issues

public repo for issues related to NMDC work

migrate creation of omics_processing_set records to workflows #132

Open aclum opened 1 year ago

aclum commented 1 year ago

Based on discussions in the 3/15/23 NMDC sync meeting, the workflows team will take over creation of the omics_processing_set records for ingest of JGI data using GOLD. To tell these records apart when samples were processed in the lab more than once for the same JGI sequencing project, NMDC will use has_outputs to specify an NMDC data object identifier for the raw sequencing file(s). @sujaypatil96 has code in the sample-annotator repo which populates most of the omics_processing_set records with metadata from the GOLD API projects endpoint. What needs to change is the point in the process where this happens: Sujay will no longer create these records because he doesn't know how many to make. Data object identifiers need to be minted for the raw files before the omics_processing_set record is created, and the code that creates the omics_processing_set record needs to be updated to include the data object for the raw data.
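A minimal sketch of the proposed ordering, to make the dependency explicit: mint one data object identifier per raw file first, then build the omics_processing_set record that points at it. Everything here is an assumption for illustration: `mint_data_object_id` is a hypothetical stand-in for the real minting step, the identifier format is a placeholder, and the record keys other than has_outputs are invented. The exact slot names should be checked against the NMDC schema.

```python
from itertools import count

# Hypothetical counter so the placeholder IDs are unique within one run.
_placeholder_ids = count(1)


def mint_data_object_id() -> str:
    """Hypothetical stand-in for the real NMDC identifier minting step."""
    return f"placeholder:dobj-{next(_placeholder_ids)}"


def build_omics_processing_record(gold_project_id: str,
                                  gold_metadata: dict,
                                  raw_file_name: str) -> tuple[dict, dict]:
    """Mint the raw data object first, then build the record that references it.

    Returns (omics_processing_record, raw_data_object). Key names are
    illustrative only, apart from has_outputs, which is the name used in
    this issue.
    """
    # Step 1: the raw file gets its identifier before the record exists.
    raw_data_object = {"id": mint_data_object_id(), "name": raw_file_name}

    # Step 2: the record carries the GOLD metadata plus the raw data object ID,
    # which is what distinguishes two lab runs of the same sequencing project.
    record = dict(gold_metadata)
    record["gold_sequencing_project"] = gold_project_id
    record["has_outputs"] = [raw_data_object["id"]]
    return record, raw_data_object
```

With this ordering, two libraries sequenced for the same GOLD project produce two records that share the GOLD metadata but differ in has_outputs.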

Currently the JGI data portal returns all raw data, even data that wasn't part of the analysis. Filtered data is correctly associated, and the prefix of the filtered file name can be used to determine the correct raw record. For example, GOLD sequencing project Gp0503309 has two analysis projects, Ga0451494 and Ga0485314. The filtered data for Ga0451494 is 52433.4.332751.AAGAAGGC-GCCTTCTT.filter-METAGENOME.fastq.gz, so the correct raw data file is 52433.4.332751.AAGAAGGC-GCCTTCTT.fastq.gz. The filtered data for Ga0485314 is 52554.1.382511.AAAGGCGT-GGAGTTGA.filter-METAGENOME.fastq.gz, so the correct raw data file is 52554.1.382511.AAAGGCGT-GGAGTTGA.fastq.gz.

Metatranscriptomes will have to be handled differently, as they don't have analysis project identifiers for the filtered data. This information can be grepped out of the README, but that seems brittle, and perhaps we should wait until there is better support from the JGI. An interim solution would be to provide that information offline based on a jamo query.
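The metagenome file name matching described above could be sketched roughly as follows. This assumes the filtered file name always contains the literal `.filter-METAGENOME` marker, as in the two examples; the function names and the idea of passing in a plain list of raw file names from the portal are hypothetical, and metatranscriptomes are not covered.

```python
FILTER_MARKER = ".filter-METAGENOME"  # assumed from the example file names


def raw_name_from_filtered(filtered_name: str) -> str:
    """Derive the raw file name by dropping the filter marker."""
    if FILTER_MARKER not in filtered_name:
        raise ValueError(f"no {FILTER_MARKER!r} marker in {filtered_name!r}")
    prefix, suffix = filtered_name.split(FILTER_MARKER, 1)
    return prefix + suffix


def select_raw_file(filtered_name: str, portal_raw_names: list[str]) -> str:
    """Pick the one raw file that belongs to the given filtered file.

    The portal may return raw files that were not part of the analysis,
    so require exactly one match and fail loudly otherwise.
    """
    expected = raw_name_from_filtered(filtered_name)
    matches = [name for name in portal_raw_names if name == expected]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one raw file named {expected!r}, found {len(matches)}"
        )
    return matches[0]
```

For Gp0503309, calling `select_raw_file` with the Ga0451494 filtered file name and both raw file names picks out 52433.4.332751.AAGAAGGC-GCCTTCTT.fastq.gz and ignores the other library's raw file.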

cc @scanon @Michal-Babins

aclum commented 1 year ago

JGI's data portal now includes the sequencing read inputs in the JSON return document. @sujaypatil96 will meet with @scanon, @Michal-Babins, and @mflynn-lanl in the coming weeks to determine the best way to generate omics processing records for data coming from JGI/GOLD.

ssarrafan commented 1 month ago

@aclum @hubin-keio @sujaypatil96 does this still need to happen? Was there a meeting? Can we close this one?

aclum commented 1 month ago

This is still outstanding and is needed to handle projects with multiple sequencing libraries.

ssarrafan commented 1 month ago

> This is still outstanding and is needed to handle projects with multiple sequencing libraries.

Should this be re-assigned to someone else?

aclum commented 1 month ago

I'll move this to nmdc automation and reassign to Michael.