microbiomedata / issues

public repo for issues related to NMDC work

Milestone - Deploy support for automated data staging non JGI/EMSL data (1.6) #867

Closed ssarrafan closed 3 months ago

ssarrafan commented 3 months ago

Automated workflow execution to support strategic partners

For environmental microbiome data generated outside of the two DOE User Facilities to be incorporated into the NMDC Data Portal, all workflows must be executed on the raw multi-omics data. This involves running NMDC workflows on computing infrastructure managed by external partner sites, including NEON and NASA’s GeneLab (see Letters of Support). These two partnerships are teaching us how to build the capabilities necessary to engage a wider audience. The NMDC must automate data staging, submission, and all post-processing workflow outputs, along with high-performance data transfer mechanisms to facilitate data movement between external resources and the central NMDC infrastructure. We will implement a solution to enable automated data staging (Figure 4, D; Milestone 1.6), workflow execution (Figure 4, B; Milestone 1.7), and a trigger for pushing processed data to the Data Portal via the NMDC runtime orchestration services (Figure 4, G; Milestone 1.8), effectively expanding the capabilities developed for User Facility data to support the additional needs for a broader set of environmental microbiome data. As described above for Submission Portal integration across the JGI & EMSL, strategic partners will similarly leverage the Submission Portal for metadata collection and validation.

@emileyfadrosh who should own this? are we still doing this?

mslarae13 commented 3 months ago

@shreddd @aclum add summary & close?

ssarrafan commented 3 months ago

Planning meeting notes from today:
Shreyas - this is already in place.
Alicia - we already did this for the NEON data. Pulled data in externally through an ETL script, ran workflows, and the data is now in the data portal. This has been demonstrated by the NEON data that's in the data portal now. WDL supports URL-based access now. ETL scripts are taking advantage of new technology. This has been demonstrated for NEON and TRiP.
@aclum will close the ticket with more details.

aclum commented 3 months ago

We have 4 studies in NMDC's data portal (the NEON soil, benthic, and surface water studies, and TRiP) which demonstrate bringing in non-user-facility data. These studies use ETL scripts to generate records in Mongo, which the workflows use to trigger work. Workflows for these studies have been run, leveraging Cromwell's native support for HTTP files and/or data staging tasks, and their outputs have been ingested into the data portal via the standard workflow automation and data ingest processes.
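
For readers who have not seen this pattern, here is a rough Python sketch of the flow described above: an ETL step writes records, and the automation treats any record without a matching workflow run as pending work. This is not the NMDC implementation. The in-memory store stands in for the Mongo collections, and every name in it (`InMemoryStore`, `etl_register`, the record fields, `example_workflow.input_file`) is hypothetical and chosen only for illustration. The one grounded detail is that the raw file is passed to the workflow as an HTTP URL instead of a staged local path.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Stand-in for the database; the thread says the real records live in Mongo."""
    data_objects: list = field(default_factory=list)
    workflow_runs: list = field(default_factory=list)


def etl_register(store, study_id, raw_urls):
    """ETL step: record each externally hosted raw file as a data object."""
    for i, url in enumerate(raw_urls, start=1):
        # Hypothetical record shape, not the NMDC schema.
        store.data_objects.append(
            {"id": f"{study_id}-raw-{i}", "study": study_id, "url": url}
        )


def pending_inputs(store):
    """Return data objects that no workflow run has consumed yet."""
    done = {run["input_id"] for run in store.workflow_runs}
    return [obj for obj in store.data_objects if obj["id"] not in done]


def build_workflow_inputs(obj, workflow_name="example_workflow"):
    """Build an inputs mapping that passes the HTTP URL as the file input."""
    return {f"{workflow_name}.input_file": obj["url"]}


def trigger_pending(store):
    """Create one run record per pending data object; return the inputs submitted."""
    submitted = []
    for obj in pending_inputs(store):
        inputs = build_workflow_inputs(obj)
        # A real system would submit to the workflow engine here.
        store.workflow_runs.append({"input_id": obj["id"], "inputs": inputs})
        submitted.append(inputs)
    return submitted
```

Calling `trigger_pending` a second time with no new records returns an empty list, which is the property that lets the ETL and workflow sides run independently.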