theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[New workflow] Workflow to create Terra table after data upload to Google bucket #259

Closed emmadoughty closed 2 months ago

emmadoughty commented 9 months ago

:cool:

:pushpin: Explain the Request

Terra users often find it difficult and time-consuming to create the metadata TSV that the Data Uploader requires. Ideally, the Data Uploader would take in data files and automatically identify forward and reverse reads and FASTA files. This is not natively supported on Terra at the moment, so a suggested workaround is to upload the files to a Google bucket via the Terra Data Uploader, then run a workflow that lists the files in the bucket and uses their names to organise them into a Terra table.

:books: Context

Data upload to Terra from a local computer

:chart_with_upwards_trend: Desired Behavior

Create a Terra table after uploading data to a Google bucket, without needing to manually create a metadata TSV

:information_source: Additional Information

Ideas from Danny: a WDL task that

- is given a String input which is a gs bucket directory (not file), like gs://fc-/uploads/my-new-collection/
- searches that directory (which only contains the "latest batch of freshly uploaded stuff")
- does pattern matching logic based on Illumina and ONT file-naming rules
- produces a 3-col TSV as output
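To make the pattern-matching idea concrete, here is a minimal Python sketch of what the task's core logic might look like. It is not an existing implementation: the regular expressions, the function names `build_rows` and `to_tsv`, and the column names `read1`/`read2` are all assumptions for illustration, and the bucket listing itself is left out (the function just takes a list of `gs://` URIs). The `entity:<table>_id` header follows the usual Terra load-file convention.

```python
import re
from collections import defaultdict

# Assumed Illumina-style name: <sample>[_S<n>][_L<lane>]_R1|_R2[_001].fastq[.gz]
PAIRED = re.compile(
    r"^(?P<sample>.+?)(?:_S\d+)?(?:_L\d{3})?_R(?P<read>[12])(?:_\d{3})?"
    r"\.(?:fastq|fq)(?:\.gz)?$"
)
# Assumed fallback for single-file samples (e.g. ONT): <sample>.fastq[.gz]
SINGLE = re.compile(r"^(?P<sample>.+?)\.(?:fastq|fq)(?:\.gz)?$")


def build_rows(uris):
    """Group file URIs by sample name into (sample, read1, read2) tuples."""
    samples = defaultdict(lambda: ["", ""])
    for uri in sorted(uris):
        name = uri.rsplit("/", 1)[-1]  # file name without the bucket path
        m = PAIRED.match(name)
        if m:
            # R1 goes in slot 0, R2 in slot 1
            samples[m["sample"]][int(m["read"]) - 1] = uri
            continue
        m = SINGLE.match(name)
        if m:
            samples[m["sample"]][0] = uri
        # Files matching neither pattern are skipped in this sketch.
    return [(s, r1, r2) for s, (r1, r2) in sorted(samples.items())]


def to_tsv(rows, table="sample"):
    """Render rows as a 3-column TSV with a Terra-style header line."""
    lines = [f"entity:{table}_id\tread1\tread2"]
    lines += ["\t".join(row) for row in rows]
    return "\n".join(lines) + "\n"
```

A real task would also need to handle FASTA files, multi-lane samples, and name collisions, which this sketch ignores.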

A subsequent WDL task (in a 2-task workflow) would run optionally, only if a destination Terra table name is provided (String). It would use GCP-specific API calls to insert-or-update rows in the destination table with the data from the 3-col TSV.
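The insert-or-update behaviour could be sketched as below. This deliberately makes no Terra or GCP API calls: the destination table is stood in for by a plain dict, and `upsert_rows` plus the `read1`/`read2` column names are hypothetical, chosen only to show the intended semantics (existing rows are updated in place, new rows are added, untouched columns survive).

```python
def upsert_rows(table, rows):
    """Insert or update rows in an in-memory stand-in for a Terra table.

    table: dict mapping sample ID -> dict of column values (mutated in place)
    rows:  iterable of (sample, read1, read2) tuples from the 3-col TSV
    Returns the lists of inserted and updated sample IDs.
    """
    inserted, updated = [], []
    for sample, read1, read2 in rows:
        new_values = {"read1": read1, "read2": read2}
        if sample in table:
            # Update only the columns we own; keep any other columns intact.
            table[sample].update(new_values)
            updated.append(sample)
        else:
            table[sample] = new_values
            inserted.append(sample)
    return inserted, updated
```

Returning the inserted and updated IDs separately would let the workflow report what changed in the destination table.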