theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[TheiaCoV_ONT] workflow fails when assembly cannot be produced #271

Closed emmadoughty closed 1 month ago

emmadoughty commented 8 months ago

:bug:

:pencil: Describe the Issue

When an assembly cannot be produced, TheiaCoV_ONT fails instead of completing the workflow.

:repeat: How to Reproduce

:fishing_pole_and_fish: Expected Behavior

The workflow should complete successfully, but information that the assembly could not be created (and ideally why) should be output to the Terra data table. A workflow failure suggests that the workflow would succeed if it were run with different parameters. This is not the case here.

:floppy_disk: Version Information

TheiaCoV_ONT, main (27-11-23)

sage-wright commented 8 months ago

Help! I don't think I understand what the exact issue is here. If an assembly wasn't produced, isn't a failure a clear sign that the workflow failed? If it succeeds, people won't always look to see if an assembly was created and will run into issues later.

emmadoughty commented 8 months ago

I have always understood a workflow failure to mean there is an issue with the workflow or the way it was set up, and that sample data quality issues should be identified later during QC.

Here, the workflow failed because of an issue with the sequencing data. Because this is reported as a workflow failure, the user may spend a lot of time troubleshooting and re-running the workflow to try to get it to succeed, which it won't.

Instead, it would be better for the workflow to succeed, showing the user that they don't need to re-run it, and to communicate the data quality issue via the data table so it can be identified during QC. Perhaps we could use an output column that communicates the issue, like we do with the read screening output columns; maybe a catch-all "QC_notes" column that reports any tasks that could not be completed.

emmadoughty commented 8 months ago

Alternatively, users could spot the missing assembly without another output column: the workflow would succeed and the QC metrics would reflect the lack of an assembly, for example with assembly length and percent genome coverage both being 0.
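A minimal sketch of the two ideas above, written in Python for illustration only. It is not code from the TheiaCoV_ONT workflow: the function name `summarize_assembly`, the dictionary keys, and the note text are all hypothetical, and percent coverage is approximated here as unambiguous bases divided by reference length. When the assembly file is missing or empty, the function reports zeroed metrics and a note instead of raising an error.

```python
import os


def summarize_assembly(fasta_path, reference_length):
    """Return QC metrics for an assembly FASTA (hypothetical helper).

    If no assembly exists, return zeroed metrics plus a note rather than failing.
    """
    def no_assembly():
        # Zeroed metrics signal a data problem without a workflow failure
        return {
            "assembly_length": 0,
            "percent_reference_coverage": 0.0,
            "qc_notes": "assembly could not be produced",
        }

    # A missing or empty file is treated as "no assembly produced"
    if not os.path.isfile(fasta_path) or os.path.getsize(fasta_path) == 0:
        return no_assembly()

    total = 0
    unambiguous = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip FASTA header lines
            seq = line.strip().upper()
            total += len(seq)
            unambiguous += sum(seq.count(base) for base in "ACGT")

    # A file containing only headers also counts as no assembly
    if total == 0:
        return no_assembly()

    return {
        "assembly_length": total,
        "percent_reference_coverage": round(100 * unambiguous / reference_length, 2),
        "qc_notes": "",
    }
```

With something like this, a sample that cannot be assembled would still yield a row in the data table, and the zeros or the note would flag it during QC.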

sage-wright commented 8 months ago

Potential solution:

kapsakcj commented 2 months ago

I have 1 sample we could use for testing a solution. This sample was not assembled by IRMA, leading to workflow failure.

Location in Terra (and samplename) is noted here: https://www.notion.so/theiagen/TheiaCoV-Flu-improvements-d8de686d4e384c05b647e2c82c28f187?pvs=4#02da41c090ea41bfacac0ad7c05cbdc6