theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[Freyja_Plot] -- Fail gracefully with low coverage samples #329

Open kevinlibuit opened 4 months ago

kevinlibuit commented 4 months ago

:pushpin: Explain the Request

Freyja Plot filters samples out based on the min_cov input value. If every sample in a set has coverage below this threshold, all of them are filtered out, leaving nothing for the Freyja workflow to plot. This results in an error when Freyja plot tries to parse the empty aggregated file:

Traceback (most recent call last):
  File "/opt/conda/envs/freyja-env/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3652, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'summarized'

The workflow then fails with no obvious information relayed to the user.

Here's a bit more information on coverage from the Freyja GitHub page: "The coverage value provides the 10x coverage estimate (percent of sites with 10 or greater reads; 10 is the default but can be modified using the --covcut option in demix)."
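To make that definition concrete, here is a minimal sketch of the arithmetic. It is an illustration only, not Freyja's implementation: the function name `coverage_at_depth` and the plain list of per-site read depths are assumptions made for this example.

```python
def coverage_at_depth(depths, covcut=10):
    """Return the percent of sites whose read depth is at least `covcut`.

    `depths` is a hypothetical list of per-site read depths; `covcut`
    mirrors the default of 10 described for the --covcut option.
    """
    if not depths:
        # No sites at all: report 0% rather than dividing by zero.
        return 0.0
    covered = sum(1 for d in depths if d >= covcut)
    return 100.0 * covered / len(depths)


# Four sites, two of which have 10 or more reads.
example = coverage_at_depth([0, 5, 10, 20])
```

A sample whose value falls below min_cov is the kind that Freyja Plot filters out.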

:books: Context

Freyja Plot fails on a dataset of low-coverage samples.

:chart_with_upwards_trend: Desired Behavior

Freyja Plot fails gracefully, with a message indicating the likely cause.
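One way the workflow could do this is a pre-check on the aggregated file before plotting. The sketch below is hedged: it assumes the aggregated file is tab-separated with the sample name in the first column and a `coverage` column, and that samples are kept when coverage is strictly greater than min_cov. The function names are hypothetical and are not part of Freyja or the workflow.

```python
import csv
import io


def samples_passing_coverage(handle, min_cov):
    """List sample names whose coverage is above `min_cov`.

    Assumes a tab-separated file with a header row, the sample name in
    the first column, and a 'coverage' column (an assumption about the
    aggregated file layout).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or "coverage" not in reader.fieldnames:
        raise ValueError("aggregated file has no 'coverage' column")
    sample_col = reader.fieldnames[0]
    return [row[sample_col] for row in reader
            if float(row["coverage"]) > min_cov]


def low_coverage_message(handle, min_cov):
    """Return a user-facing message if nothing would be left to plot."""
    if samples_passing_coverage(handle, min_cov):
        return None
    return (f"No samples have coverage above min_cov={min_cov}; "
            "nothing to plot. Consider lowering min_cov or checking "
            "sequencing depth.")


# Hypothetical two-sample aggregated file, both below a min_cov of 60.
demo = "\tsummarized\tcoverage\nsampleA\t[]\t12.5\nsampleB\t[]\t40.0\n"
message = low_coverage_message(io.StringIO(demo), 60)
if message is not None:
    print(message)
```

Surfacing the message before the plotting step would give users the cause directly instead of a KeyError from pandas.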

:information_source: Additional Information

I also raised an issue on the Freyja repo, as this could be implemented at the Freyja plot level instead.

michellescribner commented 4 months ago

In addition to modifying the Freyja Plot workflow to fail more gracefully, the same underlying user error could also be helped by modifying the Freyja FASTQ workflow to expose the mean coverage value for each sample. It is currently present within the demixed file but could be shown directly in the outputs as well.
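As a rough sketch of what exposing that value might involve, the snippet below pulls the coverage out of a demixed file. It assumes the file is tab-separated with a row label in the first column (including a `coverage` row) and the value in the second; the function name `read_demix_coverage` is hypothetical.

```python
import io


def read_demix_coverage(handle):
    """Return the value on the 'coverage' row as a float, or None.

    Assumes one labelled row per line, tab-separated, with the label in
    the first field and the value in the second (an assumption about
    the demixed file layout).
    """
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0] == "coverage":
            return float(fields[1])
    return None


# Hypothetical demixed file contents for a single sample.
demo = "\tsample.tsv\nsummarized\t[]\nresid\t3.1\ncoverage\t42.7\n"
coverage = read_demix_coverage(io.StringIO(demo))
```

A task could write this value to its own output so users see low coverage before they reach Freyja Plot.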

joshuailevy commented 4 months ago

Bit of cross posting, but we've already added in a clause to catch these cases here: https://github.com/andersen-lab/Freyja/blob/1fa14df1ad2512cb50620cff4296d6df4107b5e7/freyja/_cli.py#L377 Planning to include this in our next release.

However, there may be a better way for these failures to happen such that users can detect them more easily (if they aren't accessing via terminal, for instance). Happy to modify the failure response if that's of interest!

kevinlibuit commented 4 months ago

Thanks, @joshuailevy.

This is perfect. We should be able to adjust things accordingly with that catch. We'll be on the lookout for the next release!