jts / ncov-tools

Small collection of tools for performing quality control on coronavirus sequencing data and genomes
MIT License

KeyError in `format_pileup.py` #110

Open dfornika opened 6 months ago

dfornika commented 6 months ago

We've seen an error in format_pileup.py where a KeyError can be triggered here:

https://github.com/jts/ncov-tools/blob/7a19778c644594fe17ce4fc703560e97b149aa60/workflow/scripts/format_pileup.py#L23

Traceback (most recent call last):
  File "ncov-tools/workflow/rules/../scripts/format_pileup.py", line 23, in <module>
    freqs[b] += 1
KeyError: 'N'

...because the dict isn't initialized with an N key:

https://github.com/jts/ncov-tools/blob/7a19778c644594fe17ce4fc703560e97b149aa60/workflow/scripts/format_pileup.py#L15
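
For readers without the script open, the failure mode can be reproduced in isolation. This is only a minimal sketch: `count_bases` and the four-key initialization are hypothetical stand-ins, assumed from the traceback and the description above, and the real `format_pileup.py` may be structured differently.

```python
# Hypothetical stand-in for the counting loop in format_pileup.py.
# Assumption: the dict is initialized with only the four canonical bases.
freqs = {"A": 0, "C": 0, "G": 0, "T": 0}


def count_bases(bases, freqs):
    """Increment the counter for each base in `bases`."""
    for b in bases:
        freqs[b] += 1  # raises KeyError for any base that is not already a key
    return freqs


try:
    count_bases("ACGTN", dict(freqs))
except KeyError as err:
    print(f"KeyError: {err}")
```

Because `+=` on a plain dict reads the existing value first, any base outside the initialized keys fails on its first occurrence.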

dfornika commented 6 months ago

The error appears to occur when a sample includes reads that have N bases. We ran into this issue when setting up some test data for a CI pipeline.

https://github.com/BCCDC-PHL/ncov-tools-nf/tree/9d10033fee0fb0fa75d4cc8d8c7ddedc51522e24/.github/data/fastqs

Sample SRR27503680-2-25x from the link above (derived from SRA sample SRR27503680) includes reads like this:

ATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTGCTGGTANNNN

...which trigger the error.
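
A quick way to check whether a read contains anything other than A, C, G, or T is a set difference. This is a rough sketch: `unexpected_bases` is a hypothetical helper, not part of ncov-tools, and the read is the one quoted above.

```python
# Read quoted in this issue (from sample SRR27503680-2-25x).
read = "ATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTGCTGGTANNNN"


def unexpected_bases(seq, expected="ACGT"):
    """Return the sorted characters in `seq` that are not in `expected`."""
    return sorted(set(seq.upper()) - set(expected))


print(unexpected_bases(read))
```

For this read the only unexpected character is `N`, which matches the key reported in the traceback.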

rdeborja commented 6 months ago

Added N key in dictionary: https://github.com/jts/ncov-tools/blob/3028cc610830c13e0db1a1bce464e0feeb382fdf/workflow/scripts/format_pileup.py#L15
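
As a rough illustration of that change, assuming the counter is a plain dict keyed by base (the function names here are hypothetical, not taken from the script): the first function mirrors the fix described above by adding an `N` key, and the second shows a `defaultdict` alternative, which is not what was committed but would also tolerate other ambiguity codes.

```python
from collections import defaultdict


def count_bases_with_n(bases):
    """Count bases using a dict that also has an 'N' key."""
    freqs = {"A": 0, "C": 0, "G": 0, "T": 0, "N": 0}
    for b in bases:
        freqs[b] += 1  # still raises KeyError for anything outside ACGTN
    return freqs


def count_bases_tolerant(bases):
    """Alternative sketch: accept any character, creating keys on demand."""
    freqs = defaultdict(int)
    for b in bases:
        freqs[b] += 1
    return dict(freqs)


print(count_bases_with_n("GGTANNNN"))
print(count_bases_tolerant("GGTANNNN"))
```

The explicit-key version keeps the output columns fixed, which matters if downstream code expects a known set of bases; the `defaultdict` version trades that for robustness to unexpected characters.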

See branch fix/metadata-na. If you're happy with the results I'll merge the code and do a release.

hgibling commented 3 months ago

For what it's worth, the team I work with encountered this issue a while back, made the same change, and was able to run ncov-tools on the problematic sample with this solution.