nf-core / viralrecon

Assembly and intrahost/low-frequency variant calling for viral samples
https://nf-co.re/viralrecon
MIT License

Artic v5 mismatched primer names in artic-ncov2019 repo cause certain amplicons to be erroneously filtered/removed #392

Open Sam-Sims opened 1 year ago

Sam-Sims commented 1 year ago

Description of the bug

Hello!

By default viralrecon pulls the Artic V5 bed file from: https://github.com/artic-network/artic-ncov2019/raw/master/primer_schemes/nCoV-2019/V5.3.2/SARS-CoV-2.scheme.bed

However, in this version of the bed file, the primer names are mismatched for the following pairs: SARS-CoV-2_3, SARS-CoV-2_31, SARS-CoV-2_62, SARS-CoV-2_89, SARS-CoV-2_96

As a result, reads belonging to those pairs are erroneously filtered out by artic minion and are consequently missing from the primertrimmed.rg.sorted.bam file. The amplicons for those regions then appear "dropped", even though there is coverage. I have outlined this issue in more detail here: https://github.com/artic-network/fieldbioinformatics/issues/126

The workaround I have found so far is to manually use the bed file from artic-network/primer-schemes: https://github.com/artic-network/primer-schemes/blob/master/nCoV-2019/V5.3.2/SARS-CoV-2.scheme.bed

This requires removing the sequence column from that bed file, because the collapse_primer_bed.py script expects exactly 6 columns (line 58): chrom, start, end, name, score, strand = line.strip().split("\t")
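A minimal sketch of that column removal, assuming the extra sequence column is the last tab-separated field and that only the first six columns are needed. The function name `strip_extra_bed_columns` and the sample row are made up for illustration and are not part of viralrecon:

```python
def strip_extra_bed_columns(lines, keep=6):
    """Yield BED lines truncated to the first `keep` tab-separated columns.

    Hypothetical helper: assumes any columns beyond `keep` (for example a
    primer sequence column) can be dropped safely.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Pass blank lines and comment lines through unchanged.
        if not line or line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        yield "\t".join(fields[:keep])


if __name__ == "__main__":
    # Made-up 7-column row; the last field stands in for a primer sequence.
    sample = ["chrom\t25\t50\tprimer_1_LEFT\t1\t+\tACGTACGT\n"]
    for trimmed in strip_extra_bed_columns(sample):
        print(trimmed)
```

The trimmed lines can then be written to a new file and passed to the pipeline in place of the default bed file.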

I think this currently affects everyone using the Artic V5 scheme with viralrecon. It might be useful to use the bed file in artic-network/primer-schemes for now, and modify collapse_primer_bed.py to handle the sequence column?
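One possible shape for that change, as a rough sketch only: unpack the first six fields and ignore anything after them, so both 6-column and 7-column files parse. The function `parse_primer_line` is hypothetical and is not the actual code in collapse_primer_bed.py; only the six field names come from the line quoted above.

```python
def parse_primer_line(line):
    """Parse one primer BED line, tolerating extra trailing columns.

    Hypothetical replacement for a strict six-way unpack: takes the first
    six tab-separated fields and ignores the rest (e.g. a sequence column).
    """
    fields = line.strip().split("\t")
    if len(fields) < 6:
        raise ValueError(
            f"expected at least 6 columns, found {len(fields)}: {line!r}"
        )
    chrom, start, end, name, score, strand = fields[:6]
    return chrom, start, end, name, score, strand


if __name__ == "__main__":
    # Made-up rows: one with six columns, one with a seventh sequence column.
    print(parse_primer_line("chrom\t25\t50\tprimer_1_LEFT\t1\t+"))
    print(parse_primer_line("chrom\t25\t50\tprimer_1_LEFT\t1\t+\tACGTACGT"))
```

Slicing keeps the existing behaviour for 6-column files, and the explicit length check reports the offending line for files with too few columns.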

I would be happy to open a PR, but I think the config in the nf-core/configs repo needs to be changed to modify the download URL?

Thanks, Sam

Command used and terminal output

No response

Relevant files

No response

System information

No response