pjedge / longshot

diploid SNV caller for error-prone reads
MIT License
185 stars 26 forks source link

Problem when running Longshot with too few reads/too small coverage using -A #72

Open Coppini opened 3 years ago

Coppini commented 3 years ago

I'm having problems with Longshot when running the Artic pipeline (even after updating Longshot to v0.4.3, since Artic still uses v0.4.1), in this command:

longshot \
    -P 0 \
    -F -A \
    --no_haps \
    --bam barcode91.primertrimmed.rg.sorted.bam \
    --ref primer_schemes/SARS-CoV-2/V1200/SARS-CoV-2.reference.fasta \
    --out barcode91.merged.vcf \
    --potential_variants barcode91.merged.vcf.gz

The BAM file I'm using there has only one read, with few bases, resulting in a very small coverage. This was expected, since these were negative samples, where we shouldn't be finding (Sars-Cov-2) reads. Longshot ends up erroring with the following messages:

2021-08-23 10:03:32 Automatically determining max read coverage.
2021-08-23 10:03:32 Estimating mean read coverage...
2021-08-23 10:03:32 WARNING: Max coverage calculation is highly likely to be incorrect. The number of reference bases covered by the bam file (857) differs significantly from the expected number of positions in the reference (29903). If you are using a bam file that only covers part of the genome, please specify this region exactly with the --region argument so the number of reference bases is known. Alternatively, disable maximum coverage filtering by setting -C to a large number.
2021-08-23 10:03:32 Total reference positions: 29903
2021-08-23 10:03:32 Total bases in bam: 857
2021-08-23 10:03:32 Mean read coverage: 0.03
error: {} ERROR: Max read coverage set to 0. printing empty VCF file

It seems the problem comes from the -A flag:

-A, --auto_max_cov        Automatically calculate mean coverage for region and set max coverage to mean_coverage +
                              5*sqrt(mean_coverage). (SLOWER)

Since Mean read coverage = 0.03, it calculates Max coverage = 0.03+5*sqrt(0.03) = 0.89602540378, and rounds down to 0.

Would it be possible to have -A round the maximum coverage up, instead of down?

vibansal commented 3 years ago

Longshot uses the input reads to estimate sequencing error rates so it will not work well for files with very few reads. Can you not use the -A flag?

Coppini commented 3 years ago

Yes, I believe that would be better as well. I'm using the existing Artic pipeline as mentioned, but already opened an issue with them concerning this problem. https://github.com/artic-network/fieldbioinformatics/issues/91

Thank you either way.