virus-evolution / gofasta

MIT License
34 stars 1 forks source link

possible bug in gofasta sam toMultiAlign #21

Closed rmcolq closed 2 years ago

rmcolq commented 2 years ago

Datapipe uses the following gofasta command:

gofasta sam toMultiAlign -t ${task.cpus} \
      --samfile ${sam} \
      --reference ${reference_fasta} \
      --pad \
      -o alignment.fasta

As reported on COG-UK slack, AY.4.2 on V3 primers often has amplicon 72 dropout (like all Delta), but also seems to have poor amplicon 73 performance (not entirely sure why). This sometimes results in low (say below 100) depth, but not below the 10x cut-off in the calling pipeline. So you get Y145H called in the consensus, at the very beginning of the amplicon 73. But something happens in the trim/pad step that seems to replace the first few bases in 73 with N, probably related to the nearby 6bp deletion.

For example, PLYM-205EBC0

consensus fasta in this area is: CAACTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCTTCCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTATCACCACAAAAACAACAAAAGTTGGATGGAAAGTGGAGTTTATTC

showing 72 dropped out, and then TTATC… at the begninning of 73

but the trimpad fasta is: CAACTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACCACAAAAACAACAAAAGTTGGATGGAAAGTG------GAGTTTATTC

where it has correctly aligned the deletion to the ref, but replaced the small amount of non-N sequence in 72 with N’s, and also added five N’s to the beginning of 73 This seems like unexpected behaviour. And what it results in is AY.4.2 without Y145H called, even though the consensus fasta supports it (the “T” at 21995, 5bp into amplicon 73)

jcbarret commented 2 years ago

Thanks @rmcolq . I was the one who first brought this up, and I can post the before/after fastas if that's helpful.

jcbarret commented 2 years ago

Archive.zip Here's a zip of the input and output fastas.

jcbarret commented 2 years ago

And one more note. I checked a set of 2569 AY.4 sequences with A222V but not Y145H, and 2341 had this pattern, so it does seem to be a common source of Y145H dropout.

benjamincjackson commented 2 years ago

I think that gofasta is failthfully recapitulating the minimap2 alignment here.

For example, with

minimap2 -a -x asm5 MN908947.fa in.fa > out.sam

out.sam looks like:

@SQ     SN:MN908947.3   LN:29903
@PG     ID:minimap2     PN:minimap2     VN:2.17-r941    CL:minimap2 -a -x asm5 MN908947.fa in.fa
COGUK/PLYM-205EBC0/SANG:211010_A00714_0417_BHKNJ3DRXY|43684     0       MN908947.3      321     60      320S21396M8174S *       0       0       NNNNNNN...
COGUK/PLYM-205EBC0/SANG:211010_A00714_0417_BHKNJ3DRXY|43684     2048    MN908947.3      21996   60      21995H33M6D6213M6D17M1D1565M67H *       0       0       ACCACA...

i.e the primary alignment stops at position (320 + 21396 =) 21716, and the supplemental alignment is hard-clipped up to and including base 21995, which you can see from the cigar and also very beginning of the seq line that I've otherwise trimmed. Everything between those two is given an N because there's no info from minimap2.

Then as you note:

❯ gofasta sam toMultiAlign -s out.sam | cut -c 21995

N

One possible solution is to add --score-N=0 to the minimap2 command (see this issue). This seems to work for this example sequence for the bit of the genome that we're interested in:

❯ minimap2 -a -x asm5 --score-N=0 -a MN908947.fa in.fa > N0.sam

❯ less N0.sam
@SQ     SN:MN908947.3   LN:29903
@PG     ID:minimap2     PN:minimap2     VN:2.17-r941    CL:minimap2 -a -x asm5 --score-N=0 -a MN908947.fa in.fa
COGUK/PLYM-205EBC0/SANG:211010_A00714_0417_BHKNJ3DRXY|43684     0       MN908947.3      321     60      320S21708M6D6213M6D17M1D1565M67S ...
❯ gofasta sam toMultiAlign -s N0.sam | cut -c 21995

C

But I don't know how the alignments will differ overall for the rest of the dataset...

sams.zip

jcbarret commented 2 years ago

Hmm, that seems to be the issue. I don't get the same output from minmap2, though. Using the same command (though a different exact reference file, but it should be identical I would think) I get a single alignment the whole length, rather than two.

minimap2 version is 2.10-r761...what's yours?

jcbarret commented 2 years ago

Yeah, latest version is 2.22, so I suspect it is something that was introduced at some point...and it also looks like it surprised the poster of that issue. Could try running with --score-N=0 on a bunch of alignments and see what changes?

benjamincjackson commented 2 years ago
❯ minimap2 --version
2.17-r941

Back at work next week, I will have a look then

benjamincjackson commented 2 years ago

This isn't a bug in gofasta so I will close this.

I did some exploration of minimap2 settings on the whole UK dataset. Rachel plans to implement the change to add --score-N=0 in a full dummy run before incorporating it into the pipeline, but for the record:

I aligned everything that came out of elan on 25th October (1.2million sequences) using:

  1. minimap2 -a -x asm20
  2. minimap2 -a -x asm20 --score-N=0

(the pipeline uses the first command at the moment)

Then called SNPs on both alignments.

~30,000 sequences had a different set of SNPs between the two commands. For every sequence that had different SNP calls, the SNPs from the first command were always a subset of the SNPs from the second command, and the mean number of extra SNPs for those 30,000 seqs using the second command was 1.15.

ATGC coverage increased from 96.69% to 96.71% from the first command to the second.

A closer investigation of AY.4.2 suggested that whether S:145 made it into the alignment complete or not without --score-N=0 was dependent on the extent of the amplicon dropout 3'-wards of S:145, with shorter dropout resulting in an alignment that fully covered 145. Can follow this up elsewhere if needs be...