Closed rmcolq closed 2 years ago
Thanks @rmcolq . I was the one who first brought this up, and I can post the before/after fastas if that's helpful.
Archive.zip Here's a zip of the input and output fastas.
And one more note. I checked a set of 2569 AY.4 sequences with A222V but not Y145H, and 2341 had this pattern, so it does seem to be a common source of Y145H dropout.
I think that gofasta is failthfully recapitulating the minimap2 alignment here.
For example, with
minimap2 -a -x asm5 MN908947.fa in.fa > out.sam
out.sam
looks like:
@SQ SN:MN908947.3 LN:29903
@PG ID:minimap2 PN:minimap2 VN:2.17-r941 CL:minimap2 -a -x asm5 MN908947.fa in.fa
COGUK/PLYM-205EBC0/SANG:211010_A00714_0417_BHKNJ3DRXY|43684 0 MN908947.3 321 60 320S21396M8174S * 0 0 NNNNNNN...
COGUK/PLYM-205EBC0/SANG:211010_A00714_0417_BHKNJ3DRXY|43684 2048 MN908947.3 21996 60 21995H33M6D6213M6D17M1D1565M67H * 0 0 ACCACA...
i.e the primary alignment stops at position (320 + 21396 =) 21716, and the supplemental alignment is hard-clipped up to and including base 21995, which you can see from the cigar and also very beginning of the seq line that I've otherwise trimmed. Everything between those two is given an N because there's no info from minimap2.
Then as you note:
❯ gofasta sam toMultiAlign -s out.sam | cut -c 21995
N
One possible solution is to add --score-N=0
to the minimap2 command (see this issue). This seems to work for this example sequence for the bit of the genome that we're interested in:
❯ minimap2 -a -x asm5 --score-N=0 -a MN908947.fa in.fa > N0.sam
❯ less N0.sam
@SQ SN:MN908947.3 LN:29903
@PG ID:minimap2 PN:minimap2 VN:2.17-r941 CL:minimap2 -a -x asm5 --score-N=0 -a MN908947.fa in.fa
COGUK/PLYM-205EBC0/SANG:211010_A00714_0417_BHKNJ3DRXY|43684 0 MN908947.3 321 60 320S21708M6D6213M6D17M1D1565M67S ...
❯ gofasta sam toMultiAlign -s N0.sam | cut -c 21995
C
But I don't know how the alignments will differ overall for the rest of the dataset...
Hmm, that seems to be the issue. I don't get the same output from minmap2, though. Using the same command (though a different exact reference file, but it should be identical I would think) I get a single alignment the whole length, rather than two.
minimap2 version is 2.10-r761...what's yours?
Yeah, latest version is 2.22, so I suspect it is something that was introduced at some point...and it also looks like it surprised the poster of that issue. Could try running with --score-N=0 on a bunch of alignments and see what changes?
❯ minimap2 --version
2.17-r941
Back at work next week, I will have a look then
This isn't a bug in gofasta so I will close this.
I did some exploration of minimap2 settings on the whole UK dataset. Rachel plans to implement the change to add --score-N=0
in a full dummy run before incorporating it into the pipeline, but for the record:
I aligned everything that came out of elan on 25th October (1.2million sequences) using:
minimap2 -a -x asm20
minimap2 -a -x asm20 --score-N=0
(the pipeline uses the first command at the moment)
Then called SNPs on both alignments.
~30,000 sequences had a different set of SNPs between the two commands. For every sequence that had different SNP calls, the SNPs from the first command were always a subset of the SNPs from the second command, and the mean number of extra SNPs for those 30,000 seqs using the second command was 1.15.
ATGC coverage increased from 96.69% to 96.71% from the first command to the second.
A closer investigation of AY.4.2 suggested that whether S:145 made it into the alignment complete or not without --score-N=0
was dependent on the extent of the amplicon dropout 3'-wards of S:145, with shorter dropout resulting in an alignment that fully covered 145. Can follow this up elsewhere if needs be...
Datapipe uses the following gofasta command:
As reported on COG-UK slack, AY.4.2 on V3 primers often has amplicon 72 dropout (like all Delta), but also seems to have poor amplicon 73 performance (not entirely sure why). This sometimes results in low (say below 100) depth, but not below the 10x cut-off in the calling pipeline. So you get Y145H called in the consensus, at the very beginning of the amplicon 73. But something happens in the trim/pad step that seems to replace the first few bases in 73 with N, probably related to the nearby 6bp deletion.
For example, PLYM-205EBC0
consensus fasta in this area is: CAACTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCTTCCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTATCACCACAAAAACAACAAAAGTTGGATGGAAAGTGGAGTTTATTC
showing 72 dropped out, and then TTATC… at the begninning of 73
but the trimpad fasta is: CAACTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACCACAAAAACAACAAAAGTTGGATGGAAAGTG------GAGTTTATTC
where it has correctly aligned the deletion to the ref, but replaced the small amount of non-N sequence in 72 with N’s, and also added five N’s to the beginning of 73 This seems like unexpected behaviour. And what it results in is AY.4.2 without Y145H called, even though the consensus fasta supports it (the “T” at 21995, 5bp into amplicon 73)