Open Xiang-Leo opened 1 year ago
Could you 1) report which version of gofasta
you are using, and 2) if possible attach the sam file 01_20230220_filtered_sequences.sam
to this issue?
1) gofasta version: 1.2.0;
2) Sorry. I can't upload this sam file since this file is too large which is over 300GB.
I guess this problem was caused by the sequence hCoV-19/Brazil/AP-ITV-180097/2021
. The terminal of this sequence was mismatch with WIV04.
I can replicate the first part of your stderr
output with hCoV-19/Brazil/AP-ITV-180097/2021
, but not the panic and its error trace, which makes me wonder if that part is due to another sequence.
If you had time you could check what 02_20230220_mutation.indel.csv
looks like to see if there was output past hCoV-19/Brazil/AP-ITV-180097/2021
: that would confirm my suspicions that there is another problematic sequence later on in the file.
(If you run gofasta sam variants
with one thread then the problematic sequence is likely to be the sequence after the final sequence written to the output file before the program crashes)
The problem with hCoV-19/Brazil/AP-ITV-180097/2021
seems likely to be a bug- so I will have a look at that anyway.
I try to rerun gofasta sam variants
with single thread and search the sequence hCoV-19/Brazil/AP-ITV-180097/2021
, it reports this results:
grep "AP-ITV-180097" 02_20230220_mutation.indel.test.csv
hCoV-19/Brazil/AP-ITV-180097/2021,nuc:T733C|nuc:C2749T|nuc:C3037T|aa:ORF1ab:S1188L(nuc:C3828T)|ins:5550:1|aa:ORF1ab:K1795Q(nuc:A5648C)|nuc:A6319G|nuc:A6613G|nuc:T9124A|aa:ORF1ab:Y2954S(nuc:A9126C)|nuc:C12325T|nuc:C12778T|nuc:C13860T|aa:ORF1ab:P4715L(nuc:C14408T)|aa:ORF1ab:E5665D(nuc:G17259T)|aa:ORF1ab:V5721M(nuc:G17425A)|nuc:C21595T|aa:S:L18F(nuc:C21614T)|aa:S:T20N(nuc:C21621A)|aa:S:P26S(nuc:C21638T)|aa:S:D138Y(nuc:G21974T)|aa:S:R190S(nuc:G22132T)|aa:S:K417T(nuc:A22812C)|aa:S:E484K(nuc:G23012A)|aa:S:N501Y(nuc:A23063T)|aa:S:D614G(nuc:A23403G)|aa:S:H655Y(nuc:C23525T)|aa:S:T1027I(nuc:C24642T)|aa:S:V1176F(nuc:G25088T)|aa:ORF3a:K21N(nuc:G25455T)|aa:ORF3a:S253P(nuc:T26149C)|aa:ORF8:E92K(nuc:G28167A)|ins:28263:4|aa:N:P80R(nuc:C28512G)|aa:N:T135I(nuc:C28677T)|aa:N:F171L()|aa:N:Y172L()|aa:N:A173R()|aa:N:E174R()|aa:N:G175R()|aa:N:S176E()|aa:N:R177Q()|aa:N:G178R()|aa:N:G179R()|aa:N:S180Q()|aa:N:Q181S()|aa:N:A182S()|aa:N:S183L()|aa:N:S184F()|aa:N:R185S()|aa:N:S186F()|aa:N:S187L()|aa:N:S188I()|aa:N:R189T()|aa:N:S190*()|aa:N:R191S()|aa:N:N192Q()|aa:N:S193Q()|aa:N:S194F()|aa:N:R195K()|aa:N:N196K()|aa:N:S197F()|aa:N:T198N()|aa:N:P199S()|aa:N:G200R()|aa:N:S201Q()|aa:N:S202L(nuc:A28878T;nuc:G28879C)|aa:N:R203*(nuc:G28882A)|aa:N:G204T(nuc:G28883A;nuc:G28884C)|aa:N:T205N()|aa:N:S206F()|aa:N:P207S()|aa:N:A208C()|aa:N:R209*()|aa:N:M210N()|aa:N:A211G()|aa:N:G212W()|aa:N:N213Q()|aa:N:G214W()|aa:N:G215R()|aa:N:D216*()|aa:N:A217C()|aa:N:A218C()|aa:N:L219S()|aa:N:A220C()|aa:N:L221F()|aa:N:L222A()|aa:N:L223A()|aa:N:L224A()|aa:N:D225*()|aa:N:R226Q()|aa:N:L227I()|aa:N:N228E()|aa:N:Q229P()|aa:N:L230A()|aa:N:E231*()|aa:N:S232E()|aa:N:K233Q()|aa:N:M234N()|aa:N:S235V()|aa:N:G236W()|aa:N:K237*()|aa:N:G238R()|aa:N:Q239P()|aa:N:Q240T()|aa:N:Q241T()|aa:N:Q242T()|aa:N:G243R()|aa:N:Q244P()|aa:N:T245N()|aa:N:V246C()|aa:N:T247H()|aa:N:K248*()|aa:N:K249E()|aa:N:S250I()|aa:N:A251C()|aa:N:A252C()|aa:N:E253*()|aa:N:A254G()|aa:N:S255F()|aa:N:K256*()|aa:N:K257E()|aa:N:P258A()|aa:N:R259S()|aa:N:Q260A()|aa:N:R262T()|aa:N:T263Y()|aa:N:A264C()|aa:N:T265H()|aa:N:K266*()|aa:N:A267S()|aa:N:Y268I()|aa:N:N269Q()|aa:N:V270C()|aa:N:T271N()|aa:N:Q272T()|aa:N:A273S()|aa:N:G275R()|aa:N:R276Q()|aa:N:R277T()|aa:N:G278W()|aa:N:P279S()|aa:N:E280R()|aa:N:Q281T()|aa:N:T282N()|aa:N:Q283P()|aa:N:G284R()|aa:N:N285K()|aa:N:G287W()|aa:N:D288G()|aa:N:Q289P()|aa:N:E290G()|aa:N:L291T()|aa:N:I292N()|aa:N:R293Q()|aa:N:Q294T()|aa:N:G295R()|aa:N:T296N()|aa:N:D297*()|aa:N:Y298L()|aa:N:K299Q()|aa:N:H300T()|aa:N:W301L()|aa:N:P302A()|aa:N:Q303A()|aa:N:I304N()|aa:N:A305C()|aa:N:Q306T()|aa:N:F307I()|aa:N:A308C()|aa:N:S310Q()|aa:N:A311R()|aa:N:S312F()|aa:N:A313S()|aa:N:F314V()|aa:N:F315L()|aa:N:G316R()|aa:N:M317N()|aa:N:S318V()|aa:N:R319A()|aa:N:I320H()|aa:N:G321W()|aa:N:M322H()|aa:N:E323G()|aa:N:V324S()|aa:N:T325H()|aa:N:P326T()|aa:N:S327F()|aa:N:T329N()|aa:N:W330V()|aa:N:L331V()|aa:N:T332D()|aa:N:Y333L()|aa:N:T334H()|aa:N:G335R()|aa:N:A336C()|aa:N:I337H()|aa:N:K338Q()|aa:N:L339I()|aa:N:D340G()|aa:N:D341*()|aa:N:K342Q()|aa:N:D343R()|aa:N:P344S()|aa:N:N345K()|aa:N:K347Q()|aa:N:D348R()|aa:N:Q349S()|aa:N:V350S()|aa:N:I351H()|aa:N:L352F()|aa:N:L353A()|aa:N:N354E()|aa:N:K355*()|aa:N:H356A()|aa:N:I357Y()|aa:N:D358*()|aa:N:A359R()|aa:N:Y360I()|aa:N:K361Q()|aa:N:T362N()|aa:N:F363I()|aa:N:P365T()|aa:N:T366N()|aa:N:E367R()|aa:N:P368A()|aa:N:K369*()|aa:N:D371G()|aa:N:K372Q()|aa:N:K374E()|aa:N:K375E()|aa:N:A376G()|aa:N:D377*()|aa:N:E378*()|aa:N:T379N()|aa:N:Q380S()|aa:N:A381S()|aa:N:P383T()|aa:N:Q384A()|aa:N:R385E()|aa:N:Q386T()|aa:N:K387E()|aa:N:K388E()|aa:N:Q389T()|aa:N:Q390A()|aa:N:T391N()|aa:N:V392C()|aa:N:T393D()|aa:N:L394S()|aa:N:L395S()|aa:N:P396S()|aa:N:A397C()|aa:N:A398C()|aa:N:D399R()|aa:N:L400F()|aa:N:D401G()|aa:N:D402*()|aa:N:S404L()|aa:N:K405Q()|aa:N:Q406T()|aa:N:L407I()|aa:N:Q408A()|aa:N:Q409T()|aa:N:S410I()|aa:N:M411H()|aa:N:S412E()|aa:N:S413Q()|aa:N:A414C()|aa:N:D415*()|aa:N:S416L()|aa:N:T417N()|aa:N:Q418S()|aa:N:A419G()|aa:N:*420L()|aa:ORF10:M1D()|aa:ORF10:Y3L()|aa:ORF10:I4Y()|aa:ORF10:N5K()|aa:ORF10:V6R()|aa:ORF10:A8R()|aa:ORF10:P10S()|aa:ORF10:F11V()|aa:ORF10:T12Y()|aa:ORF10:I13D()|aa:ORF10:Y14I()|aa:ORF10:S15*()|aa:ORF10:L16S()|aa:ORF10:L17T()|aa:ORF10:C19V()|aa:ORF10:R20Q()|aa:ORF10:M21N()|aa:ORF10:N22E()|aa:ORF10:S23F()|aa:ORF10:R24S()|aa:ORF10:N25*()|aa:ORF10:Y26L()|aa:ORF10:I27H()|aa:ORF10:A28S()|aa:ORF10:Q29T()|aa:ORF10:V30S()|aa:ORF10:D31R()|aa:ORF10:V32C()|aa:ORF10:V33S()|aa:ORF10:N34*()|aa:ORF10:F35L()|aa:ORF10:N36*()|aa:ORF10:L37S()|aa:ORF10:T38H()|aa:ORF10:*39I()
Then I checked for the last record before error and it is hCoV-19/USA/WA-UW231/2020
. Then I searched the sequence after this with command grep -A 2 "hCoV-19/USA/WA-UW231/2020" 01_20230220_filtered_sequences.sam
. The sequence after hCoV-19/USA/WA-UW231/2020
is this:
hCoV-19/Netherlands/UT-RIVM-62129/2021 0 NC_045512.2 55 60 21974M6D997M3D5213M6D17M1D1565M * 0 0 * NM:i:338 ms:i:29275 AS:i:28970 nn:i:290 tp:A:P cm:i:5236 s1:i:29132 s2:i:0 de:f:0.0110 rl:i:0
hCoV-19/Sweden/UUH-000199815367N2/2021 0 NC_045512.2 20172 60 20048S175M1D665M3I16M3D6M2I15M2D401M8D26M1D16M1D6M2I6M2I5M6I45M1D194M6D220M3D68M5D6M1I4M4I131M1D7M1I37M1I400M3D23M2I19M2I19M1D426M1D704M6D8M6I490M7D12M1D10M8I257M8D11M1D37M9I379M4D13M1I31M1D7M1I557M2I4M5D19M3I139M2D21M5D11M5I570M1D4M1I10M2I26M3I11M4D695M3I8M3D15M1I4M1D17M2D19M2I55M1I32M1D35M1I17M2435S * 0 0 * NM:i:632 ms:i:4676 AS:i:4364 nn:i:71 tp:A:P cm:i:927 s1:i:5494 s2:i:0 de:f:0.0727 zd:i:3 SA:Z:NC_045512.2,11864,+,11791S4645M2D13286S,60,170;NC_045512.2,7035,+,6963S4657M2D18102S,60,399;NC_045512.2,2827,+,2761S2190M1D24771S,60,41;NC_045512.2,16749,+,16673S3191M9I9849S,60,341;NC_045512.2,5258,+,5189S1616M22917S,60,95;NC_045512.2,27785,+,27608S2077M5D37S,60,202;NC_045512.2,944,+,940S1320M1D27462S,60,131;NC_045512.2,31,+,30S696M28996S,60,36; rl:i:0
The sequence hCoV-19/Netherlands/UT-RIVM-62129/2021
seems no problem?
Thank you very much for the helpful reply. hCoV-19/Sweden/UUH-000199815367N2/2021
is the sequence that is being processed when gofasta crashes. I am working on this now.
I made a temporary fix that means hCoV-19/Sweden/UUH-000199815367N2/2021
and similar sequences will be skipped and a sensible error message will be printed to stderr
. This sequence is ~5% diverged from Wuhan-1 and the alignment looks like a mess.
BUT the root cause is a bug in the sam -> fasta conversion that isn't fixed, so be wary of the output at the moment. I will try to fix this next.
To get the new binary you will need to install go, then you can either run go install github.com/virus-evolution/gofasta@b62b578
(should be built as ~/go/bin/gofasta
), or,
git clone https://github.com/virus-evolution/gofasta.git
cd gofasta
git checkout b62b578
go build -o gofasta
Hopefully then you can run through your big sam file without the program crashing (unless there are other problems).
Sorry for not replying for so long. It takes me for a long time to do quality control for sequences. Unfortunately it reported another error. I'm still running sam variant and it's not over (almost a week). I will give you response as long as it's over.
When I used
gofasta sam variants
, it reported this error:I searched the sequence of
hCoV-19/Brazil/AP-ITV-180097/2021
. This sequence had an insertion of 65 nucleotides and 30018 nucleotides. How can I solve this problem?