virus-evolution / gofasta

MIT License
31 stars 1 forks source link

panic: runtime error: index out of range [29899] with length 29899 #41

Open Xiang-Leo opened 1 year ago

Xiang-Leo commented 1 year ago

When I used gofasta sam variants, it reported this error:

gofasta sam variants -a /Raid5/lixiang/Indel_accession/reference/NC_045512.2.gb --start 266 --end 29674 --append-snps -t 36 -s 01_20230220_filtered_sequences.sam -o 02_20230220_mutation.indel.csv

ambiguous overlapping alignment: hCoV-19/Brazil/AP-ITV-180097/2021: 28785: AG
ambiguous overlapping alignment: hCoV-19/Brazil/AP-ITV-180097/2021: 28786: GN
ambiguous overlapping alignment: hCoV-19/Brazil/AP-ITV-180097/2021: 28874: CT
panic: runtime error: index out of range [29899] with length 29899

goroutine 146 [running]:
github.com/virus-evolution/gofasta/pkg/variants.getNucsPair({0xc000292000, 0x75fa, 0x1?}, {0xc000286000, 0x75fa, 0x80?}, {0xc000434480, 0x283, 0x74cf?}, {0xc00e02e000, ...}, ...)
        /opt/conda/conda-bld/gofasta_1674483676002/work/pkg/variants/pairwise.go:77 +0x39b
github.com/virus-evolution/gofasta/pkg/variants.GetVariantsPair({0xc000292000, 0x75fa, 0xa000}, {0xc000286000, 0x75fa, 0xa000}, {0xc00025c4c0?, 0xb?}, {0xc00af383f0, 0x26}, ...)
        /opt/conda/conda-bld/gofasta_1674483676002/work/pkg/variants/variants.go:651 +0x185
github.com/virus-evolution/gofasta/pkg/sam.getVariantsSam({0xc0001f8000, 0xc, 0x10}, {0xc000434480, 0x283, 0x350}, 0x4141474143544354?, 0x4754474141414341?, 0x4147545447434154?)
        /opt/conda/conda-bld/gofasta_1674483676002/work/pkg/sam/variants.go:199 +0x36e
github.com/virus-evolution/gofasta/pkg/sam.Variants.func2()
        /opt/conda/conda-bld/gofasta_1674483676002/work/pkg/sam/variants.go:123 +0x4f
created by github.com/virus-evolution/gofasta/pkg/sam.Variants
        /opt/conda/conda-bld/gofasta_1674483676002/work/pkg/sam/variants.go:122 +0xcfc

I searched the sequence of hCoV-19/Brazil/AP-ITV-180097/2021. This sequence had an insertion of 65 nucleotides and 30018 nucleotides. How can I solve this problem?

benjamincjackson commented 1 year ago

Could you 1) report which version of gofasta you are using, and 2) if possible attach the sam file 01_20230220_filtered_sequences.sam to this issue?

Xiang-Leo commented 1 year ago

1) gofasta version: 1.2.0; 2) Sorry. I can't upload this sam file since this file is too large which is over 300GB. I guess this problem was caused by the sequence hCoV-19/Brazil/AP-ITV-180097/2021. The terminal of this sequence was mismatch with WIV04.

benjamincjackson commented 1 year ago

I can replicate the first part of your stderr output with hCoV-19/Brazil/AP-ITV-180097/2021, but not the panic and its error trace, which makes me wonder if that part is due to another sequence.

If you had time you could check what 02_20230220_mutation.indel.csv looks like to see if there was output past hCoV-19/Brazil/AP-ITV-180097/2021: that would confirm my suspicions that there is another problematic sequence later on in the file.

(If you run gofasta sam variants with one thread then the problematic sequence is likely to be the sequence after the final sequence written to the output file before the program crashes)

The problem with hCoV-19/Brazil/AP-ITV-180097/2021 seems likely to be a bug- so I will have a look at that anyway.

Xiang-Leo commented 1 year ago

I try to rerun gofasta sam variants with single thread and search the sequence hCoV-19/Brazil/AP-ITV-180097/2021, it reports this results:

grep "AP-ITV-180097" 02_20230220_mutation.indel.test.csv
hCoV-19/Brazil/AP-ITV-180097/2021,nuc:T733C|nuc:C2749T|nuc:C3037T|aa:ORF1ab:S1188L(nuc:C3828T)|ins:5550:1|aa:ORF1ab:K1795Q(nuc:A5648C)|nuc:A6319G|nuc:A6613G|nuc:T9124A|aa:ORF1ab:Y2954S(nuc:A9126C)|nuc:C12325T|nuc:C12778T|nuc:C13860T|aa:ORF1ab:P4715L(nuc:C14408T)|aa:ORF1ab:E5665D(nuc:G17259T)|aa:ORF1ab:V5721M(nuc:G17425A)|nuc:C21595T|aa:S:L18F(nuc:C21614T)|aa:S:T20N(nuc:C21621A)|aa:S:P26S(nuc:C21638T)|aa:S:D138Y(nuc:G21974T)|aa:S:R190S(nuc:G22132T)|aa:S:K417T(nuc:A22812C)|aa:S:E484K(nuc:G23012A)|aa:S:N501Y(nuc:A23063T)|aa:S:D614G(nuc:A23403G)|aa:S:H655Y(nuc:C23525T)|aa:S:T1027I(nuc:C24642T)|aa:S:V1176F(nuc:G25088T)|aa:ORF3a:K21N(nuc:G25455T)|aa:ORF3a:S253P(nuc:T26149C)|aa:ORF8:E92K(nuc:G28167A)|ins:28263:4|aa:N:P80R(nuc:C28512G)|aa:N:T135I(nuc:C28677T)|aa:N:F171L()|aa:N:Y172L()|aa:N:A173R()|aa:N:E174R()|aa:N:G175R()|aa:N:S176E()|aa:N:R177Q()|aa:N:G178R()|aa:N:G179R()|aa:N:S180Q()|aa:N:Q181S()|aa:N:A182S()|aa:N:S183L()|aa:N:S184F()|aa:N:R185S()|aa:N:S186F()|aa:N:S187L()|aa:N:S188I()|aa:N:R189T()|aa:N:S190*()|aa:N:R191S()|aa:N:N192Q()|aa:N:S193Q()|aa:N:S194F()|aa:N:R195K()|aa:N:N196K()|aa:N:S197F()|aa:N:T198N()|aa:N:P199S()|aa:N:G200R()|aa:N:S201Q()|aa:N:S202L(nuc:A28878T;nuc:G28879C)|aa:N:R203*(nuc:G28882A)|aa:N:G204T(nuc:G28883A;nuc:G28884C)|aa:N:T205N()|aa:N:S206F()|aa:N:P207S()|aa:N:A208C()|aa:N:R209*()|aa:N:M210N()|aa:N:A211G()|aa:N:G212W()|aa:N:N213Q()|aa:N:G214W()|aa:N:G215R()|aa:N:D216*()|aa:N:A217C()|aa:N:A218C()|aa:N:L219S()|aa:N:A220C()|aa:N:L221F()|aa:N:L222A()|aa:N:L223A()|aa:N:L224A()|aa:N:D225*()|aa:N:R226Q()|aa:N:L227I()|aa:N:N228E()|aa:N:Q229P()|aa:N:L230A()|aa:N:E231*()|aa:N:S232E()|aa:N:K233Q()|aa:N:M234N()|aa:N:S235V()|aa:N:G236W()|aa:N:K237*()|aa:N:G238R()|aa:N:Q239P()|aa:N:Q240T()|aa:N:Q241T()|aa:N:Q242T()|aa:N:G243R()|aa:N:Q244P()|aa:N:T245N()|aa:N:V246C()|aa:N:T247H()|aa:N:K248*()|aa:N:K249E()|aa:N:S250I()|aa:N:A251C()|aa:N:A252C()|aa:N:E253*()|aa:N:A254G()|aa:N:S255F()|aa:N:K256*()|aa:N:K257E()|aa:N:P258A()|aa:N:R259S()|aa:N:Q260A()|aa:N:R262T()|aa:N:T263Y()|aa:N:A264C()|aa:N:T265H()|aa:N:K266*()|aa:N:A267S()|aa:N:Y268I()|aa:N:N269Q()|aa:N:V270C()|aa:N:T271N()|aa:N:Q272T()|aa:N:A273S()|aa:N:G275R()|aa:N:R276Q()|aa:N:R277T()|aa:N:G278W()|aa:N:P279S()|aa:N:E280R()|aa:N:Q281T()|aa:N:T282N()|aa:N:Q283P()|aa:N:G284R()|aa:N:N285K()|aa:N:G287W()|aa:N:D288G()|aa:N:Q289P()|aa:N:E290G()|aa:N:L291T()|aa:N:I292N()|aa:N:R293Q()|aa:N:Q294T()|aa:N:G295R()|aa:N:T296N()|aa:N:D297*()|aa:N:Y298L()|aa:N:K299Q()|aa:N:H300T()|aa:N:W301L()|aa:N:P302A()|aa:N:Q303A()|aa:N:I304N()|aa:N:A305C()|aa:N:Q306T()|aa:N:F307I()|aa:N:A308C()|aa:N:S310Q()|aa:N:A311R()|aa:N:S312F()|aa:N:A313S()|aa:N:F314V()|aa:N:F315L()|aa:N:G316R()|aa:N:M317N()|aa:N:S318V()|aa:N:R319A()|aa:N:I320H()|aa:N:G321W()|aa:N:M322H()|aa:N:E323G()|aa:N:V324S()|aa:N:T325H()|aa:N:P326T()|aa:N:S327F()|aa:N:T329N()|aa:N:W330V()|aa:N:L331V()|aa:N:T332D()|aa:N:Y333L()|aa:N:T334H()|aa:N:G335R()|aa:N:A336C()|aa:N:I337H()|aa:N:K338Q()|aa:N:L339I()|aa:N:D340G()|aa:N:D341*()|aa:N:K342Q()|aa:N:D343R()|aa:N:P344S()|aa:N:N345K()|aa:N:K347Q()|aa:N:D348R()|aa:N:Q349S()|aa:N:V350S()|aa:N:I351H()|aa:N:L352F()|aa:N:L353A()|aa:N:N354E()|aa:N:K355*()|aa:N:H356A()|aa:N:I357Y()|aa:N:D358*()|aa:N:A359R()|aa:N:Y360I()|aa:N:K361Q()|aa:N:T362N()|aa:N:F363I()|aa:N:P365T()|aa:N:T366N()|aa:N:E367R()|aa:N:P368A()|aa:N:K369*()|aa:N:D371G()|aa:N:K372Q()|aa:N:K374E()|aa:N:K375E()|aa:N:A376G()|aa:N:D377*()|aa:N:E378*()|aa:N:T379N()|aa:N:Q380S()|aa:N:A381S()|aa:N:P383T()|aa:N:Q384A()|aa:N:R385E()|aa:N:Q386T()|aa:N:K387E()|aa:N:K388E()|aa:N:Q389T()|aa:N:Q390A()|aa:N:T391N()|aa:N:V392C()|aa:N:T393D()|aa:N:L394S()|aa:N:L395S()|aa:N:P396S()|aa:N:A397C()|aa:N:A398C()|aa:N:D399R()|aa:N:L400F()|aa:N:D401G()|aa:N:D402*()|aa:N:S404L()|aa:N:K405Q()|aa:N:Q406T()|aa:N:L407I()|aa:N:Q408A()|aa:N:Q409T()|aa:N:S410I()|aa:N:M411H()|aa:N:S412E()|aa:N:S413Q()|aa:N:A414C()|aa:N:D415*()|aa:N:S416L()|aa:N:T417N()|aa:N:Q418S()|aa:N:A419G()|aa:N:*420L()|aa:ORF10:M1D()|aa:ORF10:Y3L()|aa:ORF10:I4Y()|aa:ORF10:N5K()|aa:ORF10:V6R()|aa:ORF10:A8R()|aa:ORF10:P10S()|aa:ORF10:F11V()|aa:ORF10:T12Y()|aa:ORF10:I13D()|aa:ORF10:Y14I()|aa:ORF10:S15*()|aa:ORF10:L16S()|aa:ORF10:L17T()|aa:ORF10:C19V()|aa:ORF10:R20Q()|aa:ORF10:M21N()|aa:ORF10:N22E()|aa:ORF10:S23F()|aa:ORF10:R24S()|aa:ORF10:N25*()|aa:ORF10:Y26L()|aa:ORF10:I27H()|aa:ORF10:A28S()|aa:ORF10:Q29T()|aa:ORF10:V30S()|aa:ORF10:D31R()|aa:ORF10:V32C()|aa:ORF10:V33S()|aa:ORF10:N34*()|aa:ORF10:F35L()|aa:ORF10:N36*()|aa:ORF10:L37S()|aa:ORF10:T38H()|aa:ORF10:*39I()

Then I checked for the last record before error and it is hCoV-19/USA/WA-UW231/2020. Then I searched the sequence after this with command grep -A 2 "hCoV-19/USA/WA-UW231/2020" 01_20230220_filtered_sequences.sam. The sequence after hCoV-19/USA/WA-UW231/2020 is this:

hCoV-19/Netherlands/UT-RIVM-62129/2021  0       NC_045512.2     55      60      21974M6D997M3D5213M6D17M1D1565M *       0       0       *       NM:i:338        ms:i:29275      AS:i:28970      nn:i:290        tp:A:P  cm:i:5236       s1:i:29132      s2:i:0  de:f:0.0110     rl:i:0
hCoV-19/Sweden/UUH-000199815367N2/2021  0       NC_045512.2     20172   60      20048S175M1D665M3I16M3D6M2I15M2D401M8D26M1D16M1D6M2I6M2I5M6I45M1D194M6D220M3D68M5D6M1I4M4I131M1D7M1I37M1I400M3D23M2I19M2I19M1D426M1D704M6D8M6I490M7D12M1D10M8I257M8D11M1D37M9I379M4D13M1I31M1D7M1I557M2I4M5D19M3I139M2D21M5D11M5I570M1D4M1I10M2I26M3I11M4D695M3I8M3D15M1I4M1D17M2D19M2I55M1I32M1D35M1I17M2435S    *       0       0       *       NM:i:632        ms:i:4676       AS:i:4364       nn:i:71 tp:A:P  cm:i:927        s1:i:5494    s2:i:0  de:f:0.0727     zd:i:3  SA:Z:NC_045512.2,11864,+,11791S4645M2D13286S,60,170;NC_045512.2,7035,+,6963S4657M2D18102S,60,399;NC_045512.2,2827,+,2761S2190M1D24771S,60,41;NC_045512.2,16749,+,16673S3191M9I9849S,60,341;NC_045512.2,5258,+,5189S1616M22917S,60,95;NC_045512.2,27785,+,27608S2077M5D37S,60,202;NC_045512.2,944,+,940S1320M1D27462S,60,131;NC_045512.2,31,+,30S696M28996S,60,36; rl:i:0

The sequence hCoV-19/Netherlands/UT-RIVM-62129/2021 seems no problem?

benjamincjackson commented 1 year ago

Thank you very much for the helpful reply. hCoV-19/Sweden/UUH-000199815367N2/2021 is the sequence that is being processed when gofasta crashes. I am working on this now.

benjamincjackson commented 1 year ago

I made a temporary fix that means hCoV-19/Sweden/UUH-000199815367N2/2021 and similar sequences will be skipped and a sensible error message will be printed to stderr. This sequence is ~5% diverged from Wuhan-1 and the alignment looks like a mess.

BUT the root cause is a bug in the sam -> fasta conversion that isn't fixed, so be wary of the output at the moment. I will try to fix this next.

To get the new binary you will need to install go, then you can either run go install github.com/virus-evolution/gofasta@b62b578 (should be built as ~/go/bin/gofasta), or,

git clone https://github.com/virus-evolution/gofasta.git
cd gofasta
git checkout b62b578
go build -o gofasta

Hopefully then you can run through your big sam file without the program crashing (unless there are other problems).

Xiang-Leo commented 1 year ago

Sorry for not replying for so long. It takes me for a long time to do quality control for sequences. Unfortunately it reported another error. I'm still running sam variant and it's not over (almost a week). I will give you response as long as it's over.