virus-evolution / gofasta

MIT License
35 stars 1 forks source link

annotation features from reverse strand #37

Open fifthguy opened 2 years ago

fifthguy commented 2 years ago

Error: strconv.Atoi: parsing "complement(3956": invalid syntax

benjamincjackson commented 2 years ago

Hi Tomaž,

Sorry about the reverse strand thing, I will start working on a fix for this. Everything has been so SARS-CoV-2-centric...

Do you have example commands (and files) for the empty fasta file error?

fifthguy commented 2 years ago

Hi, don't worry, I guess this is what the issues sections are for. You needed the code to do something, I need it to do something else. Thank you for the effort in the first place, as well as for this discussion.

I was working on some other files on Friday for recreation I toyed genbank files today to recreate. The second point seems to be more complicated (as in different errors, but cannot recreate the one I wrote about... for some reason)

first efetch the fasta and gb files for OP351278 and KJ642619.

efetch -id ID -db nucleotide -format fasta/gb > ID.fa/gb

minimap2 -a -x asm20 --score-N=0 OP351278.fa KJ642619.fa | gofasta sam topa -r OP351278.fa -o out3 [12:35:19] [M::mm_idx_gen::0.0131.11] collected minimizers [M::mm_idx_gen::0.0201.66] sorted minimizers [M::main::0.0201.66] loaded/built the index for 1 target sequence(s) [M::mm_mapopt_update::0.0211.62] mid_occ = 100 [M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 0; #seq: 1 [M::mm_idx_stat::0.023*1.58] distinct minimizers: 34502 (96.70% are singletons); average occurrences: 1.036; average spacing: 5.519 panic: runtime error: index out of range [0] with length 0

goroutine 19 [running]: github.com/virus-evolution/gofasta/pkg/fastaio.ReadAlignment({0x6412a8?, 0xc0001980b8?}, 0x0?, 0x0?, 0x0?) //gofasta-1.1.0/pkg/fastaio/fastaio.go:233 +0x6f9 created by github.com/virus-evolution/gofasta/pkg/sam.ToPairAlign //gofasta-1.1.0/pkg/sam/topa.go:407 +0x1ac

But for some reason this works with OP351278 as the query and KJ642619 as the reference. So i guess there is an issue with the OP... sequence.

then:

gofasta variants --msa out3/OP351278.1.fasta -r KJ642619.1 -a KJ642619.gb [12:40:40] Error: bufio.Scanner: token too long

so i tried shortening the buffers - the linewidth for the msa input files:

seqtk seq -U -l 90 out3/OP351278.1.fasta > bla.fa

which reproduces the error:

gofasta variants --msa bla.fa -r KJ642619.1 -a KJ642619.gb [12:43:03] Error: strconv.Atoi: parsing "complement(817": invalid syntax

Going directly from SAM file (gofasta sam variants) raises the same error.

I didnt manage to recreate the "empty fasta" error in this context, so please forget about that, i might have been to tired on Friday.

Then i try:

gofasta snps -q out3/OP351278.1.fasta -r KJ642619.fa [12:45:47] query,SNPs Error: empty fasta file

In this context i do get the empty fasta error; i thought it might have been a similar error as above (buffer length), but that results in (bla.fa from above):

gofasta snps -q bla.fa -r KJ642619.fa [12:49:21] query,SNPs panic: runtime error: index out of range [196546] with length 196546

goroutine 9 [running]: github.com/virus-evolution/gofasta/pkg/snps.getSNPs({0xc000280000, 0x2ffc2, 0x36000?}, 0xc0000202a0?, 0xc000016720?, 0xc000020180?) //gofasta-1.1.0/pkg/snps/snps.go:37 +0x3c9 github.com/virus-evolution/gofasta/pkg/snps.SNPs.func1() //gofasta-1.1.0/pkg/snps/snps.go:194 +0x3d created by github.com/virus-evolution/gofasta/pkg/snps.SNPs /***/gofasta-1.1.0/pkg/snps/snps.go:193 +0x608

This error is independent of the OP... sequence.

On Friday I was toying a multifasta. I couldnt get the topa work with the multifasta (~70 seqs × 200 kb), it gave the error: "Error: sam: sequence/CIGAR length mismatch" around sequence no. 37... but it worked looping through the sequences as separate fasta files.

benjamincjackson commented 2 years ago

: ) okay fun. I will investigate all of this and get back to you.

Ben

benjamincjackson commented 2 years ago

I haven't finished the reverse strand annotation fix yet- I will try to do it by the end of the week.

But I have made a couple of fixes that should help with the other things: I've updated the buffer size so you should be able to read in fasta records with up to a million bases on one line, and I've tried to improve the error handling for the snps routine.

If you want to try out the program you will first need to install go, then you can either run go install github.com/virus-evolution/gofasta@3d60901 (should be built as ~/go/bin/gofasta), or,

git clone https://github.com/virus-evolution/gofasta.git
cd gofasta
git checkout 3d60901
go build -o gofasta

to install gofasta at that commit.

In either case, check that gofasta is version 1.1.1:

❯ ./gofasta --version  # or ~/go/bin/gofasta --version
gofasta version 1.1.1

This is my minimap2 version:

❯ minimap2 --version
2.24-r1122

setup:

efetch -id OP351278 -db nucleotide -format fasta > OP351278.fasta
efetch -id OP351278 -db nucleotide -format gb > OP351278.gb
efetch -id KJ642619 -db nucleotide -format gb > KJ642619.gb
efetch -id KJ642619 -db nucleotide -format fasta > KJ642619.fasta

Both of these work for me now:

❯ minimap2 -a -x asm20 --score-N=0 OP351278.fasta KJ642619.fasta | gofasta sam topa -r OP351278.fasta -o OPKJ
[M::mm_idx_gen::0.008*1.08] collected minimizers
[M::mm_idx_gen::0.012*1.63] sorted minimizers
[M::main::0.012*1.56] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.013*1.53] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.014*1.50] distinct minimizers: 34502 (96.70% are singletons); average occurrences: 1.036; average spacing: 5.519; total length: 197178
[M::worker_pipeline::0.103*1.07] mapped 1 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x asm20 --score-N=0 OP351278.fasta KJ642619.fasta
[M::main] Real time: 0.108 sec; CPU: 0.115 sec; Peak RSS: 0.040 GB
❯ ll OPKJ/
total 784
-rw-r--r--  1 ben  staff  399728  6 Sep 11:54 KJ642619.1.fasta
❯ wc -l OPKJ/KJ642619.1.fasta
       4 OPKJ/KJ642619.1.fasta

and

❯ minimap2 -a -x asm20 --score-N=0 KJ642619.fasta OP351278.fasta | gofasta sam topa -r KJ642619.fasta -o KJOP
[M::mm_idx_gen::0.007*1.30] collected minimizers
[M::mm_idx_gen::0.010*1.87] sorted minimizers
[M::main::0.010*1.87] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.011*1.81] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.012*1.77] distinct minimizers: 34450 (96.62% are singletons); average occurrences: 1.036; average spacing: 5.505; total length: 196546
[M::worker_pipeline::0.102*1.09] mapped 1 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x asm20 --score-N=0 KJ642619.fasta OP351278.fasta
[M::main] Real time: 0.106 sec; CPU: 0.115 sec; Peak RSS: 0.040 GB
❯ ll KJOP
total 784
-rw-r--r--  1 ben  staff  399656  6 Sep 11:57 OP351278.1.fasta
❯ wc -l KJOP/OP351278.1.fasta
       4 KJOP/OP351278.1.fasta

gofasta snps wants the --reference file and the --query alignments to be the same width - I wrote it to get lists of nucleotide differences between the reference and the output of gofasta sam toma very quickly for SARS-CoV-2, so you shouldn't actually run gofasta snps -r KJ642619.fasta -q KJOP/OP351278.1.fasta because after pairwise alignment those two files are different widths (because insertions have been introduced to KJOP/OP351278.1.fasta in a pairwise manner). Previously, 1) gofasta snps didn't like reading such a wide alignment anyway which resulted in one of the errors you report above but also 2) there was no sensible error check for different length alignments anyway: I've fixed this now:

❯ gofasta snps -r KJ642619.fasta  -q KJOP/OP351278.1.fasta
query,SNPs
Error: Reference sequence (196546 bases) and KJ642619.1 (199815 bases) are different lengths

KJ642619.1 is the first record in KJOP/OP351278.1.fasta so this is the error message that gets returned.

Until I fix the annotation reading for features on the reverse strand, if you want a list of snp differences in KJ642619.fasta space, you could use gofasta sam toma, which omits insertions relative to the reference:

❯ minimap2 -a -x asm20 --score-N=0 KJ642619.fasta OP351278.fasta | gofasta sam toma -o aligned.fasta
❯ gofasta snps -r KJ642619.fasta -q aligned.fasta
query,SNPs
OP351278.1,C110T|C186A|C448T...

But I will try to get variants and sam variants fixed asap so that you can annotate mutations properly.

fifthguy commented 2 years ago

Hi,

thanks for that; i also tried with your new code, that error is now gone. Please don't stress with the rest, I went back to my traditional, a bit less, straightforward workaround which uses goes through snpEff (via vcf) for the annotation (although inconvenient) and i'm writing these issues because this repo seems promising.

I can wait for the annotation part to work another time.

Anyway I also tried with the long multifasta, it still breaks around sequence no 31 for some reason, so the problem seems unrelated.

benjamincjackson commented 2 years ago

Ah it's no stress- I find it helpful to hear what doesn't work/what would be useful. It is much easier than trying to write everything in a vacuum...

Feel free to drop the long multifasta here if you like.

benjamincjackson commented 1 year ago

Sorry for the delay.

I hope that I have fixed all these issues by this point:

https://github.com/virus-evolution/gofasta/tree/0442d351e2c5ffa7a561b1e37aee61a6121c004e

So you can run go install github.com/virus-evolution/gofasta@0442d35, or

git clone https://github.com/virus-evolution/gofasta.git
cd gofasta
git checkout 0442d35
go build -o gofasta

to get a working binary.

I will tag a new release (1.2.0) incorporating these changes at some point soon.