Open fifthguy opened 2 years ago
Hi Tomaž,
Sorry about the reverse strand thing, I will start working on a fix for this. Everything has been so SARS-CoV-2-centric...
Do you have example commands (and files) for the empty fasta file error?
Hi, don't worry, I guess this is what the issues sections are for. You needed the code to do something, I need it to do something else. Thank you for the effort in the first place, as well as for this discussion.
I was working on some other files on Friday for recreation I toyed genbank files today to recreate. The second point seems to be more complicated (as in different errors, but cannot recreate the one I wrote about... for some reason)
first efetch the fasta and gb files for OP351278 and KJ642619.
efetch -id ID -db nucleotide -format fasta/gb > ID.fa/gb
minimap2 -a -x asm20 --score-N=0 OP351278.fa KJ642619.fa | gofasta sam topa -r OP351278.fa -o out3 [12:35:19] [M::mm_idx_gen::0.0131.11] collected minimizers [M::mm_idx_gen::0.0201.66] sorted minimizers [M::main::0.0201.66] loaded/built the index for 1 target sequence(s) [M::mm_mapopt_update::0.0211.62] mid_occ = 100 [M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 0; #seq: 1 [M::mm_idx_stat::0.023*1.58] distinct minimizers: 34502 (96.70% are singletons); average occurrences: 1.036; average spacing: 5.519 panic: runtime error: index out of range [0] with length 0
goroutine 19 [running]: github.com/virus-evolution/gofasta/pkg/fastaio.ReadAlignment({0x6412a8?, 0xc0001980b8?}, 0x0?, 0x0?, 0x0?) //gofasta-1.1.0/pkg/fastaio/fastaio.go:233 +0x6f9 created by github.com/virus-evolution/gofasta/pkg/sam.ToPairAlign //gofasta-1.1.0/pkg/sam/topa.go:407 +0x1ac
But for some reason this works with OP351278 as the query and KJ642619 as the reference. So i guess there is an issue with the OP... sequence.
then:
gofasta variants --msa out3/OP351278.1.fasta -r KJ642619.1 -a KJ642619.gb [12:40:40] Error: bufio.Scanner: token too long
so i tried shortening the buffers - the linewidth for the msa input files:
seqtk seq -U -l 90 out3/OP351278.1.fasta > bla.fa
which reproduces the error:
gofasta variants --msa bla.fa -r KJ642619.1 -a KJ642619.gb [12:43:03] Error: strconv.Atoi: parsing "complement(817": invalid syntax
Going directly from SAM file (gofasta sam variants) raises the same error.
I didnt manage to recreate the "empty fasta" error in this context, so please forget about that, i might have been to tired on Friday.
Then i try:
gofasta snps -q out3/OP351278.1.fasta -r KJ642619.fa [12:45:47] query,SNPs Error: empty fasta file
In this context i do get the empty fasta error; i thought it might have been a similar error as above (buffer length), but that results in (bla.fa from above):
gofasta snps -q bla.fa -r KJ642619.fa [12:49:21] query,SNPs panic: runtime error: index out of range [196546] with length 196546
goroutine 9 [running]: github.com/virus-evolution/gofasta/pkg/snps.getSNPs({0xc000280000, 0x2ffc2, 0x36000?}, 0xc0000202a0?, 0xc000016720?, 0xc000020180?) //gofasta-1.1.0/pkg/snps/snps.go:37 +0x3c9 github.com/virus-evolution/gofasta/pkg/snps.SNPs.func1() //gofasta-1.1.0/pkg/snps/snps.go:194 +0x3d created by github.com/virus-evolution/gofasta/pkg/snps.SNPs /***/gofasta-1.1.0/pkg/snps/snps.go:193 +0x608
This error is independent of the OP... sequence.
On Friday I was toying a multifasta. I couldnt get the topa work with the multifasta (~70 seqs × 200 kb), it gave the error: "Error: sam: sequence/CIGAR length mismatch" around sequence no. 37... but it worked looping through the sequences as separate fasta files.
: ) okay fun. I will investigate all of this and get back to you.
Ben
I haven't finished the reverse strand annotation fix yet- I will try to do it by the end of the week.
But I have made a couple of fixes that should help with the other things: I've updated the buffer size so you should be able to read in fasta records with up to a million bases on one line, and I've tried to improve the error handling for the snps
routine.
If you want to try out the program you will first need to install go, then you can either run go install github.com/virus-evolution/gofasta@3d60901
(should be built as ~/go/bin/gofasta
), or,
git clone https://github.com/virus-evolution/gofasta.git
cd gofasta
git checkout 3d60901
go build -o gofasta
to install gofasta at that commit.
In either case, check that gofasta is version 1.1.1:
❯ ./gofasta --version # or ~/go/bin/gofasta --version
gofasta version 1.1.1
This is my minimap2 version:
❯ minimap2 --version
2.24-r1122
setup:
efetch -id OP351278 -db nucleotide -format fasta > OP351278.fasta
efetch -id OP351278 -db nucleotide -format gb > OP351278.gb
efetch -id KJ642619 -db nucleotide -format gb > KJ642619.gb
efetch -id KJ642619 -db nucleotide -format fasta > KJ642619.fasta
Both of these work for me now:
❯ minimap2 -a -x asm20 --score-N=0 OP351278.fasta KJ642619.fasta | gofasta sam topa -r OP351278.fasta -o OPKJ
[M::mm_idx_gen::0.008*1.08] collected minimizers
[M::mm_idx_gen::0.012*1.63] sorted minimizers
[M::main::0.012*1.56] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.013*1.53] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.014*1.50] distinct minimizers: 34502 (96.70% are singletons); average occurrences: 1.036; average spacing: 5.519; total length: 197178
[M::worker_pipeline::0.103*1.07] mapped 1 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x asm20 --score-N=0 OP351278.fasta KJ642619.fasta
[M::main] Real time: 0.108 sec; CPU: 0.115 sec; Peak RSS: 0.040 GB
❯ ll OPKJ/
total 784
-rw-r--r-- 1 ben staff 399728 6 Sep 11:54 KJ642619.1.fasta
❯ wc -l OPKJ/KJ642619.1.fasta
4 OPKJ/KJ642619.1.fasta
and
❯ minimap2 -a -x asm20 --score-N=0 KJ642619.fasta OP351278.fasta | gofasta sam topa -r KJ642619.fasta -o KJOP
[M::mm_idx_gen::0.007*1.30] collected minimizers
[M::mm_idx_gen::0.010*1.87] sorted minimizers
[M::main::0.010*1.87] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.011*1.81] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.012*1.77] distinct minimizers: 34450 (96.62% are singletons); average occurrences: 1.036; average spacing: 5.505; total length: 196546
[M::worker_pipeline::0.102*1.09] mapped 1 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x asm20 --score-N=0 KJ642619.fasta OP351278.fasta
[M::main] Real time: 0.106 sec; CPU: 0.115 sec; Peak RSS: 0.040 GB
❯ ll KJOP
total 784
-rw-r--r-- 1 ben staff 399656 6 Sep 11:57 OP351278.1.fasta
❯ wc -l KJOP/OP351278.1.fasta
4 KJOP/OP351278.1.fasta
gofasta snps
wants the --reference
file and the --query
alignments to be the same width - I wrote it to get lists of nucleotide differences between the reference and the output of gofasta sam toma
very quickly for SARS-CoV-2, so you shouldn't actually run gofasta snps -r KJ642619.fasta -q KJOP/OP351278.1.fasta
because after pairwise alignment those two files are different widths (because insertions have been introduced to KJOP/OP351278.1.fasta in a pairwise manner). Previously, 1) gofasta snps
didn't like reading such a wide alignment anyway which resulted in one of the errors you report above but also 2) there was no sensible error check for different length alignments anyway: I've fixed this now:
❯ gofasta snps -r KJ642619.fasta -q KJOP/OP351278.1.fasta
query,SNPs
Error: Reference sequence (196546 bases) and KJ642619.1 (199815 bases) are different lengths
KJ642619.1 is the first record in KJOP/OP351278.1.fasta
so this is the error message that gets returned.
Until I fix the annotation reading for features on the reverse strand, if you want a list of snp differences in KJ642619.fasta space, you could use gofasta sam toma
, which omits insertions relative to the reference:
❯ minimap2 -a -x asm20 --score-N=0 KJ642619.fasta OP351278.fasta | gofasta sam toma -o aligned.fasta
❯ gofasta snps -r KJ642619.fasta -q aligned.fasta
query,SNPs
OP351278.1,C110T|C186A|C448T...
But I will try to get variants
and sam variants
fixed asap so that you can annotate mutations properly.
Hi,
thanks for that; i also tried with your new code, that error is now gone. Please don't stress with the rest, I went back to my traditional, a bit less, straightforward workaround which uses goes through snpEff (via vcf) for the annotation (although inconvenient) and i'm writing these issues because this repo seems promising.
I can wait for the annotation part to work another time.
Anyway I also tried with the long multifasta, it still breaks around sequence no 31 for some reason, so the problem seems unrelated.
Ah it's no stress- I find it helpful to hear what doesn't work/what would be useful. It is much easier than trying to write everything in a vacuum...
Feel free to drop the long multifasta here if you like.
Sorry for the delay.
I hope that I have fixed all these issues by this point:
https://github.com/virus-evolution/gofasta/tree/0442d351e2c5ffa7a561b1e37aee61a6121c004e
So you can run go install github.com/virus-evolution/gofasta@0442d35
, or
git clone https://github.com/virus-evolution/gofasta.git
cd gofasta
git checkout 0442d35
go build -o gofasta
to get a working binary.
I will tag a new release (1.2.0
) incorporating these changes at some point soon.
Error: strconv.Atoi: parsing "complement(3956": invalid syntax