shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

When converting 'taxid' into full taxonomy from prot.accession2taxid, the program terminates after reporting an error #55

Closed Neal050617 closed 2 years ago

Neal050617 commented 2 years ago

Prerequisites

taxonkit v0.9.0
go version go1.17.7 linux/amd64

Describe your issue

wget -c https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz

sed '1d' prot.accession2taxid \
    | csvtk cut -t -f 2,3 \
    | taxonkit lineage -i 3 \
    | taxonkit reformat -i 3 -f "{k};{p};{c};{o};{f};{g};{s};{t}" -F -P -S -j 24 \
    | csvtk cut -t -f 1,2,4 \
    | csvtk add-header -t -n accession,taxid,taxonomy \
    > nr.tax

[ERRO] parse error on line 37402419, column 96: bare " in non-quoted-field

However, no quotation marks were detected.

shenwei356 commented 2 years ago

Add the flag -l (--lazy-quotes) to csvtk.

Besides, taxonkit reformat accepts TaxIds as input, so there's no need to run taxonkit lineage before it.

  -I, --taxid-field int                field index of taxid. input data should be tab-separated. it overrides -i/--lineage-field

Also see example 1 on the usage page.

sed 1d prot.accession2taxid \
    | csvtk cut -l -t -f 2,3 \
    | taxonkit reformat -I 2 -f "{k};{p};{c};{o};{f};{g};{s};{t}" -F -P -S \
    | csvtk cut -l -t -f 1,2,3 \
    | csvtk add-header -l -t -n accession,taxid,taxonomy \
    > nr.tax

However, no quotation marks were detected.

They may occur in some taxonomic names. Here they are:

$ taxonkit list --ids 1 -I "" | taxonkit lineage -L -n -r  | grep '"' | more
1906029 Nostoc sp. 'Peltigera sp. "hawaiensis" P1236 cyanobiont'        species
2727889 Pleurocapsales cyanobacterium 'Beach rock 4+5"' species
1920041 Expression vector "pure" split-T7P564   species
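The `bare " in non-quoted-field` message is what a strict CSV parser reports when a double quote appears inside a field that did not start with one, which is why the -l flag helps. As a rough sketch for locating such lines before they reach csvtk, the helper below prints the line number and character position of the first double quote on each line. The name `find_quotes` is hypothetical, and the second sample row reuses one of the names listed above.

```shell
#!/bin/sh
# Hypothetical helper: for every input line containing a double quote, print
# the line number, the 1-based character position of the first quote, and
# the line itself (tab-separated). Reads files given as arguments, or stdin.
find_quotes() {
    awk '{
        pos = index($0, "\"")
        if (pos > 0) printf "%d\t%d\t%s\n", NR, pos, $0
    }' "$@"
}

# Toy input with three tab-separated columns: TaxId, name, rank.
sample=$(mktemp)
printf '9606\tHomo sapiens\tspecies\n' > "$sample"
printf '1920041\tExpression vector "pure" split-T7P564\tspecies\n' >> "$sample"

# Only the second line is reported.
find_quotes "$sample"
rm -f "$sample"
```

In a real pipeline the helper would sit right before the failing csvtk step, for example `... | find_quotes | head`.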

Neal050617 commented 2 years ago

Thanks, Dr. Shen. You wrote such good software, and I'm sorry I didn't follow your tutorial carefully. After switching to the revised script, a new error occurred. Thank you for taking the time.

Here is the new error message:

panic: runtime error: index out of range [0] with length 0

goroutine 124244 [running]:
github.com/shenwei356/taxonkit/taxonkit/cmd.glob..func8.1({0xc03b35d398, 0xc03ada2a00})
    /home/shenwei/shenwei/scripts/go/src/github.com/shenwei356/taxonkit/taxonkit/cmd/reformat.go:434 +0x1dba
github.com/shenwei356/breader.(*BufferedReader).run.func2.1({0x7265746361626f72, {0xc03b5a6800, 0x7265686373455f5f, 0x5f733b6169686369}})
    /home/shenwei/shenwei/scripts/go/pkg/mod/github.com/shenwei356/breader@v0.3.1/BufferedReader.go:177 +0x1bd
created by github.com/shenwei356/breader.(*BufferedReader).run.func2
    /home/shenwei/shenwei/scripts/go/pkg/mod/github.com/shenwei356/breader@v0.3.1/BufferedReader.go:169 +0xee

shenwei356 commented 2 years ago

Yes, it's a bug, but it only occurs when the input contains deleted TaxIds and the flag -F/--fill-miss-rank is given.

You may have used accession2taxid and taxdump files of different versions, so that some TaxIds in the accession2taxid file have already been deleted from the taxdump files.

I've fixed it; please use the binaries below.
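To check whether the two downloads are out of sync, one option is to look for TaxIds from the accession2taxid file that appear in delnodes.dmp, the list of deleted nodes that ships with the taxdump. The following is only a sketch: `deleted_taxids` is a hypothetical helper, the accessions are made up, and it assumes the usual layouts (TaxId in column 3 of accession2taxid after a header row, and `taxid<TAB>|` lines in delnodes.dmp).

```shell
#!/bin/sh
# Hypothetical helper: print the TaxIds (column 3 of an accession2taxid file)
# that are also listed in delnodes.dmp, i.e. deleted from the taxdump.
# Usage: deleted_taxids delnodes.dmp prot.accession2taxid
deleted_taxids() {
    awk -F '\t' '
        NR == FNR { deleted[$1] = 1; next }       # first file: delnodes.dmp
        FNR > 1 && ($3 in deleted) { print $3 }   # second file: skip the header row
    ' "$1" "$2" | sort -un
}

# Toy files with made-up accessions, for illustration only.
tmp=$(mktemp -d)
printf '3\t|\n4\t|\n' > "$tmp/delnodes.dmp"
{
    printf 'accession\taccession.version\ttaxid\tgi\n'
    printf 'ACC0001\tACC0001.1\t3\t0\n'
    printf 'ACC0002\tACC0002.1\t9606\t0\n'
} > "$tmp/toy.accession2taxid"

# Lists TaxId 3, the only one present in both toy files.
deleted_taxids "$tmp/delnodes.dmp" "$tmp/toy.accession2taxid"
rm -rf "$tmp"
```

If the real files produce any output here, downloading both from the same date should make the mismatch go away.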

Neal050617 commented 2 years ago

It worked! Thanks a lot.

SergeyBaikal commented 1 year ago

Dear developers! Could you please confirm whether the commands below are correct? I also had an error, but it disappeared after adding -l. My goal is to count the unique ranks. The input file has just a single taxon column.

taxonkit lineage taxid.txt | awk '$2!=""' > lineage.txt
taxonkit reformat lineage.txt | tee lineage.txt.reformat
cut -f 1,3 lineage.txt.reformat

cat lineage.txt \
    | taxonkit reformat  -I 1 -F -f "{f}"\
    | csvtk -l -H -t cut -f 1,3 \
    | csvtk -H -t sep -f 2 -s ';' -R \
    | csvtk add-header -t -n taxid,family\
    | csvtk -t csv2tab  > Family.txt

awk '{$1=""}1' Family.txt | awk '{$1=$1}1' > Family_1col.txt           

cat Family_1col.txt | sort | uniq -c | sort -rn > unic_Family_all.txt

shenwei356 commented 1 year ago

Hi, I'd recommend using the commands below:

$ cat taxid.txt \
    | taxonkit reformat -I 1 -f '{f}' \
    | awk '$2!=""' \
    | csvtk freq -Ht -f 2 -nr

22:24:14.716 [WARN] taxid 123124124 not found
22:24:14.716 [WARN] taxid 3 was deleted
22:24:14.716 [WARN] taxid 92489 was merged into 796334
Akkermansiaceae 2
Bovidae 1
Comamonadaceae  1
Erwiniaceae     1
Francisellaceae 1
Hominidae       1
Retroviridae    1
Siphoviridae    1
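For readers without csvtk at hand, the filter-and-count step can be approximated with awk, sort and uniq. This is a sketch, not a drop-in replacement for `csvtk freq`: `count_by_column2` is a hypothetical helper that expects the two-column, tab-separated output of the reformat step, and the toy rows below are typed in by hand rather than produced by taxonkit.

```shell
#!/bin/sh
# Hypothetical helper: drop rows whose second column is empty, then count the
# remaining values, most frequent first (ties broken alphabetically).
# Output: value<TAB>count. Reads files given as arguments, or stdin.
count_by_column2() {
    awk -F '\t' '$2 != "" { print $2 }' "$@" \
        | LC_ALL=C sort \
        | uniq -c \
        | LC_ALL=C sort -k1,1nr -k2,2 \
        | awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print $0 "\t" n }'
}

# Toy input: TaxId and family; the last row has no family.
sample=$(mktemp)
{
    printf '349741\tAkkermansiaceae\n'
    printf '9913\tBovidae\n'
    printf '9606\tHominidae\n'
    printf '239935\tAkkermansiaceae\n'
    printf '123124124\t\n'
} > "$sample"

count_by_column2 "$sample"
rm -f "$sample"
```

Splitting on tabs explicitly keeps a row with an empty family from being confused with a row whose name contains spaces.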

SergeyBaikal commented 1 year ago

Thank you! Well done. Now it is much better than it was before!

SergeyBaikal commented 1 year ago

Why does the program find only 12 taxa out of 14? What needs to be updated?

137758
137758
64279
64279
137758
1955153
2584979
1673646
103782
137758
2093224
1408133
291286
2786748

Potyviridae 5
Dicistroviridae 2
Closteroviridae 1
Cystoviridae 1
Endornaviridae 1
Nodaviridae 1
Picobirnaviridae 1

shenwei356 commented 1 year ago

It's easy to explain: the lineages of some TaxIds have changed. See https://github.com/shenwei356/taxid-changelog/. You can check the change history of the TaxIds above with the command below (attached result: taxid.log.tsv.gz).

csvtk grep -f taxid -P taxid.txt taxid-changelog.csv.gz > taxid.log.tsv

So the result could change when using a different version of the NCBI taxdump file.
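The seven counts in the question add up to 12, so two of the 14 input rows presumably ended up with an empty family and were dropped by the `awk '$2!=""'` filter. A quick way to see which TaxIds those are is to run the reformat output through a small summary before filtering. This is a sketch only: `summarize_hits` is a hypothetical helper, and the toy rows use placeholder TaxIds and family names.

```shell
#!/bin/sh
# Hypothetical helper: given tab-separated rows of TaxId and family, report
# how many rows have a family and list the TaxIds that do not.
# Reads files given as arguments, or stdin.
summarize_hits() {
    awk -F '\t' '
        $2 == "" { missing[++m] = $1; next }   # remember TaxIds without a family
        { found++ }
        END {
            printf "with family: %d, without: %d\n", found, m
            for (i = 1; i <= m; i++) print missing[i]
        }
    ' "$@"
}

# Toy input with placeholder values: the second row has no family.
printf '1001\tFamilyA\n1002\t\n1003\tFamilyB\n' | summarize_hits
```

The TaxIds printed after the summary line are the ones worth looking up in the changelog.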

SergeyBaikal commented 1 year ago

shenwei356 Thanks a lot!