Closed Neal050617 closed 2 years ago
Add the flag -l (--lazy-quotes) to csvtk.
Besides, taxonkit reformat accepts TaxIds as input, so there's no need to run taxonkit lineage before it.
-I, --taxid-field int field index of taxid. input data should be tab-separated. it overrides -i/--lineage-field
Also see example 1 on the usage page.
sed 1d prot.accession2taxid \
| csvtk cut -l -t -f 2,3 \
| taxonkit reformat -I 2 -f "{k};{p};{c};{o};{f};{g};{s};{t}" -F -P -S \
| csvtk cut -l -t -f 1,2,3 \
| csvtk add-header -l -t -n accession,taxid,taxonomy \
> nr.tax
The quotation marks are probably in some taxonomic names. Here they are:
$ taxonkit list --ids 1 -I "" | taxonkit lineage -L -n -r | grep '"' | more
1906029 Nostoc sp. 'Peltigera sp. "hawaiensis" P1236 cyanobiont' species
2727889 Pleurocapsales cyanobacterium 'Beach rock 4+5"' species
1920041 Expression vector "pure" split-T7P564 species
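To confirm where the quotation marks sit in your own input, something like the following may help. It is only a sketch using plain awk: the function name find_quotes and the two-line demo file are made up for illustration, and the accession values are placeholders.

```shell
# find_quotes FILE
# Print "line_number<TAB>line" for every line that contains a double quote.
# (Hypothetical helper name, not part of csvtk or taxonkit.)
find_quotes() {
    awk 'index($0, "\"") { printf "%d\t%s\n", NR, $0 }' "$1"
}

# Hypothetical demo input: placeholder accessions, one name with quotes.
demo=$(mktemp)
printf 'acc1\t9606\tHomo sapiens\n' > "$demo"
printf 'acc2\t1920041\tExpression vector "pure" split-T7P564\n' >> "$demo"

find_quotes "$demo"
```

On a real file, sed -n '37402419p' would print the exact line named in the parse error.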
Thanks, Dr. Shen, you wrote such good software, and sorry I didn't follow your tutorial carefully. After using the revised script, a new error message occurred. Thank you for taking the time.
Here is the new error message:

panic: runtime error: index out of range [0] with length 0

goroutine 124244 [running]:
github.com/shenwei356/taxonkit/taxonkit/cmd.glob..func8.1({0xc03b35d398, 0xc03ada2a00})
	/home/shenwei/shenwei/scripts/go/src/github.com/shenwei356/taxonkit/taxonkit/cmd/reformat.go:434 +0x1dba
github.com/shenwei356/breader.(BufferedReader).run.func2.1({0x7265746361626f72, {0xc03b5a6800, 0x7265686373455f5f, 0x5f733b6169686369}})
	/home/shenwei/shenwei/scripts/go/pkg/mod/github.com/shenwei356/breader@v0.3.1/BufferedReader.go:177 +0x1bd
created by github.com/shenwei356/breader.(BufferedReader).run.func2
	/home/shenwei/shenwei/scripts/go/pkg/mod/github.com/shenwei356/breader@v0.3.1/BufferedReader.go:169 +0xee
Yes, it's a bug, but it only occurs when the input contains deleted TaxIds and the flag -F/--fill-miss-rank is used.
You may have used accession2taxid and taxonomy taxdump files that do not match (different versions), with some TaxIds in the accession2taxid file having been deleted in the taxdump files.
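A quick way to test for such a mismatch is to look up the TaxIds of the accession2taxid file in delnodes.dmp from the taxdump. The sketch below assumes the usual layouts (delnodes.dmp as "taxid<TAB>|", accession2taxid with a header line and the TaxId in column 3); the function name and the demo rows are hypothetical.

```shell
# deleted_taxids_in_use DELNODES ACC2TAXID
# Print the TaxIds found in column 3 of ACC2TAXID that are listed in DELNODES.
# (Hypothetical helper; assumes DELNODES is non-empty.)
deleted_taxids_in_use() {
    awk -F'\t' '
        NR == FNR { deleted[$1] = 1; next }      # first file: delnodes.dmp
        FNR > 1 && ($3 in deleted) { print $3 }  # second file: skip header
    ' "$1" "$2" | sort -u
}

# Hypothetical demo files with placeholder accessions.
delnodes=$(mktemp)
acc2taxid=$(mktemp)
printf '3\t|\n' > "$delnodes"
printf 'accession\taccession.version\ttaxid\tgi\n' > "$acc2taxid"
printf 'ACC1\tACC1.1\t9606\t0\n' >> "$acc2taxid"
printf 'ACC2\tACC2.1\t3\t0\n' >> "$acc2taxid"

deleted_taxids_in_use "$delnodes" "$acc2taxid"
```

Any TaxId printed here is present in the accession2taxid file but deleted in the taxdump, which suggests the two downloads come from different dates.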
I've fixed it, please use the binaries below.
It worked! Thanks a lot.
Dear developers! Could you please clarify whether the following is correct? I also had an error, but after adding -l it disappeared. My goal is to count the unique ranks. My input file has only a taxon column.
taxonkit lineage taxid.txt | awk '$2!=""' > lineage.txt
taxonkit reformat lineage.txt | tee lineage.txt.reformat
cut -f 1,3 lineage.txt.reformat
cat lineage.txt \
| taxonkit reformat -I 1 -F -f "{f}"\
| csvtk -l -H -t cut -f 1,3 \
| csvtk -H -t sep -f 2 -s ';' -R \
| csvtk add-header -t -n taxid,family\
| csvtk -t csv2tab > Family.txt
awk '{$1=""}1' Family.txt | awk '{$1=$1}1' > Family_1col.txt
cat Family_1col.txt | sort | uniq -c | sort -rn > unic_Family_all.txt
Hi, I'd recommend using the commands below:
$ cat taxid.txt \
| taxonkit reformat -I 1 -f '{f}' \
| awk '$2!=""' \
| csvtk freq -Ht -f 2 -nr
22:24:14.716 [WARN] taxid 123124124 not found
22:24:14.716 [WARN] taxid 3 was deleted
22:24:14.716 [WARN] taxid 92489 was merged into 796334
Akkermansiaceae 2
Bovidae 1
Comamonadaceae 1
Erwiniaceae 1
Francisellaceae 1
Hominidae 1
Retroviridae 1
Siphoviridae 1
Thank you! Well done. Now it is much better than it was before!
Why does the program find only 12 taxa out of 14? What needs to be updated?
137758
137758
64279
64279
137758
1955153
2584979
1673646
103782
137758
2093224
1408133
291286
2786748
Potyviridae 5
Dicistroviridae 2
Closteroviridae 1
Cystoviridae 1
Endornaviridae 1
Nodaviridae 1
Picobirnaviridae 1
It's easy to explain: the lineages of some TaxIds have changed. See https://github.com/shenwei356/taxid-changelog/. You can check the changes of the TaxIds above: taxid.log.tsv.gz
csvtk grep -f taxid -P taxid.txt taxid-changelog.csv.gz > taxid.log.tsv
So the result could change when using a different version of the NCBI taxdump file.
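To see which TaxIds drop out of the count, it can help to list the rows where the family column comes back empty before filtering them away. This is a minimal sketch that assumes a two-column, tab-separated file such as the output of taxonkit reformat -I 1 -f '{f}'; the function name, the TaxIds 111/222/333 and the name FamilyA are placeholders, not real data.

```shell
# missing_family FILE
# Print the TaxIds (column 1) whose family (column 2) is empty.
# (Hypothetical helper name.)
missing_family() {
    awk -F'\t' '$2 == "" { print $1 }' "$1"
}

# Hypothetical demo input: placeholder TaxIds and family names.
reformatted=$(mktemp)
printf '111\tFamilyA\n' > "$reformatted"
printf '222\t\n' >> "$reformatted"
printf '333\tFamilyA\n' >> "$reformatted"

missing_family "$reformatted"
```

The TaxIds printed this way are the ones to look up in the changelog.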
shenwei356 Thanks a lot!
Prerequisites
taxonkit v0.9.0
go version go1.17.7 linux/amd64
Describe your issue
wget -c https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
sed '1d' prot.accession2taxid \
| csvtk cut -t -f 2,3 \
| taxonkit lineage -i 3 \
| taxonkit reformat -i 3 -f "{k};{p};{c};{o};{f};{g};{s};{t}" -F -P -S -j 24 \
| csvtk cut -t -f 1,2,4 \
| csvtk add-header -t -n accession,taxid,taxonomy > nr.tax
[ERRO] parse error on line 37402419, column 96: bare " in non-quoted-field
However, I did not find any quotation marks in the input.