shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

taxonkit reformat taxid 0 not found #79

Closed Krasnopeev closed 1 year ago

Krasnopeev commented 1 year ago

Hi there!

I am trying to classify my ASVs after 16S/18S gene sequencing with Kraken2 and get the lineage for each taxid.

Here is my pipeline:

cat kraken2_console_out.tsv \
 | csvtk cut -Ht -f 2,3 \
 | taxonkit reformat -I 2 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}' -r "Unclassified" \
 | csvtk add-header -t -n seq,taxid,kingdom,phylum,class,order,family,genus,species,strain > kraken2_console_out_taxonomy.tsv

kraken2_console_out.tsv looks like:

C   ASV0000001  1959104 199 0:5 338190:5 0:11 338190:2 0:15 1959104:2 338190:9 0:32 1959104:9 0:13 1959104:3 0:23 651137:7 0:10 651137:3 0:4 651137:5 0:7
C   ASV0000002  222543  226 0:2 131567:6 0:1 222543:3 0:6 222543:5 0:169
C   ASV0000003  222543  195 0:2 131567:6 0:1 222543:3 0:6 222543:5 0:138
C   ASV0000004  1959104 199 0:5 338190:5 0:11 338190:2 0:15 1959104:2 338190:9 0:32 1959104:9 0:13 1959104:3 0:23 651137:7 0:10 651137:3 0:4 651137:5 0:7
C   ASV0000005  651137  204 0:28 338190:1 0:27 651137:2 0:8 651137:1 0:15 651137:5 0:17 651137:5 0:12 651137:4 0:42 1959104:1 0:2
C   ASV0000006  651137  204 0:28 338190:1 0:27 651137:2 0:8 651137:1 0:15 651137:5 0:17 651137:5 0:12 651137:4 0:42 1959104:1 0:2
C   ASV0000007  1959104 221 0:5 338190:5 0:11 338190:2 0:15 1959104:2 338190:9 0:32 1959104:9 0:13 1959104:3 0:23 651137:7 0:10 651137:3 0:4 651137:5 0:11 651137:1 0:17
C   ASV0000008  222543  253 0:2 131567:6 0:1 222543:3 0:6 222543:5 0:196
C   ASV0000009  1959104 181 0:5 338190:5 0:11 338190:2 0:15 1959104:2 338190:9 0:32 1959104:9 0:13 1959104:3 0:23 651137:7 0:10 651137:1
U   ASV0000010  0   214 0:10 55601:2 0:162 131567:3 0:3

After the taxonkit reformat step (| taxonkit reformat -I 2 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}' -r "Unclassified" \), I receive the following in the console:

...
14:11:29.082 [WARN] taxid 0 not found
14:11:29.082 [WARN] taxid 0 not found
14:11:29.082 [WARN] taxid 0 not found
14:11:29.082 [WARN] taxid 0 not found
14:11:29.082 [WARN] taxid 0 not found
14:11:29.082 [WARN] taxid 0 not found
14:11:29.082 [WARN] taxid 0 not found
14:11:29.082 [WARN] taxid 0 not found
14:11:29.082 [WARN] taxid 0 not found
14:11:29.082 [WARN] taxid 0 not found
...

OK, but when I try to add the header with | csvtk add-header -t -n seq,taxid,kingdom,phylum,class,order,family,genus,species,strain > kraken2_console_out_taxonomy.tsv

I receive:

[ERRO] record on line 10: wrong number of fields

The easy way is to skip all lines with taxid 0, but I need to keep them for downstream analysis.

How can I do that?

Thanks!

shenwei356 commented 1 year ago

Thanks for using TaxonKit. It now outputs the same number of fields for TaxIds that are not found in the database, and the replacement values for missing entries can be set with -r and -R.

  -r, --miss-rank-repl string          replacement string for missing rank
  -R, --miss-taxid-repl string         replacement string for missing taxid

Examples:

$ echo -ne  "562\n0"  \
    | taxonkit  reformat -I 1 -f '{p}\t{s}' \
    | csvtk pretty -Ht
15:47:49.478 [WARN] taxid 0 not found
562   Proteobacteria   Escherichia coli
0

$ echo -ne  "562\n0" \
    | taxonkit  reformat -I 1 -f '{p}\t{s}' -t -r / -R 0 \
    | csvtk pretty -Ht
15:48:39.860 [WARN] taxid 0 not found
562   Proteobacteria   Escherichia coli   1224   562
0     /                /                  0      0

taxonkit_linux_amd64.tar.gz

Krasnopeev commented 1 year ago

It works! Thanks a lot!
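
For readers who cannot switch to the updated build, one possible workaround is to pad short rows to a fixed column count before csvtk add-header sees them. The sketch below is not part of TaxonKit or csvtk: pad_fields is a hypothetical helper, and it assumes the row for an unknown taxid simply lacks the trailing lineage columns (no trailing tab). The filler string mirrors the -r "Unclassified" value from the question.

```shell
#!/bin/sh
# pad_fields N FILL  (hypothetical helper, not a TaxonKit/csvtk command)
# Reads tab-separated rows on stdin and appends FILL columns until each
# row has at least N fields. Rows with N or more fields pass through unchanged.
pad_fields() {
    awk -F'\t' -v OFS='\t' -v n="$1" -v fill="$2" '{
        # Assigning to a field beyond NF extends the record and
        # rebuilds it using OFS (a tab here).
        for (i = NF + 1; i <= n; i++) $i = fill
        print
    }'
}

# Demo: the second row mimics the short row produced for taxid 0.
printf 'ASV1\t562\tBacteria\tProteobacteria\nASV10\t0\n' \
    | pad_fields 4 Unclassified
```

In the pipeline from the question, the helper would sit between taxonkit reformat and csvtk add-header as | pad_fields 10 Unclassified, since the header names ten columns (seq, taxid, and eight ranks). With the updated TaxonKit and the -r / -R options shown above, this padding step should not be needed.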