shenwei356 / csvtk

A cross-platform, efficient and practical CSV/TSV toolkit in Golang
http://bioinf.shenwei.me/csvtk
MIT License

ERRO #250

Closed zhangkaihui1 closed 11 months ago

zhangkaihui1 commented 11 months ago

Prerequisites

Dear teacher, I need to decontaminate my raw sequencing data. I aligned the reads against the NT database with blastn, but the output only contains sequence IDs. I need the species names so that I can extract the sequences I want. I therefore used TaxonKit to obtain the taxids and species names from NCBI for the family I am studying, which produced three files: E.taxid.txt (17 KB), E.taxid.name.txt (71 KB), and E.accession.version_taxid.txt (14 MB). Now I need to match each ID in the blastn result to its species name, so I ran the following command:

./csvtk add-header -t --names "qseqid,sseqid,pident,qlen,length,mismatch,gapopen,qstart,qend,sstart,send,slen,nident,evalue,bitscore,qcovhsp" /data/zhangkh/E/PB/Data/C36/result \
    | ./csvtk join -t -f "sseqid; accession.version" -L --na "-" - ./E.accession.version_taxid.txt \
    | ./csvtk -t join -f taxid -L --na "-" -  txt > c36_taxid_name.txt

But the following error messages are displayed:

[ERRO] number of fields (12) and new colnames (16) do not match
[WARN] csvtk join: skipping empty input file: -
[ERRO] column " accession.version" not existed in file: ./Chordata.accession.version_taxid.txt
[WARN] csvtk join: skipping empty input file: -

Do you have time to help me see what the problem is and how to solve it?

shenwei356 commented 11 months ago

It's a common task, see here: Add taxonomy information to BLAST result

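The error messages are quite straightforward: the number of fields in the input (12) does not match the number of column names given (16), and the column name passed to -f contains an extra space (" accession.version").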
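Both problems can be checked before running the pipeline. The following is only a minimal sketch: the sample line is a hypothetical stand-in for the real BLAST result, laid out with 12 tab-separated columns, which is also the column count of blastn's default tabular output (-outfmt 6). If the search was run with the default format, that would explain the 12 fields reported in the error.

```shell
# Hypothetical one-line sample with 12 tab-separated columns
# (stand-in for the real BLAST result file).
result=$(mktemp)
printf 'q1\tNC_000001.1\t99.1\t150\t1\t0\t1\t150\t1001\t1150\t1e-70\t270\n' > "$result"

# The 16 names passed to `csvtk add-header --names` in the failing command.
names="qseqid,sseqid,pident,qlen,length,mismatch,gapopen,qstart,qend,sstart,send,slen,nident,evalue,bitscore,qcovhsp"

# Count tab-separated fields in the first line and entries in the name list.
n_fields=$(head -n 1 "$result" | awk -F'\t' '{print NF}')
n_names=$(printf '%s\n' "$names" | awk -F',' '{print NF}')
echo "fields=$n_fields names=$n_names"

# Strip whitespace around the ';' separator in the join field spec,
# so the second column name does not start with a space.
spec="sseqid; accession.version"
clean=$(printf '%s' "$spec" | sed 's/[[:space:]]*;[[:space:]]*/;/g')
echo "join fields: $clean"
```

If the two counts differ, either trim the name list to the columns actually present in the file, or re-run blastn with an -outfmt string that requests all 16 columns.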

A piece of advice: for large input data, csvtk replace is a better choice than csvtk join, because the latter uses a lot of memory.
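A rough sketch of the csvtk replace approach follows. The file names and contents are hypothetical, and it assumes csvtk is on PATH. It copies sseqid into a new taxid column with csvtk mutate, then swaps each value in that column for its mapped value from a two-column, tab-delimited key-value file (accession, taxid) with no header line.

```shell
# Sketch only: the files below are hypothetical stand-ins for the real data.
work=$(mktemp -d)
printf 'qseqid\tsseqid\nq1\tNC_000001.1\nq2\tXX_999999.1\n' > "$work/blast.tsv"
printf 'NC_000001.1\t9606\n' > "$work/acc2taxid.tsv"   # columns: key, value

if command -v csvtk >/dev/null 2>&1; then
    # Copy sseqid into a new column named taxid, then replace every value in
    # that column with the value mapped to it in the key-value file.
    # Keys missing from the file are replaced with "-".
    csvtk mutate -t -f sseqid -n taxid "$work/blast.tsv" \
        | csvtk replace -t -f taxid -p '(.+)' -r '{kv}' \
              -k "$work/acc2taxid.tsv" --key-miss-repl '-' \
        > "$work/blast_taxid.tsv"
    cat "$work/blast_taxid.tsv"
else
    echo "csvtk not found on PATH; skipping" >&2
fi
```

The same pattern could be repeated with a taxid-to-name file to add the species name as a further column.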