shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

format options in lineage command #6

Closed by tolot27 6 years ago

tolot27 commented 6 years ago

It would be great if the lineage command supported the same format options as the reformat command. That would avoid a second pipe and save processing time, especially for large datasets.

tolot27 commented 6 years ago

The default format string could be {a} for all or {F} for full lineage.

shenwei356 commented 6 years ago

Hi Mathias, I've considered this before, but I think it's better to keep reformat as an independent command so that the commands stay modular.

As for speed, the slowest part is parsing the names.dmp and nodes.dmp files, so there should be no big difference for large datasets (a taxid list file?).

tolot27 commented 6 years ago

You are right about the speed. But at least one additional column is added to the output.

Anyway, it's your decision.

shenwei356 commented 6 years ago

Ha ha, you can easily discard the original lineage column using csvtk (https://github.com/shenwei356/csvtk).

For example, csvtk cut -f -5 data.csv removes the 5th column, and csvtk cut -t -f -col5 data.tsv discards the column named col5.
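The same column-dropping step can also be sketched with the standard cut utility, for readers who do not have csvtk installed. This is only an illustration under stated assumptions: the file names sample.tsv and trimmed.tsv are hypothetical, the rows are made-up placeholders rather than real taxonkit output, and the three-column layout (taxid, original lineage, reformatted lineage) is assumed.

```shell
# Hypothetical three-column TSV: taxid, original lineage, reformatted lineage.
# The values are made-up placeholders, not real taxonkit output.
printf '1001\trootA;cladeB;genusC;speciesD\tcladeB;genusC;speciesD\n' > sample.tsv
printf '1002\trootA;cladeE;genusF;speciesG\tcladeE;genusF;speciesG\n' >> sample.tsv

# Drop column 2 (the original lineage) by keeping only columns 1 and 3.
# cut splits on tabs by default, so no delimiter flag is needed for TSV input.
cut -f 1,3 sample.tsv > trimmed.tsv

# Show the trimmed result: taxid and reformatted lineage only.
cat trimmed.tsv
```

If the real output has the assumed three-column layout, the same cut -f 1,3 step could be appended to the end of an existing pipeline instead of writing intermediate files.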

tolot27 commented 6 years ago

I know how to use cut or csvtk. It is just an additional pipe/process and more stuff to maintain in the scripts.