Closed shenwei356 closed 2 years ago
The GTDB mode (--gtdb) is compatible, with no changes from the previous version.
It can now also better handle the ICTV taxonomy (https://github.com/shenwei356/ictv-taxdump/issues/1).
Usage
Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV
Input format:
0. For a GTDB taxonomy file, just use --gtdb.
The numeric assembly accession (without the GCA_/GCF_ prefix and the
version number) is used as the taxon at the subspecies rank.
1. The input file should be tab-delimited; at least one column is needed.
2. Ranks can be given either via the first row or the flag --rank-names.
3. A column containing the genome/assembly accession is recommended, so that
the TaxId mapping file (taxid.map, id -> taxid) can be generated.
-A/--field-accession, field containing the genome/assembly accession
--field-accession-re, regular expression to extract the accession
Note that multiple TaxIds pointing to the same accession are listed as
comma-separated integers.
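As an illustration of the accession handling described above, here is a minimal Python sketch that applies the two default patterns quoted in the flag descriptions (`--field-accession-re` and `--gtdb-re-subs`) to the GTDB accession used as an example in this post. It assumes the doubled backslashes in the help text stand for single ones, and that the accession and the lineage are separated by a tab. The helper names `extract` and `split_gtdb_line` are hypothetical and not part of taxonkit.

```python
import re

# Default patterns copied from the flag descriptions, written as raw strings.
# Assumption: "\\w" in the help output stands for a single "\w".
FIELD_ACCESSION_RE = re.compile(r"^\w\w_(.+)$")        # --field-accession-re
GTDB_RE_SUBS = re.compile(r"^\w\w_GC[AF]_(.+)\.\d+$")  # --gtdb-re-subs


def extract(pattern, value):
    """Return the first capture group, or None if the pattern does not match."""
    m = pattern.match(value)
    return m.group(1) if m else None


def split_gtdb_line(line):
    """Split one tab-delimited GTDB line into (accession, list of taxon names)."""
    accession, lineage = line.rstrip("\n").split("\t", 1)
    return accession, lineage.split(";")


line = ("GB_GCA_018897955.1\t"
        "d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;"
        "f__LFW-46;g__LFW-46;s__LFW-46 sp018897155")
accession, taxa = split_gtdb_line(line)

print(extract(FIELD_ACCESSION_RE, accession))  # GCA_018897955.1
print(extract(GTDB_RE_SUBS, accession))        # 018897955
print(len(taxa))                               # 7
```

The second pattern drops both the `GCA_`/`GCF_` prefix and the version suffix, which matches the description of the numeric accession used at the subspecies rank.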
Attention:
1. Names should be distinct among taxa of different ranks.
But for lineages missing some taxon nodes, using the names of parent nodes is allowed:
GB_GCA_018897955.1 d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155
It can also detect duplicate names at different ranks, e.g.,
the Class and Genus have the same name B47-G6, while the Order and Family
between them have different names. In this case, we assign a new TaxId
by incrementing the TaxId until it is distinct.
GB_GCA_003663585.1 d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585
2. Taxa from different parents may have the same name.
We will assign different TaxIds to them.
E.g., in ICTV, many viruses from different species have the same names.
In practice, we set the "Virus name(s)" column as a sub-species rank and also
specify it as the accession.
Species              Virus name(s)
Jerseyvirus SETP3    Salmonella phage SETP7
Jerseyvirus SETP7    Salmonella phage SETP7
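The two rules above (bump the TaxId on a clash, and give same-named taxa under different parents their own TaxIds) can be sketched roughly as follows. This is not taxonkit's implementation: `candidate_taxid` is a hypothetical stand-in built on CRC32, and identifying a node by (parent TaxId, rank, name) is an assumption made for this example.

```python
import zlib


def candidate_taxid(name):
    # Hypothetical stand-in for however the tool derives a TaxId from a name.
    return zlib.crc32(name.encode("utf-8"))


class TaxIdAssigner:
    def __init__(self):
        self.nodes = {}    # (parent TaxId, rank, name) -> TaxId
        self.used = set()  # every TaxId handed out so far

    def add_lineage(self, lineage):
        """lineage: list of (rank, name) pairs ordered from root to leaf."""
        parent = None  # the first node of a lineage has no parent in this sketch
        taxids = []
        for rank, name in lineage:
            key = (parent, rank, name)
            if key not in self.nodes:
                taxid = candidate_taxid(name)
                while taxid in self.used:  # increment until the TaxId is distinct
                    taxid += 1
                self.nodes[key] = taxid
                self.used.add(taxid)
            parent = self.nodes[key]
            taxids.append(parent)
        return taxids


assigner = TaxIdAssigner()

# Class and Genus share the name B47-G6 but end up with different TaxIds.
gtdb = [("phylum", "Thermoplasmatota"), ("class", "B47-G6"), ("order", "B47-G6B"),
        ("family", "47-G6"), ("genus", "B47-G6")]
ids = assigner.add_lineage(gtdb)
print(ids[1] != ids[4])  # True

# Same virus name under two different species: two different TaxIds.
a = assigner.add_lineage([("species", "Jerseyvirus SETP3"),
                          ("virus name", "Salmonella phage SETP7")])
b = assigner.add_lineage([("species", "Jerseyvirus SETP7"),
                          ("virus name", "Salmonella phage SETP7")])
print(a[1] != b[1])  # True
```

If the virus name also serves as the accession, both TaxIds would map to the same accession, which is the situation where taxid.map lists them as comma-separated integers.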
Usage:
taxonkit create-taxdump [flags]
Flags:
-A, --field-accession int         field index of assembly accession (genome ID), for outputting taxid.map
    --field-accession-re string   regular expression to extract assembly accession (default "^\\w\\w_(.+)$")
    --force                       overwrite existing output directory
    --gtdb                        input files are GTDB taxonomy files
    --gtdb-re-subs string         regular expression to extract assembly accession as the subspecies (default "^\\w\\w_GC[AF]_(.+)\\.\\d+$")
-h, --help                        help for create-taxdump
    --line-chunk-size int         number of lines to process for each thread, and 4 threads is fast enough. (default 5000)
    --null strings                null value of taxa (default [,NULL,NA])
-x, --old-taxdump-dir string      taxdump directory of the previous version, for generating merged.dmp and delnodes.dmp
-O, --out-dir string              output directory
-R, --rank-names strings          names of all ranks, leave it empty to use the first row of input as rank names
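To show how the flags above might be combined, here is a small Python sketch that only assembles a command line and does not run it. The function name and the file and directory names are hypothetical. Passing the input file as a positional argument is an assumption, since the usage line only shows `[flags]`.

```python
import shlex


def create_taxdump_command(input_file, out_dir, gtdb=False, force=False,
                           old_taxdump_dir=None):
    """Build an argument list for `taxonkit create-taxdump` from the listed flags."""
    cmd = ["taxonkit", "create-taxdump"]
    if gtdb:
        cmd.append("--gtdb")
    if old_taxdump_dir:
        # Needed only when merged.dmp and delnodes.dmp should be generated.
        cmd += ["--old-taxdump-dir", old_taxdump_dir]
    if force:
        cmd.append("--force")
    # Assumption: the input file is given as a positional argument.
    cmd += ["--out-dir", out_dir, input_file]
    return cmd


cmd = create_taxdump_command("gtdb_taxonomy.tsv", "taxdump", gtdb=True)
print(shlex.join(cmd))
# taxonkit create-taxdump --gtdb --out-dir taxdump gtdb_taxonomy.tsv
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once taxonkit is installed and the paths point to real files.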
This issue comes from https://github.com/shenwei356/ictv-taxdump/issues/1.
I think it needs a reimplementation of the command, or a new command, that accepts arbitrary ranks. It should be easy.
--rank-names as well, but without the limitation of 8 ranks.