shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

create-taxdump: accepts arbitrary ranks #60

Closed shenwei356 closed 2 years ago

shenwei356 commented 2 years ago

This issue comes from https://github.com/shenwei356/ictv-taxdump/issues/1.

The highest rank in the ICTV taxonomy is the "realm", which is now being ignored in the ictv-taxdump. Because the taxonkit create-taxdump command only supports a fixed number of ranks, there's no way to include it without removing other ranks. Because having the realm is (to my purposes) usually more important

Ideally, we would have a taxdump that includes all the ICTV ranks (including subgenus, subfamily, suborder, etc.), but this might conflict with taxonkit's philosophy of using NCBI's "canonical ranks".

I think this needs a reimplementation of the command, or a new command, that accepts arbitrary ranks. It should be easy.

shenwei356 commented 2 years ago

The GTDB mode (--gtdb) remains compatible: its behavior is unchanged from the previous version.

And now it can better handle ICTV taxonomy (https://github.com/shenwei356/ictv-taxdump/issues/1)

Usage

Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV

Input format: 
  0. For GTDB taxonomy file, just use --gtdb.
     We use the numeric assembly accession as the taxon at subspecies rank.
     (without the prefix GCA_ and GCF_, and version number).
  1. The input file should be tab-delimited, at least one column is needed.
  2. Ranks can be given either via the first row or the flag --rank-names.
  3. A column containing the genome/assembly accession is recommended, so that
     the TaxId mapping file (taxid.map, id -> taxid) can be generated.
       -A/--field-accession,    field containing genome/assembly accession
       --field-accession-re,    regular expression to extract the accession
     Note that multiple TaxIds pointing to the same accession are listed as
     comma-separated integers.

Attention:
  1. Names should be distinct in taxa of different ranks.
     But for lineages missing some taxon nodes, using the names of parent nodes is allowed:

       GB_GCA_018897955.1      d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155

     It can also detect duplicate names with different ranks, e.g.,
     the Class and Genus have the same name B47-G6, and the Order and Family
     between them have different names. In this case, we reassign a new TaxId
     by increasing the TaxId until it is distinct.

       GB_GCA_003663585.1      d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585

  2. Taxa from different parents may have the same name.
     We will assign different TaxIds to them. 

     E.g., in ICTV, many viruses from different species have the same name.
     In practice, we set the "Virus name(s)" column as a sub-species rank and
     also specify it as the accession.

       Species             Virus name(s)
       Jerseyvirus SETP3   Salmonella phage SETP7
       Jerseyvirus SETP7   Salmonella phage SETP7
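
The collision rule in item 1 (increase the TaxId until it is distinct) can be illustrated with a small Python sketch. This is not taxonkit's implementation: the CRC32 name hash, the function names, and the rank labels are assumptions chosen for illustration, and the sketch does not model item 2 (same name under different parents).

```python
import zlib


def name_hash(name: str) -> int:
    # Assumption: CRC32 of the lowercased name is only a stand-in for
    # whatever hash taxonkit really uses to derive a TaxId from a name.
    return zlib.crc32(name.lower().encode("utf-8")) or 1


def assign_taxids(lineage):
    """Assign a TaxId to each (rank, name) pair, highest rank first."""
    taxids = {}  # (rank, name) -> TaxId
    owner = {}   # TaxId -> (rank, name) currently holding it
    for rank, name in lineage:
        key = (rank, name)
        if key in taxids:
            continue
        taxid = name_hash(name)
        # Same name at a different rank hashes to the same value,
        # so bump the TaxId until no other taxon holds it.
        while taxid in owner:
            taxid += 1
        owner[taxid] = key
        taxids[key] = taxid
    return taxids


# Lineage of GB_GCA_003663585.1 from the example above.
lineage = [
    ("domain", "Archaea"),
    ("phylum", "Thermoplasmatota"),
    ("class", "B47-G6"),
    ("order", "B47-G6B"),
    ("family", "47-G6"),
    ("genus", "B47-G6"),
    ("species", "B47-G6 sp003663585"),
]
ids = assign_taxids(lineage)
print(ids[("class", "B47-G6")] != ids[("genus", "B47-G6")])
```

Here the Class and Genus share a name and therefore a hash, so the Genus, seen second, is moved to the next free TaxId.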

Usage:
  taxonkit create-taxdump [flags] 

Flags:
  -A, --field-accession int         field index of assembly accession (genome ID), for outputting taxid.map
      --field-accession-re string   regular expression to extract assembly accession (default
                                    "^\\w\\w_(.+)$")
      --force                       overwrite existing output directory
      --gtdb                        input files are GTDB taxonomy files
      --gtdb-re-subs string         regular expression to extract assembly accession as the subspecies
                                    (default "^\\w\\w_GC[AF]_(.+)\\.\\d+$")
  -h, --help                        help for create-taxdump
      --line-chunk-size int         number of lines to process for each thread, and 4 threads is fast
                                    enough. (default 5000)
      --null strings                null value of taxa (default [,NULL,NA])
  -x, --old-taxdump-dir string      taxdump directory of the previous version, for generating merged.dmp
                                    and delnodes.dmp
  -O, --out-dir string              output directory
  -R, --rank-names strings          names of all ranks, leave it empty to use the first row of input as
                                    rank names
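
To see what the two default regular expressions listed under Flags capture, here is a small Python sketch. taxonkit itself is written in Go, so this only approximates its behavior, although these patterns are simple enough to act the same in Python's `re` module. The doubled backslashes in the help text are string escaping; the patterns themselves use single backslashes. The constant and function names are made up for this example.

```python
import re

# Default of --field-accession-re: drop the two-character prefix (e.g. "GB_").
FIELD_ACCESSION_RE = re.compile(r"^\w\w_(.+)$")

# Default of --gtdb-re-subs: additionally drop "GCA_"/"GCF_" and the version.
GTDB_SUBSPECIES_RE = re.compile(r"^\w\w_GC[AF]_(.+)\.\d+$")


def extract(pattern, value):
    """Return the first capture group, or None if the value does not match."""
    match = pattern.match(value)
    return match.group(1) if match else None


accession = "GB_GCA_018897955.1"  # taken from the help text above
print(extract(FIELD_ACCESSION_RE, accession))  # GCA_018897955.1
print(extract(GTDB_SUBSPECIES_RE, accession))  # 018897955
```

The second result is the "numeric assembly accession" that the GTDB mode uses as the taxon at subspecies rank.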