Open danielpodlesny opened 8 months ago
Thanks for the feedback.
I am very glad to see that GTDB has a https://gtdb.ecogenomic.org/taxon-history page!
taxonkit taxid-changelog was first designed for NCBI taxonomy, where changes are more continuous and less drastic than in GTDB. As a result, some results are not satisfying; I'm sorry about that.
I've checked the source code and also some records, like a g__CAG-521 species. I do think I should revise the command someday, after finishing recent work.
Thanks a lot for looking into this already.
So do you see this as a problem in the taxid-changelog command, or in the taxdumps and the lineage command? Would this change be picked up correctly by lineage if it were documented differently in the taxdumps, or can this command not resolve it in any case?
The lineage command works fine. It's just taxid-changelog, which did not handle some edge cases appropriately.
As for each individual version of the GTDB taxonomy, it is correct and has no known issues. Only the deleted.dmp and merged.dmp files are imperfect, and most tools do not use them.
I just released a new version of gtdb-taxdump, which has better support for duplicated names with different ranks. The TaxIds have also changed completely (this is not related to this issue).
(I'm returning to this issue before the new release of taxonkit.)
I was wondering whether I could improve it. The answer is no, for now. In NCBI taxonomy, the TaxIds are stable, so I can directly check whether a taxon name has changed by comparing the names in two adjacent versions. For GTDB taxonomy, however, I generate TaxIds from the hash value of
So it's hard to detect renaming events for GTDB taxonomy.
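To make the point concrete, here is a minimal sketch. It assumes only that the taxon name is part of the hashed input (the exact input is not spelled out above), and it uses cksum purely as a stand-in; it is not the hash function taxonkit uses. The helper name_id is hypothetical.

```shell
# Stand-in for a name-derived TaxId. cksum is NOT taxonkit's hash function;
# it only illustrates that a different name yields an unrelated ID.
name_id() {
    printf '%s' "$1" | cksum | cut -d ' ' -f 1
}

old_id=$(name_id "CAG-521")      # genus name up to R202
new_id=$(name_id "Aphodousia")   # genus name from R207 on

echo "CAG-521    -> $old_id"
echo "Aphodousia -> $new_id"

# With stable IDs, a rename shows up as "same ID, different name".
# With name-derived IDs there is no shared key between the two versions,
# so the pair looks like one deleted taxon plus one new taxon.
if [ "$old_id" != "$new_id" ]; then
    echo "IDs differ: the rename appears as DELETE + NEW"
fi
```

This matches what the changelog reports for this genus: CAG-521 as deleted and Aphodousia as new, with nothing linking the two.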
But if we check the change history of an assembly, it's OK: it shows CHANGE_LIN_TAX, meaning there are big changes.
$ grep GCA_003543795.1 gtdb-taxdump/R214/taxid.map
GCA_003543795.1 60618853
$ zcat gtdb-taxid-changelog.csv.gz \
| csvtk grep -f taxid -p 60618853 \
| csvtk cut -f -change-value,-lineage-taxids \
| csvtk pretty -W 40 -x ";" -S light
┌----------┬---------┬----------------┬-----------┬---------┬------------------------------------------┐
| taxid | version | change | name | rank | lineage |
├==========┼=========┼================┼===========┼=========┼==========================================┤
| 60618853 | R089 | NEW | 003543795 | no rank | Bacteria;Proteobacteria; |
| | | | | | Gammaproteobacteria;Burkholderiales; |
| | | | | | Burkholderiaceae;CAG-521; |
| | | | | | CAG-521 sp003543795;003543795 |
├----------┼---------┼----------------┼-----------┼---------┼------------------------------------------┤
| 60618853 | R207 | CHANGE_LIN_TAX | 003543795 | no rank | Bacteria;Proteobacteria; |
| | | | | | Gammaproteobacteria;Burkholderiales; |
| | | | | | Burkholderiaceae;Aphodousia; |
| | | | | | Aphodousia sp003543795;003543795 |
├----------┼---------┼----------------┼-----------┼---------┼------------------------------------------┤
| 60618853 | R214 | CHANGE_LIN_TAX | 003543795 | no rank | Bacteria;Pseudomonadota; |
| | | | | | Gammaproteobacteria;Burkholderiales; |
| | | | | | Burkholderiaceae_A;Aphodousia; |
| | | | | | Aphodousia sp003543795;003543795 |
└----------┴---------┴----------------┴-----------┴---------┴------------------------------------------┘
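Since the assembly keeps its TaxId across releases, the per-assembly lineages can be used to connect an old genus name to a new one. The following is only a sketch under assumptions: the two input files (one "accession, tab, lineage" line per assembly) and the helper genus_changes are hypothetical, the sample lineages are copied from the R089 and R207 rows of the table above, and the genus is assumed to be the sixth field of a complete lineage.

```shell
# Hypothetical inputs: one "accession<TAB>lineage" line per assembly and release.
# The two lineages are copied from the R089 and R207 rows of the table above.
workdir=$(mktemp -d)
tab=$(printf '\t')
cat > "$workdir/old_lineage.tsv" <<EOF
GCA_003543795.1${tab}Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521;CAG-521 sp003543795;003543795
EOF
cat > "$workdir/new_lineage.tsv" <<EOF
GCA_003543795.1${tab}Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp003543795;003543795
EOF

# Print "old genus<TAB>new genus<TAB>number of assemblies" for every assembly
# that is present in both files but sits in a different genus.
genus_changes() {
    awk -F '\t' '
        { split($2, rank, ";"); genus = rank[6] }   # 6th field assumed to be the genus
        NR == FNR { old[$1] = genus; next }         # first file: remember old genus
        ($1 in old) && old[$1] != genus { n[old[$1] "\t" genus]++ }
        END { for (pair in n) print pair "\t" n[pair] }
    ' "$1" "$2"
}

genus_changes "$workdir/old_lineage.tsv" "$workdir/new_lineage.tsv"
```

With many assemblies per genus, a pair that accounts for most of the old genus's assemblies is a likely rename; a low count is more consistent with a few assemblies being reassigned.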
I also added notes to taxid-changelog.
$ taxonkit taxid-changelog -h
Create TaxId changelog from dump archives
Attention:
1. This command was originally designed for NCBI taxonomy, where the the TaxIds are stable.
2. For other taxonomic data created by "taxonkit create-taxdump", e.g., GTDB-taxdump,
some change events might be wrong, because
a) There would be dramatic changes between the two versions.
b) Different taxons in multiple versions might have the same TaxIds, because we only
check and eliminate taxid collision within a single version.
So a single version of taxonomic data created by "taxonkit create-taxdump" has no problem,
it's just the changelog might not be perfect.
Note in create-taxdump:
3. We only check and eliminate taxid collision within a single version of taxonomy data.
Therefore, if you create taxid-changelog with "taxid-changelog", different taxons
in multiple versions might have the same TaxIds and some change events might be wrong.
So a single version of taxonomic data created by "taxonkit create-taxdump" has no problem,
it's just the changelog might not be perfect.
Describe your issue
Thanks for developing taxonkit and for sharing the taxdumps! It saves so much trouble.
There was this change in GTDB: R202 "CAG-521" -> R207 "Aphodousia".
I used your latest GTDB taxdump changelog, which shows CAG-521 as DELETED and Aphodousia as NEW. However, I'm unable to find the connection showing that one changed into the other.
Going as per docs I run into this:
I'm not sure whether this is due to the taxdumps or taxonkit, so I post here.
CAG-521
Aphodousia: