Open apcamargo opened 2 years ago
I see. Please try the fixed version:
Previous data:
$ grep 2094520869 gtdb-taxdump0/R095/merged.dmp
351927 | 2094520869 |
$ grep 2094520869 gtdb-taxdump0/R095/delnodes.dmp
2094520869 |
$ grep 351927 gtdb-taxdump0/R095/delnodes.dmp
Fixed data:
$ grep 2094520869 gtdb-taxdump/R089/merged.dmp
351927 | 2094520869 |
$ grep 2094520869 gtdb-taxdump/R089/delnodes.dmp
$ grep 2094520869 gtdb-taxdump/R095/merged.dmp
$ grep 2094520869 gtdb-taxdump/R095/delnodes.dmp
2094520869 |
$ grep 351927 gtdb-taxdump/R095/merged.dmp
$ grep 351927 gtdb-taxdump/R095/delnodes.dmp
351927 |
The history of 2094520869
,
$ zcat gtdb-taxid-changelog.csv.gz | csvtk grep -f taxid -p 2094520869 | csvtk cut -f 1-4 | csvtk pretty
taxid version change change-value
---------- ------- ------ ------------
2094520869 R083 NEW
2094520869 R089 ABSORB 351927
2094520869 R095 DELETE
351927
was merged into 2094520869
in R089. Since 2094520869
was deleted in R095, 351927
was also marked deleted in the same version.
$ zcat gtdb-taxid-changelog.csv.gz | csvtk grep -f taxid -p 351927 | csvtk cut -f 1-4 | csvtk pretty
taxid version change change-value
------ ------- -------------- ------------
351927 R080 NEW
351927 R083 CHANGE_LIN_TAX
351927 R089 MERGE 2094520869
351927 R095 DELETE
351927 was merged into 2094520869 in R089. Since 2094520869 was deleted in R095, 351927 was also marked deleted in the same version.
Similar TaxIds in NCBI taxonomy:
$ zcat taxid-changelog.csv.gz \
| csvtk cut -f taxid,change \
| csvtk fold -f taxid -v change \
| csvtk grep -f change -r -p NEW.+MERGE.+DELETE \
| csvtk pretty
taxid change
------- ----------------------------------
114049 NEW; MERGE; DELETE
137504 NEW; CHANGE_LIN_TAX; MERGE; DELETE
1458427 NEW; MERGE; DELETE; REUSE_DE
$ zcat taxid-changelog.csv.gz | csvtk grep -f taxid -p 114049 -I | csvtk cut -f 1-4 | csvtk pretty
taxid version change change-value
------ ---------- ------ ------------
114049 2013-02-21 NEW
114049 2018-02-01 MERGE 114055
114049 2022-08-01 DELETE
$ zcat taxid-changelog.csv.gz | csvtk grep -f taxid -p 114055 -I | csvtk cut -f 1-4 | csvtk pretty
taxid version change change-value
------ ---------- -------------- ------------
114055 2013-02-21 NEW
114055 2018-02-01 ABSORB 114049
114055 2020-01-01 CHANGE_LIN_LEN
114055 2022-08-01 DELETE
Thanks for the quick response! I'm now getting an error because of a circular definition in merged.dmp
:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/APCamargo-M55/.mambaforge/envs/snakemake/lib/python3.9/site-packages/taxopy/core.py", line 124, in __init__
self._taxid2parent, self._taxid2rank = self._import_nodes()
File "/Users/APCamargo-M55/.mambaforge/envs/snakemake/lib/python3.9/site-packages/taxopy/core.py", line 195, in _import_nodes
taxid2rank[oldtaxid] = taxid2rank[newtaxid]
KeyError: 1372546615
$ rg "1372546615" R207/*
R207/merged.dmp
23790132 | 1372546615 |
1372546615 | 23790132 |
I'd like to say, taxopy
is so cool to detect these problems!
The history of GCF_001405015.1 showed Clostridium disporicum
was renamed to Clostridium disporicum_A
in R95, and changed back in R207. It should be "rename" because there's no Clostridium disporicum
in R95 after checking the bac120_taxonomy_r95.tsv.gz
file.
I think we should mark 1372546615
as "deleted" in R207.
Also see the taxid-changelog:
$ zcat gtdb-taxid-changelog.csv.gz \
| csvtk grep -f taxid -p 23790132 \
| csvtk cut -f taxid,version,change,change-value,lineage \
| csvtk csv2md
taxid | version | change | change-value | lineage |
---|---|---|---|---|
23790132 | R080 | NEW | Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum | |
23790132 | R083 | CHANGE_LIN_TAX | Bacteria;Firmicutes_A;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum | |
23790132 | R095 | MERGE | 1372546615 | Bacteria;Firmicutes_A;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum |
23790132 | R207 | REUSE_MER | Bacteria;Firmicutes_A;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum | |
23790132 | R207 | ABSORB | 1372546615 | Bacteria;Firmicutes_A;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum |
$ zcat gtdb-taxid-changelog.csv.gz \
| csvtk grep -f taxid -p 1372546615 \
| csvtk cut -f taxid,version,change,change-value,lineage \
| csvtk csv2md
taxid | version | change | change-value | lineage |
---|---|---|---|---|
1372546615 | R095 | NEW | Bacteria;Firmicutes_A;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum_A | |
1372546615 | R095 | ABSORB | 23790132 | Bacteria;Firmicutes_A;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum_A |
1372546615 | R207 | MERGE | 23790132 | Bacteria;Firmicutes_A;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum_A |
Thanks again!
The problem now is a chain of taxids being merged:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/APCamargo-M55/.mambaforge/envs/snakemake/lib/python3.9/site-packages/taxopy/core.py", line 124, in __init__
self._taxid2parent, self._taxid2rank = self._import_nodes()
File "/Users/APCamargo-M55/.mambaforge/envs/snakemake/lib/python3.9/site-packages/taxopy/core.py", line 195, in _import_nodes
taxid2rank[oldtaxid] = taxid2rank[newtaxid]
KeyError: 3756237306
$ rg "3756237306" R207/*
R207/merged.dmp
31446350 | 3756237306 |
3756237306 | 4053664265 |
I guess in NCBI they would just remove 3756237306
in cases like this.
$ rg "3756237306" R207/*
R207/delnodes.dmp
1962:3756237306 |
$ rg "31446350" R207/*
R207/merged.dmp
43:31446350 | 4053664265 |
Worked! :)
Finally ~~ I didn't think it would be so completed ~ Thank you for reporting, Antônio!
Please let me know if you found more issues. I might release the new version next week.
Sure! Thanks for being so quick to fix this!
Hi, Antônio. Everything is right? I'm going to tag new releases for taxonkit and gtdb-taxdump.
Hmm, there are still some issues ... will check it tomorrow.
I didn't have any problem with the last round of fixes. But taxopy
isn't looking at the deleted nodes, so I can't be sure there's nothing wrong there.
Added checking another situation: the old taxid is merged to a new taxid.
It's really complicated :(. But I think all possible changes should have been taken into consideration now.
This gets really convoluted at this point. I think taxonkit's logic is as close as we can get to a "documentation"
I'm still fixing it.
I got it.
For the case of " a chain of taxids being merged"
// previous: A -> B
// current : B -> C
Previously:
// change A -> C, delete B->C, and mark B as deleted
Fixed:
// change A -> C, and keep B -> C
I'm exhausted, leave it another day.
I think taxonkit's logic is as close as we can get to a "documentation"
I don't get it.
I just meant that (as far as I know) NCBI doesn't provide a documentation for the taxdump format. So the logic you came up with to generate files that are compliant with NCBI's is the best thing an external user can get to a documentation of NCBI's format.
Here's the way how I defined a merging event.
// tree is a hash table mapping a child taxid to its parent.
// For GTDB taxonomy, genome assembly accessions were also assigned with taxids
// with the lowest rank subspecies.
foreach child, parent in tree:
if child existed in old_tree: // 1) the child was not new
old_parent = old_tree[child]
if parent != old_parent // 2) and its parent changed
and tree[parent] == old_tree[old_parent] // 3) while its grandparent kept the same
and old_parent not existed in tree: // 4) and the old parent disappeared
define: old_parent was merged to parent
But yesterday I found the merge.dmp
and deleted.dmp
results were unstable:
In one run, 123112611 was merged to 3095577279.
123112611 R086 Sphingobium japonicum_A
3095577279 | Sphingobium indicum | | scientific name |
2985200482 R089 Sphingobium japonicum_B
3095577279 | Sphingobium indicum | | scientific name |
2905153001 R089 Thiomargarita nelsonii
2729049807 | Thiomargarita nelsonii_A | | scientific name |
3740430475 R089 Staphylococcus schleiferi
3976168905 | Staphylococcus coagulans | | scientific name |
While in another run, 123112611 was merged to 251684323.
123112611 R086 Sphingobium japonicum_A
251684323 | Sphingobium chinhatense | | scientific name |
2985200482 R089 Sphingobium japonicum_B
251684323 | Sphingobium chinhatense | | scientific name |
2905153001 R089 Thiomargarita nelsonii
1691304376 | Thiomargarita nelsonii_B | | scientific name |
3740430475 R089 Staphylococcus schleiferi
3999988872 | Staphylococcus fleurettii | | scientific name |
Let's check the genomes:
$ taxonkit --data-dir R086/ list --ids 123112611 -I "" | csvtk join -Ht -f '1;2' - R086/taxid.map
1950510646 GCF_000445085.1
3414476628 GCF_000091125.1
$ taxonkit --data-dir R086/ list --ids 123112611 -I "" | sed 1d | taxonkit lineage --data-dir R095 -t | csvtk pretty -Ht
1950510646 Bacteria;Proteobacteria;Alphaproteobacteria;Sphingomonadales;Sphingomonadaceae;Sphingobium;Sphingobium chinhatense;000445085 609216830;3788559933;3294429563;1352939149;838835488;4212071235;251684323;1950510646
3414476628 Bacteria;Proteobacteria;Alphaproteobacteria;Sphingomonadales;Sphingomonadaceae;Sphingobium;Sphingobium indicum;000091125 609216830;3788559933;3294429563;1352939149;838835488;4212071235;3095577279;3414476628
It turns out that in R95 some (_Sphingobium japonicumA) genomes (GCF_000445085.1) were merged into (Sphingobium chinhatense), while others (GCF_000091125.1) into Sphingobium indicum. BTW, Sphingobium chinhatense only appeared in R95 not others.
I don't know how to appropriately handle this case, any idea? @apcamargo
As far as I remember (and also the release notes), there was no change in the way species were delimited in r89 and r95. It's just that the centroid can change between different releases which makes this sort of thing happen. Is that right, @dparks1134?
I can try to come up with a smarter solution tomorrow. Right now, the simplest thing that came to my mind is to not make a taxon depend on a single genome. In NCBI, even if the taxonomy of a single genome changed from A to B, taxon A would not necessarily be merged into B. But (as far as I know), they do all the merging/deletion manually, which is something that defeats the purpose of taxonkit.
So, one possible solution is to make a taxon depend on a "consensus". Let's say you have taxon A with 5 members and a single one changes to taxon B. Regardless of the order the genomes are processed,taxon A would not merge into B because the majority is still there. If 4 genomes were assigned to B and 1 to C, taxon A would be merged into B (nor perfect, but the most parsimonious merge). If 4 genomes moved to B and 1 remained in A, A would not merge with B.
Does that makes sense? There might be some edge cases, I'll try to think of something more robust tomorrow.
Hi. Yes, the representative (centroid) genome use to define a GTDB species clusters can change between releases. This is a rare event and only done when a sufficiently better genome assembly becomes available, but it does happen. You can find the exact criteria used to nominate a new representative in the "Updating GTDB species cluster" section of our NAR manuscript: https://academic.oup.com/nar/article/50/D1/D785/6370255
I'm trying to parse the GTDB r207 taxdump with
taxopy
and I got the following error:I've never had this problem with NCBI before, so my guess is that they remove deleted nodes from
merged.dmp
.