shenwei356 / gtdb-taxdump

GTDB taxonomy taxdump files with trackable TaxIds
MIT License
46 stars 2 forks source link

Deleted nodes are kept in merged.dmp #2

Open apcamargo opened 2 years ago

apcamargo commented 2 years ago

I'm trying to parse the GTDB r207 taxdump with taxopy and I got the following error:

taxdb = taxopy.TaxDb(taxdb_dir="gtdb-taxdump/R207")
KeyError                                  Traceback (most recent call last)
/var/folders/h4/7rmy02997p3_xb4qgjzx257c0000gp/T/ipykernel_43701/673983918.py in <module>
----> 1 taxdb = taxopy.TaxDb(taxdb_dir="/Users/APCamargo-M55/Downloads/ISOLATE_TAXONOMY/gtdb-taxdump/R207")

~/.mambaforge/envs/snakemake/lib/python3.9/site-packages/taxopy/core.py in __init__(self, taxdb_dir, nodes_dmp, names_dmp, merged_dmp, keep_files)
    122         self._oldtaxid2newtaxid = self._import_merged() if self._merged_dmp else None
    123         # Create the taxid2parent, taxid2rank, and taxid2name dictionaries:
--> 124         self._taxid2parent, self._taxid2rank = self._import_nodes()
    125         self._taxid2name = self._import_names()
    126         # Delete temporary files if `keep_files` is set to `False`, unless

~/.mambaforge/envs/snakemake/lib/python3.9/site-packages/taxopy/core.py in _import_nodes(self)
    193         if self._merged_dmp:
    194             for oldtaxid, newtaxid in self._oldtaxid2newtaxid.items():
--> 195                 taxid2rank[oldtaxid] = taxid2rank[newtaxid]
    196                 taxid2parent[oldtaxid] = taxid2parent[newtaxid]
    197         return taxid2parent, taxid2rank

KeyError: 2094520869

I've never had this problem with NCBI before, so my guess is that they remove deleted nodes from merged.dmp.

shenwei356 commented 2 years ago

I see. Please try the fixed version:

Previous data:

$  grep 2094520869 gtdb-taxdump0/R095/merged.dmp 
351927  |       2094520869      |
$ grep 2094520869 gtdb-taxdump0/R095/delnodes.dmp
2094520869      |
$ grep 351927 gtdb-taxdump0/R095/delnodes.dmp

Fixed data:

$ grep 2094520869 gtdb-taxdump/R089/merged.dmp 
351927  |       2094520869      |
$ grep 2094520869 gtdb-taxdump/R089/delnodes.dmp

$ grep 2094520869 gtdb-taxdump/R095/merged.dmp
$ grep 2094520869 gtdb-taxdump/R095/delnodes.dmp
2094520869      |

$ grep 351927 gtdb-taxdump/R095/merged.dmp 
$ grep 351927 gtdb-taxdump/R095/delnodes.dmp 
351927  |

The history of 2094520869,

$ zcat gtdb-taxid-changelog.csv.gz | csvtk grep -f taxid -p 2094520869 | csvtk cut -f 1-4 | csvtk pretty 
taxid        version   change   change-value
----------   -------   ------   ------------
2094520869   R083      NEW      
2094520869   R089      ABSORB   351927
2094520869   R095      DELETE 

351927 was merged into 2094520869 in R089. Since 2094520869 was deleted in R095, 351927 was also marked deleted in the same version.

$ zcat gtdb-taxid-changelog.csv.gz | csvtk grep -f taxid -p 351927 | csvtk cut -f 1-4 | csvtk pretty 
taxid    version   change           change-value
------   -------   --------------   ------------
351927   R080      NEW              
351927   R083      CHANGE_LIN_TAX   
351927   R089      MERGE            2094520869
351927   R095      DELETE 
shenwei356 commented 2 years ago

351927 was merged into 2094520869 in R089. Since 2094520869 was deleted in R095, 351927 was also marked deleted in the same version.

Similar TaxIds in NCBI taxonomy:

$ zcat taxid-changelog.csv.gz \
    | csvtk cut -f taxid,change \
    | csvtk fold -f taxid -v change \
    | csvtk grep -f change -r -p NEW.+MERGE.+DELETE \
    | csvtk pretty
taxid     change
-------   ----------------------------------
114049    NEW; MERGE; DELETE
137504    NEW; CHANGE_LIN_TAX; MERGE; DELETE
1458427   NEW; MERGE; DELETE; REUSE_DE

$ zcat taxid-changelog.csv.gz | csvtk grep -f taxid -p 114049 -I | csvtk cut -f 1-4 | csvtk pretty 
taxid    version      change   change-value
------   ----------   ------   ------------
114049   2013-02-21   NEW      
114049   2018-02-01   MERGE    114055
114049   2022-08-01   DELETE

$ zcat taxid-changelog.csv.gz | csvtk grep -f taxid -p 114055 -I | csvtk cut -f 1-4 | csvtk pretty 
taxid    version      change           change-value
------   ----------   --------------   ------------
114055   2013-02-21   NEW              
114055   2018-02-01   ABSORB           114049
114055   2020-01-01   CHANGE_LIN_LEN   
114055   2022-08-01   DELETE
apcamargo commented 2 years ago

Thanks for the quick response! I'm now getting an error because of a circular definition in merged.dmp:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/APCamargo-M55/.mambaforge/envs/snakemake/lib/python3.9/site-packages/taxopy/core.py", line 124, in __init__
    self._taxid2parent, self._taxid2rank = self._import_nodes()
  File "/Users/APCamargo-M55/.mambaforge/envs/snakemake/lib/python3.9/site-packages/taxopy/core.py", line 195, in _import_nodes
    taxid2rank[oldtaxid] = taxid2rank[newtaxid]
KeyError: 1372546615
$ rg "1372546615" R207/*
R207/merged.dmp
23790132    |   1372546615  |
1372546615  |   23790132    |
shenwei356 commented 2 years ago

I'd like to say, taxopy is so cool to detect these problems!

The history of GCF_001405015.1 showed Clostridium disporicum was renamed to Clostridium disporicum_A in R95, and changed back in R207. It should be "rename" because there's no Clostridium disporicum in R95 after checking the bac120_taxonomy_r95.tsv.gz file.

I think we should mark 1372546615 as "deleted" in R207.

Also see the taxid-changelog:

$ zcat gtdb-taxid-changelog.csv.gz \
    | csvtk grep -f taxid -p 23790132 \
    | csvtk cut -f taxid,version,change,change-value,lineage \
    | csvtk csv2md 
taxid version change change-value lineage
23790132 R080 NEW Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum
23790132 R083 CHANGE_LIN_TAX Bacteria;Firmicutes_A;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum
23790132 R095 MERGE 1372546615 Bacteria;Firmicutes_A;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum
23790132 R207 REUSE_MER Bacteria;Firmicutes_A;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum
23790132 R207 ABSORB 1372546615 Bacteria;Firmicutes_A;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum
$ zcat gtdb-taxid-changelog.csv.gz \
    | csvtk grep -f taxid -p 1372546615 \
    | csvtk cut -f taxid,version,change,change-value,lineage \
    | csvtk csv2md 
taxid version change change-value lineage
1372546615 R095 NEW Bacteria;Firmicutes_A;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum_A
1372546615 R095 ABSORB 23790132 Bacteria;Firmicutes_A;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum_A
1372546615 R207 MERGE 23790132 Bacteria;Firmicutes_A;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium disporicum_A
shenwei356 commented 2 years ago

Hope there's no bug.

The change histories were not changed for 23790132 and 1372546615.

apcamargo commented 2 years ago

Thanks again!

The problem now is a chain of taxids being merged:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/APCamargo-M55/.mambaforge/envs/snakemake/lib/python3.9/site-packages/taxopy/core.py", line 124, in __init__
    self._taxid2parent, self._taxid2rank = self._import_nodes()
  File "/Users/APCamargo-M55/.mambaforge/envs/snakemake/lib/python3.9/site-packages/taxopy/core.py", line 195, in _import_nodes
    taxid2rank[oldtaxid] = taxid2rank[newtaxid]
KeyError: 3756237306
$ rg "3756237306" R207/*
R207/merged.dmp
31446350    |   3756237306  |
3756237306  |   4053664265  |

I guess in NCBI they would just remove 3756237306 in cases like this.

shenwei356 commented 2 years ago
$ rg "3756237306" R207/*
R207/delnodes.dmp
1962:3756237306 |

$ rg "31446350" R207/*
R207/merged.dmp
43:31446350     |       4053664265      |

R207.tar.gz

apcamargo commented 2 years ago

Worked! :)

shenwei356 commented 2 years ago

Finally ~~ I didn't think it would be so completed ~ Thank you for reporting, Antônio!

Please let me know if you found more issues. I might release the new version next week.

apcamargo commented 2 years ago

Sure! Thanks for being so quick to fix this!

shenwei356 commented 2 years ago

Hi, Antônio. Everything is right? I'm going to tag new releases for taxonkit and gtdb-taxdump.

shenwei356 commented 2 years ago

Hmm, there are still some issues ... will check it tomorrow.

apcamargo commented 2 years ago

I didn't have any problem with the last round of fixes. But taxopy isn't looking at the deleted nodes, so I can't be sure there's nothing wrong there.

shenwei356 commented 2 years ago

Added checking another situation: the old taxid is merged to a new taxid.

It's really complicated :(. But I think all possible changes should have been taken into consideration now.

apcamargo commented 2 years ago

This gets really convoluted at this point. I think taxonkit's logic is as close as we can get to a "documentation"

shenwei356 commented 2 years ago

I'm still fixing it.

shenwei356 commented 2 years ago

I got it.

For the case of " a chain of taxids being merged"

// previous: A -> B
// current : B -> C

Previously:

// change A -> C, delete B->C, and mark B as deleted

Fixed:

// change A -> C, and keep B -> C
shenwei356 commented 2 years ago

I'm exhausted, leave it another day.

I think taxonkit's logic is as close as we can get to a "documentation"

I don't get it.

apcamargo commented 2 years ago

I just meant that (as far as I know) NCBI doesn't provide a documentation for the taxdump format. So the logic you came up with to generate files that are compliant with NCBI's is the best thing an external user can get to a documentation of NCBI's format.

shenwei356 commented 2 years ago

Here's the way how I defined a merging event.

// tree is a hash table mapping a child taxid to its parent.
// For GTDB taxonomy, genome assembly accessions were also assigned with taxids
// with the lowest rank subspecies.

foreach child, parent in tree:
    if child existed in old_tree:                     // 1) the child was not new
        old_parent = old_tree[child]
        if parent != old_parent                       // 2) and its parent changed
            and tree[parent] == old_tree[old_parent]  // 3) while its grandparent kept the same
            and old_parent not existed in tree:       // 4) and the old parent disappeared

            define: old_parent was merged to parent

But yesterday I found the merge.dmp and deleted.dmp results were unstable:

R095

In one run, 123112611 was merged to 3095577279.

123112611   R086 Sphingobium japonicum_A
3095577279      |       Sphingobium indicum     |               |       scientific name |

2985200482  R089 Sphingobium japonicum_B
3095577279      |       Sphingobium indicum     |               |       scientific name |

2905153001  R089 Thiomargarita nelsonii
2729049807      |       Thiomargarita nelsonii_A        |               |       scientific name |

3740430475  R089 Staphylococcus schleiferi
3976168905      |       Staphylococcus coagulans        |               |       scientific name |

While in another run, 123112611 was merged to 251684323.

123112611   R086 Sphingobium japonicum_A
251684323       |       Sphingobium chinhatense |               |       scientific name |

2985200482  R089 Sphingobium japonicum_B
251684323       |       Sphingobium chinhatense |               |       scientific name |

2905153001  R089 Thiomargarita nelsonii
1691304376      |       Thiomargarita nelsonii_B        |               |       scientific name |

3740430475  R089 Staphylococcus schleiferi
3999988872      |       Staphylococcus fleurettii       |               |       scientific name |

Let's check the genomes:

$ taxonkit --data-dir R086/ list  --ids 123112611 -I "" | csvtk join -Ht -f '1;2' - R086/taxid.map 
1950510646      GCF_000445085.1
3414476628      GCF_000091125.1

$ taxonkit --data-dir R086/ list  --ids 123112611 -I "" | sed 1d |  taxonkit lineage --data-dir R095 -t | csvtk pretty -Ht
1950510646   Bacteria;Proteobacteria;Alphaproteobacteria;Sphingomonadales;Sphingomonadaceae;Sphingobium;Sphingobium chinhatense;000445085   609216830;3788559933;3294429563;1352939149;838835488;4212071235;251684323;1950510646
3414476628   Bacteria;Proteobacteria;Alphaproteobacteria;Sphingomonadales;Sphingomonadaceae;Sphingobium;Sphingobium indicum;000091125       609216830;3788559933;3294429563;1352939149;838835488;4212071235;3095577279;3414476628

It turns out that in R95 some (_Sphingobium japonicumA) genomes (GCF_000445085.1) were merged into (Sphingobium chinhatense), while others (GCF_000091125.1) into Sphingobium indicum. BTW, Sphingobium chinhatense only appeared in R95 not others.

I don't know how to appropriately handle this case, any idea? @apcamargo

apcamargo commented 2 years ago

As far as I remember (and also the release notes), there was no change in the way species were delimited in r89 and r95. It's just that the centroid can change between different releases which makes this sort of thing happen. Is that right, @dparks1134?

I can try to come up with a smarter solution tomorrow. Right now, the simplest thing that came to my mind is to not make a taxon depend on a single genome. In NCBI, even if the taxonomy of a single genome changed from A to B, taxon A would not necessarily be merged into B. But (as far as I know), they do all the merging/deletion manually, which is something that defeats the purpose of taxonkit.

So, one possible solution is to make a taxon depend on a "consensus". Let's say you have taxon A with 5 members and a single one changes to taxon B. Regardless of the order the genomes are processed,taxon A would not merge into B because the majority is still there. If 4 genomes were assigned to B and 1 to C, taxon A would be merged into B (nor perfect, but the most parsimonious merge). If 4 genomes moved to B and 1 remained in A, A would not merge with B.

Does that makes sense? There might be some edge cases, I'll try to think of something more robust tomorrow.

donovan-h-parks commented 2 years ago

Hi. Yes, the representative (centroid) genome use to define a GTDB species clusters can change between releases. This is a rare event and only done when a sufficiently better genome assembly becomes available, but it does happen. You can find the exact criteria used to nominate a new representative in the "Updating GTDB species cluster" section of our NAR manuscript: https://academic.oup.com/nar/article/50/D1/D785/6370255