larssnip / HumGut

A healthy human gut genome collection

Updating genomes taxonomy and adding other viruses and fungi genomes #4

Open Lucas-Maciel opened 2 years ago

Lucas-Maciel commented 2 years ago

Hi,

I'm very much interested in using the HumGut database, but I have a few questions about the taxonomy files.

1) I wanted to include the Kraken DB for viruses and fungi to explore these other kingdoms as well. However, I'm worried that the taxonomy files might not be compatible (the format seems a bit different). Do you have any tips on how to integrate the taxonomy files?

2) I also wanted to update the NCBI and GTDB taxonomy classification of the genomes to check for species that were recently split. But I'm not sure what the best way to do that is while keeping the structure you built.

Thank you in advance.

larssnip commented 2 years ago

Hi,

Thanks for this. The taxonomy files are actually in the same format as those used by NCBI Taxonomy, but the nodes file contains only 3 columns, while the corresponding file from NCBI Taxonomy has more. However, kraken2 uses only these first 3 columns.

In principle, it should be possible to take the NCBI names.dmp file and append the HumGut version of names.dmp to it, creating a new, extended names.dmp. However, some taxa will then be listed twice, namely those present in both the NCBI Taxonomy and the HumGut version, and these duplicates should probably be eliminated. The same applies to the nodes.dmp files.
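As a rough, untested-on-real-data sketch of that idea in Python: the NCBI file is kept in full, and HumGut lines are appended only when their key is not already present. The function names, the choice of key columns, and the option to trim nodes.dmp to 3 columns are all assumptions made for illustration, not part of the HumGut recipe. It also assumes both files use the standard NCBI `\t|\t` field separator and that names.dmp has four columns in both sources.

```python
from pathlib import Path

SEP = "\t|\t"   # NCBI .dmp field separator
END = "\t|"     # NCBI .dmp end-of-line marker


def parse_dmp_line(line):
    """Split one NCBI-style .dmp line into a list of stripped fields."""
    line = line.rstrip("\r\n")
    if line.endswith(END):
        line = line[: -len(END)]
    return [field.strip() for field in line.split(SEP)]


def format_dmp_line(fields):
    """Serialize fields back into one NCBI-style .dmp line."""
    return SEP.join(fields) + END + "\n"


def merge_dmp(primary_lines, extra_lines, key_fields, n_columns=None):
    """Keep every primary line; append extra lines whose key is new.

    key_fields: column indices that identify a duplicate (an assumption:
                (0,) for nodes.dmp, (0, 1, 3) for names.dmp).
    n_columns:  if given, keep only the first n columns of every line,
                e.g. 3 to match the 3-column HumGut nodes file.
    """
    seen = set()
    merged = []
    for is_primary, lines in ((True, primary_lines), (False, extra_lines)):
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            fields = parse_dmp_line(line)
            if n_columns is not None:
                fields = fields[:n_columns]
            key = tuple(fields[i] for i in key_fields)
            if not is_primary and key in seen:
                continue  # already listed, drop the duplicate
            seen.add(key)
            merged.append(format_dmp_line(fields))
    return merged


def merge_files(primary_path, extra_path, out_path, key_fields, n_columns=None):
    """File-based wrapper around merge_dmp; returns the number of lines written."""
    with open(primary_path, encoding="utf-8") as primary, \
         open(extra_path, encoding="utf-8") as extra:
        merged = merge_dmp(primary, extra, key_fields, n_columns)
    Path(out_path).write_text("".join(merged), encoding="utf-8")
    return len(merged)
```

With hypothetical paths, this could be called as `merge_files("ncbi/nodes.dmp", "humgut/nodes.dmp", "merged/nodes.dmp", key_fields=(0,), n_columns=3)` and `merge_files("ncbi/names.dmp", "humgut/names.dmp", "merged/names.dmp", key_fields=(0, 1, 3))`. Note that if the same tax ID has a different parent or rank in the two nodes files, this sketch silently keeps the NCBI version, so such conflicts would still need to be checked by hand.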

I will update the entire HumGut collection shortly, and then I will test this as well, and extend the recipe for using kraken2.

Lucas-Maciel commented 2 years ago

Thanks for your fast reply. I'll work on that ;)

When you say shortly, do you have an estimate? Just so I can have an idea.

Thanks again.

larssnip commented 2 years ago

I hope early July, when most students are gone...;)