larssnip / HumGut

A healthy human gut genome collection

Build a Kraken2 db, adding groups not included in HumGut #3

Open mjgommo opened 2 years ago

mjgommo commented 2 years ago

Dear developers,

I would like to use HumGut and Kraken2 to analyze some human gut microbiome data, and expanding the database to include other taxonomical groups not covered by HumGut, such as Protozoa and Fungi, would be very helpful.

Data on those groups can be directly obtained with "kraken2-build --download-library" but, of course, they are not covered by the taxonomy that is available from the HumGut site.

Would it be possible to include such groups in a combined database? If so, how?

Thanks a lot in advance

larssnip commented 2 years ago

Thanks for this.

As you point out, the bottleneck here is the taxonomy. The names.dmp and nodes.dmp files we supply along with HumGut are pruned to contain only what is needed for HumGut. To include fungi, here is my first suggestion on how to proceed:

1) Create the database folder, and start building the kraken2 database according to the kraken2 manual:
   a) add the taxonomy, and then
   b) download the fungi library in the standard kraken2 way.
2) You must now extend the taxonomy to also include the HumGut tax_id's. This means adding to the files names.dmp and nodes.dmp in the taxonomy folder. These are text files, and in principle you could just add the content of, say ncbi_names.dmp, to the names.dmp:
   cat ncbi_names.dmp >> names.dmp
   and similar for the nodes.dmp file.
3) Then proceed by adding the HumGut library and building the database, as described in our GitHub site.
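
The three steps above could be sketched as a shell script roughly like the one below. This is only an illustration under assumptions, not a tested recipe: the database folder, the HumGut folder and the HumGut FASTA file name are hypothetical placeholders, the FASTA is assumed to already carry headers that kraken2 can map to tax_ids (as described on the HumGut GitHub site), and the DRY_RUN switch is a helper added here so the commands can be inspected without running them. Step 2 is the naive append and does nothing about duplicated nodes.

```shell
# Sketch of steps 1-3. All paths passed in are hypothetical placeholders.

# Helper (an addition for illustration): print the command when DRY_RUN is set,
# otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Append the content of file $1 to file $2.
append_file() {
  cat "$1" >> "$2"
}

# build_combined_db DB_DIR HUMGUT_DIR HUMGUT_FASTA
build_combined_db() {
  db="$1"          # kraken2 database folder
  humgut_dir="$2"  # folder holding ncbi_names.dmp and ncbi_nodes.dmp from HumGut
  humgut_fna="$3"  # HumGut genomes, headers assumed ready for kraken2

  # Step 1a: standard NCBI taxonomy, lands in $db/taxonomy
  run kraken2-build --download-taxonomy --db "$db"
  # Step 1b: fungi library, downloaded the standard way
  run kraken2-build --download-library fungi --db "$db"

  # Step 2: naive extension of the taxonomy with the HumGut entries
  run append_file "$humgut_dir/ncbi_names.dmp" "$db/taxonomy/names.dmp"
  run append_file "$humgut_dir/ncbi_nodes.dmp" "$db/taxonomy/nodes.dmp"

  # Step 3: add the HumGut library and build
  run kraken2-build --add-to-library "$humgut_fna" --db "$db"
  run kraken2-build --build --db "$db"
}

# Example (prints the commands only):
#   DRY_RUN=1
#   build_combined_db my_db humgut_files HumGut_kraken2.fna
```

Other groups such as protozoa would be added with a further --download-library call in step 1b.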

I can foresee some problems in step 2, because both the original NCBI files names.dmp and nodes.dmp, as well as our supplied ncbi_names.dmp and ncbi_nodes.dmp, will contain the root node and a lot of other nodes. By simply concatenating them, these nodes will be listed twice. I seem to remember kraken2 will not like this! If so, both the names.dmp and nodes.dmp must be pruned to have only unique lines.
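
If pruning does turn out to be necessary, one possible way to merge the files without duplicates is sketched below, using awk. The function names are made up for illustration. For names.dmp it keeps the first copy of each identical line, which is the "unique lines" idea. For nodes.dmp it goes one step further and keeps only the first line seen for each tax_id (the first tab-separated field); that choice is an assumption on my part, made because the same node could differ slightly between two taxonomy versions and would then survive a plain unique-lines filter. Whichever file is listed first wins in that case.

```shell
# merge_names BASE EXTRA OUT
# Concatenate BASE and EXTRA, keeping only the first copy of each identical line.
merge_names() {
  awk '!seen[$0]++' "$1" "$2" > "$3"
}

# merge_nodes BASE EXTRA OUT
# Concatenate BASE and EXTRA, keeping only the first line for each tax_id.
# Assumption: one line per tax_id in nodes.dmp, tax_id is the first tab-separated field.
merge_nodes() {
  awk -F '\t' '!seen[$1]++' "$1" "$2" > "$3"
}

# Example, run inside the taxonomy folder (OUT must be a different file than the inputs):
#   merge_names names.dmp ncbi_names.dmp names.merged && mv names.merged names.dmp
#   merge_nodes nodes.dmp ncbi_nodes.dmp nodes.merged && mv nodes.merged nodes.dmp
```

Writing to a separate output file and renaming afterwards avoids truncating an input file while awk is still reading it.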

I am updating the entire HumGut collection soon, and then I will also test this out and update the recipe on the GitHub site accordingly.

mjgommo commented 2 years ago

Hi Lars,

Thanks a lot. I will try and let you know.

Best regards,