muellan / metacache

memory efficient, fast & precise taxnomomic classification system for metagenomic read mapping
GNU General Public License v3.0
57 stars 12 forks source link

Custom Database Creation #14

Closed punnettsun closed 3 years ago

punnettsun commented 3 years ago

Hello,

I just need a little clarification. In the custom database documentation, would I be able to have my own taxonomy and ID rather than using NCBI taxonomy and database if I am able to provide all the necessary files? I have my own custom database and would like to use my own taxonomy rather than NCBI taxonomy and taxids.

Thank you.

muellan commented 3 years ago

Hi, yes you can use your own taxonomy but you have to follow the layout of the NCBI taxonomy with its tab-separated textfiles "names.dmp", "nodes.dmp", "merged.dmp". The taxids must be positive numbers and 0 should be root.

punnettsun commented 3 years ago

Great! Also, I opened an NCBI nodes and names dmp file, and they have 1 as the root, not 0?

punnettsun commented 3 years ago

I just wanted to update you that I have made MetaCache work using my own nodes.dmp, names.dmp, and merged.dmp files (using ID 1 as root). Thank you very much for the help, and I look forward to using MetaCache!

muellan commented 3 years ago

Oh, you're right - the root is of course '1' - '0' means "no taxon"

punnettsun commented 3 years ago

Hello again!

I was able to run a test dataset and have MetaCache classify my reads. I have a question about sequence level classifications. In the mapping file that was generated, I can see that the sequence level takes the header information until the first space in the genome assembly files like below. Ex: sequence:NZ_MSKV08572672.1

Since I know the genome assembly files I have contain multiple contigs/scaffolds, would I be able to just replace the header information to give strain name instead for each contig? So for each header in the genome assembly file, I would have the same strain number. If my genome assembly file for GCF_####### looks like this:

NZ_MSKV08572672.1 ATGCTTAGATCGT... .... NZ_MSKV08573869.1 CGTAGATCTCCTA... ...

I can modify the header information to make it look like this instead where all headers for each contig/scaffold have the same strain numbers:

strain_264762 ATGCTTAGATCGT... ... strain_264762 CGTAGATCTCCTA... ...

So when I run MetaCache, the mapping file would contain something like this if it mapped to any one of the contigs/scaffolds: sequence:strain_264762

Even though it says it's a sequence, I can interpret it as the strain if I change all the headers inside my files to reflect the strain number instead correct? Or would there be a problem if MetaCache sees that there are multiple contigs with the same header?

Thank you.

Funatiq commented 3 years ago

Hi! It is not possible to change the headers to the same name. MetaCache requires unique headers when adding the sequences to the database. So your strategy should work if you enumerate the sequences of each strain, e.g.

strain_264762_0
ATGCTTAGATCGT...
...
strain_264762_1
CGTAGATCTCCTA...
...

Alternatively, MetaCache supports several ways to map sequences to taxa. So you could add a separate taxon for each strain to your taxonomy and leave the sequence headers unchanged.

muellan commented 3 years ago

If you decide to use the second option - adding your own taxon ids for the strain level - then make sure

punnettsun commented 3 years ago

Great! I do have the taxonomy files (nodes.dmp and names.dmp, and an empty merged.dmp), reference genome files, and the assembly2taxid file that is NCBI-styled. However, when I ran MetaCache, I was only able to see sequence level and then species level and up. Strain level was not accounted even though I had it in my taxonomy files. The parent to the strains in my taxonomy are pointing towards a specific species. Would I need to have the strains set to a sub-species as its parent in order for this to work or should it be fine to have strain -> species -> genus ->etc.? I'm just not sure why strain isn't appearing when I have it in my taxonomy files. I am also only using 5 genomes as my reference with a sample dataset that contains reads from 2 of those reference genomes as a test.

I just checked and it seems that the ranking goes like this: sequence -> form -> variety -> subspecies -> species -> etc. I have the species to which the strains belong to so if I want to have strain level classifications, I could rank my strains as subspecies instead.

Thank you.

muellan commented 3 years ago

Yeah, it is not called "strain"; we tried to stick to the NCBI nomenclature. You could use Form or Variety or Subspecies.

Forget about the following comment - I just looked into the code and remembered that the LCA already works for all taxonomic levels:

Note that in the current version, if a read can be mapped in principle, but cannot be unambiguously mapped to a sequence/form/variety/subspecies, the lowest possible LCA that will be returned is "species". We might change that in a future version and include lower levels in the LCA computation - it's not much effort, but until now nobody had requested it.

punnettsun commented 3 years ago

I used subspecies rank for strains and it worked perfectly. Thank you!

muellan commented 3 years ago

Great! I hope your investigations will be successful. We would be highly interested in your experience and results regarding strain-level classification with Metacache.