metaDMG-dev / metaDMG-core

metaDMG
MIT License

Missing documentation #16

Open susheelbhanu opened 1 month ago

susheelbhanu commented 1 month ago

Hi,

I'm trying to use this tool, but the documentation does not make clear where the names and nodes files come from.

$ metaDMG config raw_data/alignment.sorted.bam \
    --names raw_data/names-mdmg.dmp \
    --nodes raw_data/nodes-mdmg.dmp \
    --acc2tax raw_data/acc2taxid.map.gz \
    --custom-database

Does the tool provide them or download them automatically?

Thanks, Susheel

FranckLejzerowicz commented 1 week ago

Hi Susheel,

I figured out that, if you follow what's done in the Tutorial, you can see that the files distributed with metaDMG cannot necessarily be used for your own data.

The tutorial is not very explicit, so it helps to peek into the files downloaded with metaDMG get-data --output-dir raw_data:

Indeed, raw_data/acc2taxid.map.gz suggests that what metaDMG-cpp uses to look up taxids and calculate the LCA is most likely the "chr" (reference name) listed as the accession. These accessions are also the targets onto which the reads were mapped.

For example, if you download these files using metaDMG get-data, you can see that they match: a genome that reads were mapped onto has a taxid in the mapping file.

$ samtools view alignment.sorted.bam | grep GCA_000007325.1 | wc -l
581
$ zcat acc2taxid.map.gz | grep GCA_000007325.1
GCA_000007325.1 GCA_000007325.1 21768   2
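
The same check can be run over every reference at once rather than one accession at a time. Below is a minimal sketch, not part of metaDMG: the function names are hypothetical, and it assumes the mapping file is whitespace-separated with the accession in the first two columns and the taxid in the third, as in the line shown above. Whether metaDMG-cpp keys on the first or the second column is not confirmed here, so both are indexed.

```python
import gzip
import re


def reference_names(sam_header: str) -> set[str]:
    """Collect reference names from the SN tag of the @SQ lines of a SAM header."""
    names = set()
    for line in sam_header.splitlines():
        if line.startswith("@SQ"):
            match = re.search(r"\tSN:([^\t]+)", line)
            if match:
                names.add(match.group(1))
    return names


def accession_to_taxid(path: str) -> dict[str, str]:
    """Read a gzipped accession-to-taxid map (assumed layout: acc, acc.version, taxid, ...)."""
    mapping = {}
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            # Skip a header row or any line without a numeric taxid in column 3.
            if len(fields) < 3 or not fields[2].isdigit():
                continue
            mapping[fields[0]] = fields[2]
            mapping[fields[1]] = fields[2]
    return mapping


def unmapped_references(sam_header: str, acc2tax_path: str) -> list[str]:
    """Return the references in the header that have no entry in the map."""
    known = accession_to_taxid(acc2tax_path)
    return sorted(reference_names(sam_header).difference(known))
```

The header text can be obtained with samtools view -H alignment.sorted.bam and passed in as a string. An empty list means every reference the reads were mapped onto has a taxid.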

The thing is, if you use a custom database rather than NCBI genomes, you have to make sure that the contents of the files passed to --names, --nodes and --acc2tax match one another. For each of my metagenomes' contigs, I'll be pulling taxids from the GOs of the majority of genes annotated using eggnog-mapper, and making sure that those taxids are themselves referenced in an NCBI taxdump (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/).
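
One way to verify that last point is to confirm that every taxid you assign appears in both dump files. This is a minimal sketch with hypothetical function names, assuming the standard NCBI taxdump layout in which fields are separated by "|" and the first field of each line in nodes.dmp and names.dmp is the tax_id. It is not taken from metaDMG itself.

```python
def taxids_in_dump(path: str) -> set[str]:
    """Return the tax_id column (first '|'-separated field) of a .dmp file."""
    taxids = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            first = line.split("|", 1)[0].strip()
            if first:
                taxids.add(first)
    return taxids


def missing_taxids(used_taxids, nodes_path: str, names_path: str) -> dict[str, list[str]]:
    """List the taxids in use that are absent from nodes.dmp or names.dmp."""
    used = set(used_taxids)
    return {
        "nodes": sorted(used - taxids_in_dump(nodes_path)),
        "names": sorted(used - taxids_in_dump(names_path)),
    }
```

Here used_taxids would be, for example, the taxid column of the file passed to --acc2tax. Both lists coming back empty means the three files are at least consistent on taxids.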

Hope this helps!