nlapier2 / Metalign

Metalign: efficient alignment-based metagenomic profiling via containment min hash
MIT License

Build a customized database? #39

Open chunyuma opened 3 years ago

chunyuma commented 3 years ago

Hi @nlapier2 and @dkoslicki,

This is Chunyu, and I have two questions regarding Metalign.

  1. Does the default database used by Metalign include the genomes of all organisms in the NCBI database?

  2. To run Metalign, is it possible to build a customized database that focuses mainly on the viruses and fungi in the NCBI taxonomy database? Perhaps this would require running CMash on all of these genomes (viruses and fungi). Do you have a script for building such a customized database easily? If so, could you please give me some instructions on how to build it? Also, which files in /data would I need to change to use the customized database?

Thank you so much!

dkoslicki commented 3 years ago

Hi @chunyuma, as for 1., @nlapier2 will be able to explain in more detail, but if I recall correctly, Metalign was trained on a subset of NCBI. There isn't really a single "NCBI database" (there's RefSeq, the genomes FTP server, the SRA, etc.), so some filtering was done, but @nlapier2 knows the details.

  2. There is a branch in CMash that shows how to retrain CMash and Metalign. Take a look at the bash scripts here, which show a minimal working example.

chunyuma commented 3 years ago

Thanks for this information, @dkoslicki!

nlapier2 commented 3 years ago

Hi @chunyuma , to follow up on David's answer, the database consisted of every NCBI assembly from the ftp server (including those from both RefSeq and GenBank) that was complete enough to have a taxonomic label and corresponded to a microbial organism's genome (excluding animals and plants, for instance), as of roughly November 2017. It was basically the most comprehensive microbial reference database I could make from NCBI at the time, within reason.

chunyuma commented 3 years ago

Hi @nlapier2, thanks so much for your answer. I'm wondering if the current default database is complete enough to profile a microbial sample containing only viruses or fungi rather than bacteria. I'm currently working on a simple comparison between Metalign and MiCoP (I just realized that you are also the author of that tool), and I want the comparison to be fair. MiCoP focuses mainly on viral and fungal organisms, so for a fair comparison I would like the Metalign database to contain the same genomes as the MiCoP database. Do I need to retrain CMash to build a database containing the same genomes used by MiCoP, or can I use the current database directly?

nlapier2 commented 3 years ago

@chunyuma Ah, interesting! I'd be interested to see what you find. The Metalign database should include everything that's in the MiCoP database and more (i.e. the Metalign database is a superset of the MiCoP database). In fact, MiCoP is RefSeq-based, so Metalign should have even more viral and fungal genomes than MiCoP.

Currently we don't have any way to specify only viruses or fungi for Metalign, so it's not exactly an apples-to-apples comparison. Metalign will take some of the sequence that would normally be mapped to viruses/fungi by MiCoP and instead map it to bacteria, simply because there are a lot more bacterial reference genomes. On the one hand, this helps filter out false positive mappings, but on the other hand, it can create false negatives. So it's not clear to me which will be better. I suspect Metalign will be less prone to bias due to its more complete database, but it may also encounter a "dropout" issue for low-abundance viruses and fungi.

nlapier2 commented 3 years ago

To answer your last question, the most direct way to compare them would be to retrain Metalign on MiCoP's database. In theory this is the most "fair" comparison, but most people would actually use Metalign's default database (which is more comprehensive anyway, as I said), so I think you could justify comparing them with their default databases. Since you said you want a simple comparison, probably the easiest thing to do is to run both of them with their default databases at first. Then, if the Metalign results are bad, you could consider retraining it on MiCoP's database.
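
When both tools are run with their default databases, one way to put the two outputs on a comparable footing is to keep only the viral and fungal taxa in each profile and rescale what remains. The sketch below is a generic illustration under stated assumptions, not part of Metalign or MiCoP: the taxon names, the abundances, and the `group_of` lookup are hypothetical toy data, and profiles are assumed to be plain dictionaries of relative abundances.

```python
def restrict_and_renormalize(profile, keep_groups, group_of):
    """Keep taxa whose group is in keep_groups and rescale them to sum to 1."""
    kept = {taxon: abundance for taxon, abundance in profile.items()
            if group_of.get(taxon) in keep_groups}
    total = sum(kept.values())
    if total == 0:
        return {}  # nothing from the target groups was detected
    return {taxon: abundance / total for taxon, abundance in kept.items()}


# Hypothetical toy data, for illustration only.
group_of = {"phage_A": "virus", "yeast_B": "fungus", "bacterium_C": "bacteria"}
metalign_like = {"phage_A": 0.05, "yeast_B": 0.15, "bacterium_C": 0.80}
micop_like = {"phage_A": 0.30, "yeast_B": 0.70}

targets = {"virus", "fungus"}
a = restrict_and_renormalize(metalign_like, targets, group_of)
b = restrict_and_renormalize(micop_like, targets, group_of)

# L1 distance between the two rescaled profiles over the union of taxa.
l1_distance = sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in set(a) | set(b))
print(a, b, l1_distance)
```

Rescaling removes the effect of reads that one tool assigns to bacteria, but it cannot recover taxa that were dropped entirely, so false negatives still show up as missing entries.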

chunyuma commented 3 years ago

Ah, I see, thanks @nlapier2. Regarding the "dropout" issue for Metalign, I'm wondering if there is a parameter for setting a low-abundance threshold. Since MiCoP doesn't seem to drop any low-abundance fungi or viruses, this might also bias the Metalign results when they are compared with MiCoP's.

> so I think you could justify comparing them with their default databases.

I think you're right, because it's impossible to know which microbes are in a real sample, so using the default database might make more sense. In addition, I'm actually comparing not just these two tools but also others (e.g. MetaPhlAn). I know that in your paper you already compared Metalign with MetaPhlAn2, but there is now a version 3 of MetaPhlAn, which can profile viruses as well.

chunyuma commented 3 years ago

By the way, @nlapier2, how large is the Metalign database? It's been downloading for a while (more than 3 hours). I'm also quite concerned about whether my disk is large enough to store it.

nlapier2 commented 3 years ago

@chunyuma Yeah, it will take several hours. It is about 250GB compressed. It's much larger than the MiCoP database. Metalign does include several filtering options (as does MiCoP) -- see the wiki for examples (including --read_cutoff and --min_abundance).
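
To make the "dropout" trade-off concrete, the sketch below shows what a minimum-abundance cutoff does to a profile in general. It is not Metalign's implementation: the exact semantics and units of `--min_abundance` are documented in the wiki, and the function name, taxon names, and values here are hypothetical.

```python
def apply_min_abundance(profile, min_abundance):
    """Drop every taxon whose abundance falls below the cutoff."""
    return {taxon: abundance for taxon, abundance in profile.items()
            if abundance >= min_abundance}


# Hypothetical profile: one dominant bacterium, a fungus, and a rare virus.
profile = {"bacterium_C": 0.90, "yeast_B": 0.098, "phage_A": 0.002}

# Assumed cutoff expressed as a fraction; the real flag may use other units.
filtered = apply_min_abundance(profile, 0.01)
dropped = sorted(set(profile) - set(filtered))
print("kept:", sorted(filtered), "dropped:", dropped)
```

A higher cutoff removes more spurious low-abundance calls but also removes genuinely rare viruses and fungi, which is the false-positive versus false-negative trade-off discussed earlier in the thread.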

dkoslicki commented 3 years ago

And @chunyuma feel free to use the lab server (or ACI) in case disk space is an issue
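
Before starting a multi-hour download, it can help to check free disk space first. The sketch below uses only the Python standard library. The 250 GB figure comes from the comment above; the headroom factor of 2 (to leave room for decompression) is an assumption, since the uncompressed size is not stated in this thread.

```python
import shutil


def has_room(path, needed_gb, headroom=2.0):
    """Return True if the filesystem holding `path` has enough free space.

    `headroom` multiplies the requirement to leave space for unpacking;
    the default of 2.0 is an assumed safety margin, not a measured value.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_gb * headroom * 1e9


free_gb = shutil.disk_usage(".").free / 1e9
print(f"free space here: {free_gb:.1f} GB")
print("enough for a 250 GB archive plus headroom:", has_room(".", 250))
```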

jfy133 commented 3 years ago

I would like to jump in and ask that, in lieu of a tutorial (which I see is on the TODO list but isn't here yet), the README or the wiki state somewhere that you currently CAN'T input your 'own' database. Or at least add a list of prerequisites for running the tool.

I tried running select_db.py with a bunch of my own FASTA files and kept getting errors about requiring a db_info.txt file, which I didn't understand (I thought I was building my own database).

It's only now that I realise that the db options refer to an already trained CMash database that has to be placed in the GitHub repository. I had thought that the references in the help docs to the default data/ directory were just for an example run...
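
A small pre-flight check can catch this situation before running the tool. The sketch below only tests for `db_info.txt`, the one file named in the error described above; a trained database presumably needs other files as well, and those are not listed in this thread. The function name is hypothetical and not part of Metalign.

```python
import tempfile
from pathlib import Path


def missing_db_files(data_dir, required=("db_info.txt",)):
    """Return the names in `required` that are not regular files in data_dir."""
    root = Path(data_dir)
    return [name for name in required if not (root / name).is_file()]


# Demonstration on a temporary directory rather than a real database.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_db_files(tmp)
    (Path(tmp) / "db_info.txt").write_text("placeholder\n")
    after = missing_db_files(tmp)

print("missing before:", before, "missing after:", after)
```

Pointing `missing_db_files` at a folder of raw FASTA files would report `db_info.txt` as missing, which signals that the folder is not a trained database.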