Open chunyuma opened 3 years ago
Hi @chunyuma, as for 1., @nlapier2 will be able to explain in more detail, but if I recall correctly, Metalign was trained on a subset of NCBI. There isn't really a single "NCBI database" (as there's RefSeq, the genomes ftp server, the SRA, etc.), so some filtering was done, but @nlapier2 knows the details.
Thanks for this information, @dkoslicki!
Hi @chunyuma , to follow up on David's answer, the database consisted of every NCBI assembly from the ftp server (including those from both RefSeq and GenBank) that was complete enough to have a taxonomic label and corresponded to a microbial organism's genome (excluding animals and plants, for instance), as of roughly November 2017. It was basically the most comprehensive microbial reference database I could make from NCBI at the time, within reason.
Hi @nlapier2, thanks so much for your answer. I'm wondering if the current default database is complete enough to profile a microbial sample that contains only viruses or fungi rather than bacteria. I'm currently working on a simple comparison between Metalign and MiCoP (I just realized that you are also the author of that tool), and I want the comparison to be fair. MiCoP mainly focuses on viral and fungal organisms, so for a fair comparison I would like the Metalign database to contain the same genomes as the MiCoP database. Do I need to retrain CMash to build a database containing the same genomes used by MiCoP, or can I use the current database directly?
@chunyuma Ah, interesting! I'd be interested to see what you find. The Metalign database should include everything that's in the Micop database and more (i.e. the Metalign database is a superset of the Micop database). In fact Micop is RefSeq-based, so Metalign should even have more viral and fungal genomes than Micop.
Currently we don't have any way to specify only viruses or fungi for Metalign, so it's not exactly an apples to apples comparison. Metalign will take some of the sequence that would normally be mapped to viruses/fungi by Micop and instead map it to bacteria, simply because there are a lot more bacterial reference genomes. On the one hand, this helps filter out false positive mappings, but on the other hand, it can create false negatives. So it's not clear to me which will be better. I suspect Metalign will be less prone to bias due to its more complete database, but it may also encounter a "dropout" issue for low-abundance viruses and fungi.
To answer your last question, the most direct way to compare them would be to retrain Metalign on Micop's database. In some sense this is the most "fair", theoretically, but most people would actually use Metalign's default database (which is more comprehensive anyways, as I said), so I think you could justify comparing them with their default databases. Since you said you want a simple comparison, probably the easiest thing to do is to run both of them with their default databases at first. Then if the Metalign results are bad, you could consider retraining it on Micop's database.
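As a rough illustration of the default-database comparison suggested above, the sketch below compares two relative-abundance profiles after restricting both to the taxa of interest (here, viruses and fungi). It assumes nothing about either tool's output format: the taxon names, the abundances, and the helper functions `restrict_and_renormalize` and `l1_distance` are all hypothetical, and renormalizing after restriction is just one possible way to put the two profiles on the same footing.

```python
def restrict_and_renormalize(profile, taxa_of_interest):
    """Keep only the taxa of interest and rescale abundances to sum to 1."""
    kept = {t: a for t, a in profile.items() if t in taxa_of_interest}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {t: a / total for t, a in kept.items()}


def l1_distance(p, q):
    """Sum of absolute abundance differences over the union of taxa."""
    return sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in set(p) | set(q))


# Hypothetical profiles (taxon -> relative abundance), made up for illustration.
profile_a = {"virus_X": 0.10, "fungus_Y": 0.10, "bacterium_Z": 0.80}
profile_b = {"virus_X": 0.50, "fungus_Y": 0.50}
targets = {"virus_X", "fungus_Y"}

a = restrict_and_renormalize(profile_a, targets)
b = restrict_and_renormalize(profile_b, targets)
distance = l1_distance(a, b)
print(a, b, distance)
```

In this made-up case the two profiles agree on the viral and fungal fraction once the bacterial assignments are set aside, even though their raw abundances differ.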
Ah, I see, thanks @nlapier2. Regarding the "dropout" issue for Metalign, I'm wondering if there is a parameter to set a low-abundance threshold. Since Micop does not seem to drop any low-abundance fungi or viruses, this might also bias the Metalign results when compared with Micop.
> so I think you could justify comparing them with their default databases.
I think you're right, because it's impossible to know which microbes are in a real sample, so using the default databases might make more sense. In addition, I'm not comparing only these two tools but also including others (e.g. MetaPhlAn). I know that in your paper you already compared Metalign with MetaPhlAn2, but there is now a version 3 of MetaPhlAn, which can profile viruses as well.
By the way, @nlapier2, how large is the Metalign database? It's been downloading for a while (more than 3 hours). I'm also quite concerned about whether my disk is large enough to store it.
@chunyuma Yeah, it will take several hours. It is about 250GB compressed. It's much larger than the Micop database. Metalign does include several filtering options (as does Micop) -- see the wiki for examples (including --read_cutoff and --min_abundance).
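To illustrate the trade-off between filtering and "dropout" discussed in this thread, here is a minimal sketch of a post-hoc abundance cutoff applied to a profile. It is not Metalign's implementation of --min_abundance; the function name `apply_min_abundance`, the threshold value, and the data are hypothetical.

```python
def apply_min_abundance(profile, min_abundance):
    """Drop taxa below the cutoff; return (kept, dropped) without renormalizing."""
    kept = {t: a for t, a in profile.items() if a >= min_abundance}
    dropped = sorted(t for t in profile if t not in kept)
    return kept, dropped


# Hypothetical profile (taxon -> relative abundance), made up for illustration.
profile = {"bacterium_A": 0.950, "fungus_B": 0.049, "virus_C": 0.001}

kept, dropped = apply_min_abundance(profile, min_abundance=0.01)
print("kept:", kept)
print("dropped:", dropped)
```

A cutoff like this removes spurious low-abundance calls, but a genuinely present low-abundance virus is removed in exactly the same way, so the thresholds used for each tool should be reported alongside any comparison.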
And @chunyuma feel free to use the lab server (or ACI) in case disk space is an issue
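Before starting a multi-hour download, it can help to check free disk space first. The sketch below uses Python's standard `shutil.disk_usage`; the 250 GB figure comes from the comment above, while the `has_enough_space` helper and the `headroom` factor are assumptions (the decompressed size is not stated in this thread, so adjust the factor as needed).

```python
import shutil


def has_enough_space(path, required_gb, headroom=2.0):
    """Return (ok, free_gb): whether `path` has required_gb * headroom free."""
    usage = shutil.disk_usage(path)
    free_gb = usage.free / 1e9
    return free_gb >= required_gb * headroom, free_gb


# 250 GB compressed (from the thread); the headroom of 2.0 is an assumed margin.
ok, free_gb = has_enough_space(".", required_gb=250)
print(f"free: {free_gb:.0f} GB, enough: {ok}")
```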
I would like to jump in and ask that, in lieu of a tutorial (which I see is on the TODO list but isn't here yet), the README or the wiki state somewhere that you currently CAN'T input your 'own' database. Or at least add a list of prerequisites for running the tool.
I tried running it with a bunch of my own fastas for the select_db.py and kept getting errors about requiring a db_info.txt file, which I didn't understand (when I thought I was trying to build my own database). It's only now that I realise that the db options refer to an already trained CMash database that has to be placed in the github repository. I had thought that the references in the help docs to the default data/ were just for an example run...
Hi @nlapier2 and @dkoslicki,
This is Chunyu and I have two questions regarding Metalign.
1. I'm wondering if the default database used by Metalign includes the genomes of all organisms in the NCBI database.
2. To run Metalign, is it possible to build a customized database that focuses mainly on all viruses and fungi in the NCBI taxonomy database? Perhaps this would require running CMash on all of these genomes (viruses and fungi). I'm wondering if you have a script to easily build such a customized database. If so, could you please provide some instructions on how to build it? Also, which files within /data would I need to change to use this customized database?
Thank you so much!