iqbal-lab-org / BIGSI

BItsliced Genomic Signature Index - Efficient indexing and search in very large collections of WGS data
MIT License

Bigsi for 600000 genomes #12

Closed davidmaimoun closed 2 years ago

davidmaimoun commented 2 years ago

Hello,

I work at a ministry of health on pathogen detection. I read about BIGSI (congratulations!), and I think it could be useful. However, I didn't find the docs very useful (sorry), and the demo link https://bigsi.readme.io/ doesn't work. We need to find the presence of specific genes in 600,000 Salmonella genomes, and I am in charge of finding the best tool to do that. Could you please tell me whether you think it's doable with BIGSI, how much disk space I would need, and how much time (approximately, of course) it would take?

Thank you very much

David

leoisl commented 2 years ago

Hello,

I would recommend using COBS (https://github.com/bingmann/cobs) instead of BIGSI. It is a "reimplementation" of BIGSI, but it uses compacted indexes (i.e. less disk space is required) and is much faster. However, it relies on random access to the disk (e.g. SSD, flash filesystems, etc.), so it wouldn't work well on hard disk filesystems. You also have the option of loading the whole index into memory, which would be heavy on RAM usage but would work on any filesystem. I recently ran a query similar to yours: I queried 367 sequences against a database of 661k bacterial genomes, and that took 22 hours and 3 GB of RAM. But I should warn you that I think this runtime is slow because COBS is constantly accessing the disk (I did not tell it to load the entire index into RAM, as it is 900 GB), and although the disk is an SSD, it is on a shared cluster with hundreds of users, so disk access is known to be slow. If you have a random-access disk to yourself, or one with few users, it should be much faster.
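To put the numbers above into perspective, here is a rough back-of-the-envelope sketch in Python. It only rescales the figures quoted in this comment (661k genomes, 900 GB index, 367 queries in 22 hours) and assumes that index size grows linearly with the number of genomes, which is an assumption rather than something COBS guarantees: the 661k collection mixes many species, so a Salmonella-only index could well come out at a different size. The function names are made up for illustration.

```python
# Rough scaling estimate based only on the figures quoted in this thread.
# ASSUMPTION: index size scales linearly with the number of genomes.

REF_GENOMES = 661_000      # genomes in the reference COBS index
REF_INDEX_GB = 900.0       # size of that index on disk, in GB
REF_QUERIES = 367          # sequences queried in the reference run
REF_RUNTIME_HOURS = 22.0   # wall-clock time of the reference run


def estimated_index_gb(n_genomes: int) -> float:
    """Linearly rescale the reference index size to n_genomes."""
    return REF_INDEX_GB * n_genomes / REF_GENOMES


def minutes_per_query() -> float:
    """Average wall-clock minutes per query in the reference run."""
    return REF_RUNTIME_HOURS * 60.0 / REF_QUERIES


if __name__ == "__main__":
    print(f"Estimated index size for 600k genomes: {estimated_index_gb(600_000):.0f} GB")
    print(f"Average time per query (slow shared disk): {minutes_per_query():.1f} min")
```

Under these assumptions the estimate is on the order of 800 GB of disk and a few minutes per query sequence, with the per-query time being an upper-end figure since it was measured on a slow shared disk.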

davidmaimoun commented 2 years ago

Thank you for reaching out. I'll learn about COBS. I'm new to the bioinformatics field, so before taking on a task as big as this, I prefer to ask the community (I don't want to burn my boss's computer :) ). I'll check with the IT team what the smartest approach is regarding compute resources.

Thank you so much for your help

leoisl commented 2 years ago

No problems! You could subsample your dataset to 1k genomes and try building and querying a COBS index to familiarise yourself with the tool.

Cheers
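A minimal sketch of the subsampling step, in case it helps. It assumes the genomes are stored as one FASTA file per genome in a single directory; the function names, the default glob pattern and the output file name are hypothetical and not part of COBS or BIGSI. It only picks a reproducible random subset and writes the chosen paths to a text file; building and querying the index is left to the COBS documentation.

```python
import random
from pathlib import Path


def subsample_genomes(genome_dir, n=1000, seed=0, pattern="*.fa*"):
    """Return a reproducible random subset of at most n genome files.

    ASSUMPTION: one FASTA file per genome, all directly inside genome_dir.
    """
    files = sorted(p for p in Path(genome_dir).glob(pattern) if p.is_file())
    if len(files) <= n:
        return files  # fewer files than requested: keep them all
    rng = random.Random(seed)  # fixed seed so the subset can be regenerated
    return sorted(rng.sample(files, n))


def write_file_list(paths, out_path):
    """Write one selected path per line, e.g. to copy or symlink the files later."""
    Path(out_path).write_text("".join(f"{p}\n" for p in paths))
```

For example, `write_file_list(subsample_genomes("genomes/", n=1000), "subset_1k.txt")` would produce a list of 1,000 files that can then be copied or symlinked into a separate directory for a test index.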

davidmaimoun commented 2 years ago

Thank you for the tips!