torognes / vsearch

Versatile open-source tool for microbiome analysis

Memory requirements for very large database (> 500GB) #297

Closed Confurious closed 6 years ago

Confurious commented 6 years ago

Hi, I am wondering what the memory requirements are for searching against a very large database (500 GB - 1 TB)? On the query side I can do the splitting, but splitting the database would produce less than optimal results. Thanks

torognes commented 6 years ago

Just for storing the database in memory VSEARCH requires at least 5 bytes of memory for each nucleotide in the database, plus some more for the headers and other information. With a database of 500 GB to 1 TB I think you would need at least 3 to 6 TB of memory. I have never tested running VSEARCH with such a large database.
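As a rough back-of-the-envelope check of that 5 bytes per nucleotide figure (the overhead factor for headers and other bookkeeping below is an assumption, not VSEARCH's exact accounting):

```python
# Rough memory estimate for loading a FASTA database into VSEARCH,
# based on the ~5 bytes per nucleotide figure mentioned above.
# The 20% overhead factor for headers/indexing is an assumption.

def estimate_memory_bytes(total_nucleotides, bytes_per_nt=5, overhead=1.2):
    """Return an approximate in-memory footprint in bytes."""
    return total_nucleotides * bytes_per_nt * overhead

# A 500 GB FASTA file contains very roughly 5e11 nucleotides
# (ignoring headers and newlines).
for fasta_gb in (500, 1000):
    nt = fasta_gb * 1e9
    tb = estimate_memory_bytes(nt) / 1e12
    print(f"{fasta_gb} GB database -> ~{tb:.1f} TB of RAM")
```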

Confurious commented 6 years ago

Ouch, I can't have 3 to 6 TB consistently. Just out of curiosity, what would you recommend if one has to run queries against a very large database? Is splitting the database and combining the results of each query against each chunk (through some sort of rule) the only way? Is that how BLAST handles things when it divides databases into small chunks (max = 2 GB)? I would very much like an alternative to BLAST, and I was hoping vsearch would be the one. Thanks

torognes commented 6 years ago

I think I need some more information about the type of search you want to perform in order to give a good answer. What type of sequences are the query and database sequences, and how long are they? Are you looking only for the top hit, or do you need more hits for each query?

It may certainly be possible to split the database and then combine the results in some way.
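For illustration, the combining step could be as simple as pooling the per-chunk results and keeping the best hits per query. A minimal sketch, assuming each chunk was searched separately into BLAST-6-style tabular output, that hits are ranked by percent identity, and that the file names below are placeholders:

```python
# Merge per-chunk tabular search results (BLAST-6-like columns:
# query, target, %identity, ...) and keep the top N hits per query.
import csv
from collections import defaultdict
from heapq import nlargest

def merge_chunk_hits(chunk_files, n_best=10):
    hits = defaultdict(list)          # query id -> list of (identity, row) tuples
    for path in chunk_files:
        with open(path) as handle:
            for row in csv.reader(handle, delimiter="\t"):
                hits[row[0]].append((float(row[2]), row))
    # Keep only the n_best highest-identity hits per query across all chunks.
    return {query: [row for _, row in nlargest(n_best, rows, key=lambda r: r[0])]
            for query, rows in hits.items()}

# Placeholder file names for three database chunks searched separately.
best_hits = merge_chunk_hits(["chunk1.b6", "chunk2.b6", "chunk3.b6"])
```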

Confurious commented 6 years ago

The aim was basically to construct a large database that includes selected bacteria, viruses, animal genomes, etc., and to be able to classify a DNA fragment from samples of different sources. As a result, the database is larger than usual. The database will be a collection of reference genomes or draft genomes, and the queries will be either reads or contigs assembled from reads. I need more than the top hit in order to make a reasonably conservative taxonomy assignment (using the LCA method).

That's great to hear! Assuming I am to use vsearch for this purpose, what would you recommend basing the pooled decision on? Percentage identity alone? Is there a BLAST-like e-value or bit score in vsearch? Thanks
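For concreteness, here is a minimal sketch of what I mean by the LCA step over the retained hits; the lineage strings and the 1% identity margin are made-up placeholders:

```python
# Naive LCA taxonomy assignment: keep hits within some identity margin of
# the best hit, then take the longest shared prefix of their lineages.

def lca(lineages):
    """Longest common prefix of ';'-separated lineage strings."""
    split = [lineage.split(";") for lineage in lineages]
    common = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return ";".join(common) or "unclassified"

hits = [  # (percent identity, lineage) for one query, pooled over all chunks
    (98.7, "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales"),
    (98.1, "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales"),
    (91.0, "Bacteria;Firmicutes;Bacilli;Lactobacillales"),
]
best = max(identity for identity, _ in hits)
kept = [lineage for identity, lineage in hits if identity >= best - 1.0]  # 1% margin (assumption)
print(lca(kept))  # -> Bacteria;Proteobacteria;Gammaproteobacteria
```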

torognes commented 6 years ago

VSEARCH is designed to work with rather short sequences, like single reads or short fragments. It does not work well with longer sequences, e.g. 5 kb or longer, as it will be rather slow. Including entire genomes in the database is not recommended. I will therefore advise you to find another tool for this.

Confurious commented 6 years ago

Hello, I did not know that the database sequences are supposed to be short fragments too? Or is it just the queries that need to be short? Thanks


torognes commented 6 years ago

Both the queries and database sequences are supposed to be rather short.

Confurious commented 6 years ago

I see. I suppose NT and NR are kind of like collections of short fragments. However, I assume a lot of average users would attempt to make customized databases with reference genomes of bacteria etc., which are still millions of base pairs. Is this because of semi-global alignment instead of local alignment? Thanks


torognes commented 6 years ago

NT and NR contain many long sequences in addition to the short ones.

VSEARCH performs full optimal global alignment of the entire sequences instead of the hit-and-extend approach used by BLAST and other tools. This is why VSEARCH is so slow with long sequences: the alignment takes time proportional to the product of the lengths of the sequences.
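To put rough numbers on that quadratic cost (cell counts only, ignoring heuristics and SIMD speedups; the pairings below are illustrative):

```python
# The dynamic-programming matrix for a full global alignment has roughly
# len(query) * len(target) cells, so long targets blow up quickly.
pairs = [
    ("150 bp read vs 300 bp amplicon", 150, 300),
    ("150 bp read vs 5 kb fragment",   150, 5_000),
    ("150 bp read vs 5 Mb genome",     150, 5_000_000),
]
for label, qlen, tlen in pairs:
    print(f"{label}: {qlen * tlen:,} cells")
# Each extra order of magnitude in target length costs an order of
# magnitude more work per query/target pair.
```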

colinbrislawn commented 6 years ago

If you are looking for faster (but less exact) alignment tools, may I suggest bbmap (https://jgi.doe.gov/data-and-tools/bbtools/)? It was designed for searching very large databases and is wildly fast. If my database were >500 GB I would start there.

If you are willing to spend more time to get more accuracy, you could try minimap2 (https://github.com/lh3/minimap2), written by the famous developer of bwa (https://github.com/lh3/bwa).
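For reference, a minimal minimap2 short-read mapping call might look like the sketch below; the file names are placeholders and the -ax sr preset is an assumption about the read type:

```python
# Minimal sketch: map short reads against a large reference FASTA with
# minimap2's short-read preset ("-ax sr"). File names are placeholders.
import subprocess

reference = "combined_refs.fa"   # placeholder database FASTA
reads = "sample_reads.fq"        # placeholder query reads

with open("aln.sam", "w") as sam:
    subprocess.run(
        ["minimap2", "-ax", "sr", reference, reads],
        stdout=sam,
        check=True,
    )
```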

Confurious commented 6 years ago

Thanks! minimap2 looks like a direct upgrade to bwa-mem, as it excels at both short- and long-read mapping?!


colinbrislawn commented 6 years ago

That's the impression I got from the preprint, but I'm not sure.

Any heuristic local aligner will be faster than vsearch, which is designed for optimal alignments on short reads. These are really different tools for different jobs.

EDIT: According to the author, yes "minimap2 is a much better mapper than bwa-mem in almost every aspect".