seqan / lambda

LAMBDA – the Local Aligner for Massive Biological DatA
https://github.com/seqan/lambda

lambda3 mkindexn on a large fasta file #229

Open ArmandBester opened 3 months ago

ArmandBester commented 3 months ago

Dear lambda creators,

I think I may be missing something. I am trying to create a nucleotide index on a 677G FASTA (nt) file and, as expected, I get the following warning:

WARNING: Your sequence file is already larger than your physical memory!
         This means you will likely encounter a crash with "bad_alloc".
         Split you sequence file into many smaller ones or use a computer
         with more memory!

Output of free -h:
              total        used        free      shared  buff/cache   available
Mem:          503Gi        31Gi       432Gi       4.1Gi        39Gi       466Gi
Swap:          31Gi       1.8Gi        30Gi

My questions are, if I split the FASTA file into, say, 3 parts and create separate indexes:

Kind regards, Armand

h-2 commented 3 months ago

Dear Armand,

Even assuming that you manage to create the database, what is your use case for it? Unless you search >10GB of query sequences, your program runtime will be dominated by just loading the database, which will take very long because it is going to be around 2TB in total.

If you search very large query files, this could still be worth it, but you will need to split the database, run the searches individually and then manually merge the output files. In such a case, I would recommend using m8 output, reducing the desired number of hits per query, and then merging the files with a combination of the shell commands sort (increase its allowed memory usage and number of threads) and awk (for filtering).
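To make the merge step concrete, here is a minimal Python sketch of the same idea as the sort and awk pipeline (swapped in only because it is easier to show end to end). It is not part of lambda. It assumes the m8 files follow the standard 12-column BLAST tabular layout, with the query ID in column 1 and the bit score in column 12, and that hits are ranked by bit score. The names `merge_hits` and `max_hits` are made up for this illustration.

```python
import heapq
from collections import defaultdict


def merge_hits(sources, max_hits=25):
    """Merge m8 (BLAST tabular) hits from several searches.

    `sources` is an iterable of line iterables (open files or lists).
    Keeps at most `max_hits` hits per query, ranked by bit score
    (column 12, an assumption about the m8 layout).
    """
    best = defaultdict(list)  # query id -> min-heap of (bitscore, -order, line)
    order = 0
    for source in sources:
        for raw in source:
            line = raw.rstrip("\n")
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comment lines
            fields = line.split("\t")
            query, bitscore = fields[0], float(fields[11])
            order += 1
            # -order makes the earliest-seen hit win when bit scores tie
            entry = (bitscore, -order, line)
            heap = best[query]
            if len(heap) < max_hits:
                heapq.heappush(heap, entry)
            else:
                heapq.heappushpop(heap, entry)  # drop the current worst hit

    merged = []
    for query in sorted(best):
        # best hit first within each query
        for _, _, line in sorted(best[query], reverse=True):
            merged.append(line)
    return merged


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python merge_m8.py part1.m8 part2.m8
    files = [open(path) for path in sys.argv[1:]]
    try:
        for hit in merge_hits(files, max_hits=25):
            print(hit)
    finally:
        for handle in files:
            handle.close()
```

This keeps only the current best hits per query in memory, so it scales with the number of queries rather than with the total output size. For very large outputs the sort and awk route described above is likely still the better fit.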

If you want to proceed with splitting the index, I would suggest the following:

If you have any further questions, feel free to ask :)