lambda3 mkindexn on a large fasta file

Dear lambda creators

I think I may be missing something. I am trying to create a nucleotide index on a 677G fasta (nt) file and I get the expected error:

WARNING: Your sequence file is already larger than your physical memory!
         This means you will likely encounter a crash with "bad_alloc".
         Split you sequence file into many smaller ones or use a computer
         with more memory!

free -h
              total        used        free      shared  buff/cache   available
Mem:          503Gi        31Gi       432Gi       4.1Gi        39Gi       466Gi
Swap:          31Gi       1.8Gi        30Gi

My questions are, if I split the fasta file into, say, 3 parts and create separate indexes:

1. How would I run the search against the 3 .lba files?
2. Would I not still have too little memory?

Kind regards, Armand

Dear Armand,

Even assuming that you manage to create the database, what is your use case for it? Unless you search more than 10GB of query sequences, your program runtime will be dominated by just loading the database, which will take a very long time because it is going to be around 2TB in total.

If you search very large query files, this could still be worth it, but you will need to split the database, run the searches individually, and then manually merge the output files. In that case, I would recommend using m8 output, reducing the desired number of hits per query, and then merging the files with a combination of the shell commands sort (increase its allowed memory usage and number of threads) and awk (for filtering).
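The merge step could look roughly like the sketch below. It assumes the m8 files follow the usual 12-column BLAST tabular layout (query ID in column 1, bit score in column 12) and that ranking hits by bit score is what you want. `merge_m8`, `SORT_OPTS` and the file names are hypothetical and not part of lambda; `-S` and `--parallel` are GNU sort options.

```shell
# Merge several m8 result files and keep at most N hits per query.
# Usage (hypothetical names): merge_m8 N chunk_01.m8 chunk_02.m8 ... > merged.m8
# Assumption: column 1 = query ID, column 12 = bit score (BLAST tabular layout).
# Optional, GNU sort only: SORT_OPTS="-S 50% --parallel=16" for more memory/threads.
merge_m8() {
    max_hits="$1"
    shift
    # Group by query ID, then order each group by bit score, best first.
    # $SORT_OPTS is left unquoted on purpose so it splits into separate options.
    LC_ALL=C sort $SORT_OPTS -t "$(printf '\t')" -k1,1 -k12,12gr "$@" |
        # Reset the counter whenever the query ID changes; print the first N lines per query.
        awk -F '\t' -v n="$max_hits" '$1 != q { q = $1; c = 0 } ++c <= n'
}
```

Bit score is used as the sort key here rather than e-value, because e-values depend on the size of the database that was searched and so may not be directly comparable between chunks.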

If you want to proceed with splitting the index, I would suggest the following:

Try with a small chunk (~30GB) first. Use /usr/bin/time -v to measure runtime and peak memory usage (the "MaxRSS" value, which GNU time reports as "Maximum resident set size").
This will give you an indication of whether the time constraints are viable for you and how large you can make the chunks in a production setting.
I would definitely recommend using .lba.gz to reduce the on-disk size of the index files. This may even make loading faster.
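To read the peak memory out of such a test run, something like the following sketch may help. It assumes GNU time, whose `-o` option writes the report to a file, and that the report contains the line "Maximum resident set size (kbytes): N". `max_rss_gib` and the log file name are hypothetical, and the actual lambda3 command is left as a placeholder.

```shell
# Record the resource report of a test run in a log file (GNU time):
#   /usr/bin/time -v -o index.time.log <your lambda3 mkindexn command on the small chunk>
#
# Print the peak memory from that log in GiB.
# Assumption: the log contains "Maximum resident set size (kbytes): N".
max_rss_gib() {
    awk -F': ' '/Maximum resident set size/ { printf "%.2f\n", $2 / 1048576 }' "$1"
}

# Usage: max_rss_gib index.time.log
```

Comparing this figure for the ~30GB chunk against the available memory gives a rough idea of how far the chunk size can be raised, although memory use need not scale exactly linearly with input size.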

If you have any further questions, feel free to ask :)

seqan / lambda

lambda3 mkindexn on a large fasta file #229