lightdock / lightdock-python2.7

Protein-protein, protein-peptide and protein-DNA docking framework based on the GSO algorithm
https://lightdock.org/
GNU General Public License v3.0

Cannot allocate memory #26

Closed qiuzy closed 4 years ago

qiuzy commented 4 years ago

I am trying to test LightDock's performance on my case. I had a successful run with 200 swarms of 100 glowworms each (100 steps). Now I want to run another set with 500 swarms and 200 glowworms in each swarm (200 steps), but it fails very quickly with the error "Cannot allocate memory".

So, how can I solve this problem? Thanks!

brianjimenez commented 4 years ago

Could you please tell us whether this happens in the setup or the simulation step? Any details about your system and hardware would also be much appreciated, so we can give you an accurate solution.

qiuzy commented 4 years ago

Sorry for the incomplete description. The error happens during the simulation step. My case has two proteins, with 618 and 228 residues. H and O atoms are in the PDB file. (--noxt and --noh are specified in the setup step.) After running "lightdock_setup rec.pdb lig.pdb 500 200 --noxt --noh --seed_point 8543234", the setup step succeeds. Then the job is submitted to the local server using 70 cores on one node. The relevant settings in the job submission script are:

#SBATCH -N 1 #### nodes

#SBATCH -n 70 #### procode

lightdock setup.json 200 -s fastdfire -c 70 -min

The local server has an Intel(R) Xeon(R) CPU E5-2650 and about 400 GB of memory. An error is thrown after "Monster spotted". The last 20 lines of the output are:

[kraken] INFO: Tentacle ready with 7 tasks
.....
[kraken] INFO: Tentacle ready with 7 tasks
[kraken] INFO: Tentacle ready with 7 tasks
[kraken] INFO: 500 ships ready to be smashed
[lightdock] INFO: Monster spotted
[kraken] INFO: Release the Kraken!
[lightdock] ERROR: OS error found
[lightdock] ERROR: Lightdock has failed, please check traceback

brianjimenez commented 4 years ago

The system seems quite small, so you should not face any memory issues. A couple of things:

Please, let me know if this fixes your problem.

qiuzy commented 4 years ago

It seems to work. But the setting above fails, and I tried other combinations (40 cores in the SBATCH header, but -c 10). So what is the rule of thumb for setting these values?

brianjimenez commented 4 years ago

Just to be sure everything is in place, have you checked whether you have multiple copies of the molecule in lightdock_receptor.pdb and lightdock_ligand.pdb? If you run the setup step multiple times for the same system, structures are appended to the same PDB file, and that would make your system massive. We are running quite big systems in our small cluster and we don't have any issues with memory. Would you mind sending us your system for testing and debugging?
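
For anyone wanting to run this check quickly, here is a minimal sketch in Python. It is not part of LightDock: the function name count_duplicate_atoms is made up for illustration, and it assumes standard fixed-column PDB ATOM/HETATM records in a single-model file (a multi-model file would be reported as duplicated). It counts how often each atom identifier appears; any count above 1 suggests the structure was appended more than once.

```python
import sys
from collections import Counter


def count_duplicate_atoms(pdb_lines):
    """Return {atom_id: count} for atom identifiers seen more than once.

    Hypothetical helper, not a LightDock function. The identifier is built
    from the fixed PDB columns: atom name (13-16), altLoc (17),
    residue name (18-20), chain (22), residue number (23-26), iCode (27).
    """
    seen = Counter()
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            atom_id = (
                line[12:16],  # atom name
                line[16:17],  # alternate location
                line[17:20],  # residue name
                line[21:22],  # chain identifier
                line[22:26],  # residue sequence number
                line[26:27],  # insertion code
            )
            seen[atom_id] += 1
    return {atom_id: n for atom_id, n in seen.items() if n > 1}


if __name__ == "__main__":
    # Pass the files to check as arguments, e.g.
    # lightdock_receptor.pdb lightdock_ligand.pdb
    for path in sys.argv[1:]:
        with open(path) as handle:
            duplicates = count_duplicate_atoms(handle)
        if duplicates:
            print("%s: %d atoms appear more than once" % (path, len(duplicates)))
        else:
            print("%s: no repeated atoms found" % path)
```

If repeated atoms are reported, removing the generated files and running the setup step again from a clean directory is the obvious thing to try.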

qiuzy commented 4 years ago

pro_rec.pdb.txt prot_lig.pdb.txt

I have attached the receptor and ligand proteins. To upload them, I added .txt as the file suffix; they can be used directly after removing it. To reproduce the error, I tried again with 70 cores set in SLURM, but running with -c 50 as set in the subdock.sh.txt file. As expected, it fails. But if I change to 10 cores, the job runs fine. subdock.sh.txt

I have checked the issue you mentioned above. The protein structure in the PDB file is not repeated. Thanks for your time and effort in debugging!

brianjimenez commented 4 years ago

I've tested your system locally and in our small cluster; memory allocation using the -anm --noxt --noh flags is about 2 to 2.5 GB per core. To work out how many cores you can use, consider how many cores each node has and the total memory available per node. For example, if a node in your cluster has 48 CPU cores, it should have around 120 GB of memory. Make sure no other processes are using the same node/memory.
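
The arithmetic behind that estimate can be sketched as follows. This is only an illustration: the function name and parameters are made up, and the 2.5 GB per core figure is the upper end of what was measured for this particular system with these flags, so other systems will need their own measurement.

```python
import math


def max_cores_by_memory(node_memory_gb, memory_per_core_gb, node_cores,
                        reserved_gb=0.0):
    """Largest core count that fits in the node's memory.

    Hypothetical helper for illustration only.
    node_memory_gb: total memory of the node
    memory_per_core_gb: measured memory use of one core's worth of work
    node_cores: physical cores available on the node
    reserved_gb: memory set aside for the OS or other processes (assumption)
    """
    if memory_per_core_gb <= 0:
        raise ValueError("memory_per_core_gb must be positive")
    usable_gb = max(0.0, node_memory_gb - reserved_gb)
    by_memory = math.floor(usable_gb / memory_per_core_gb)
    # Never ask for more cores than the node actually has.
    return min(node_cores, by_memory)


# A 48-core node with 120 GB can use all 48 cores at 2.5 GB per core.
print(max_cores_by_memory(120, 2.5, 48))
# The same node with only 64 GB is limited by memory instead.
print(max_cores_by_memory(64, 2.5, 48))
```

The resulting number is what would go into both the scheduler request and the -c option, so that the two agree.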

About the system: if you suspect there are some possible binding regions, I'd define some residue restraints (even if vague), which will greatly cut down the resources needed (the total number of swarms) and would probably improve the results considerably.

brianjimenez commented 4 years ago

Please get back to us if you experience any other issues.