sirimullalab / DLSCORE

DLSCORE: A deep learning based scoring function for predicting protein-ligand binding affinity
MIT License

Performance issues #6

Closed eventfulsean closed 4 years ago

eventfulsean commented 5 years ago

Hi,

Thank you for releasing this code. We have been trying to use it, but found that DLSCORE is very slow, even with the NNScore component disabled: around 5 seconds per compound. Furthermore, we have not achieved good scaling with multiple parallel instances on the same host, observing little speed-up (<50%) when splitting the job across 10 CPUs (10 concurrent DLSCORE runs) and a plateau around 20 CPUs.

Is this something you've observed as well and could you give some pointers as to how to improve performance?

Thanks!

hassanmohsin commented 5 years ago

Thanks for trying DLSCORE. It uses Keras to load and run the models and is optimized for the TensorFlow backend.

If you use the GPU version of TensorFlow, the running time will be significantly higher, because the data transfer between CPU and GPU takes longer than the actual model execution. Moreover, smooth scaling is not possible with only one or two GPUs.

We recommend the CPU version of TensorFlow instead. For scaling, please make sure that each CPU runs only one TensorFlow session. We have tested our code on thousands of CPUs on a cluster and observed no scaling issues.
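One way to enforce the one-session-per-CPU rule is to pin each DLSCORE process to its own core and cap the math-library threads via the environment. This is only a sketch: the `dlscore.py` entry point and its flags are assumptions for illustration, not the project's actual invocation.

```python
# Illustrative sketch: one pinned, single-threaded DLSCORE process per core.
# The dlscore.py script name and --ligand flag are hypothetical.
import os
import subprocess


def worker_env():
    """Environment for one single-threaded CPU TensorFlow instance."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = "1"      # cap MKL/OpenMP threads per process
    env["CUDA_VISIBLE_DEVICES"] = ""  # hide GPUs: force the CPU code path
    return env


def launch(ligand_files, script="dlscore.py"):
    """Start one DLSCORE process per ligand, each pinned to its own core."""
    procs = []
    for core, ligand in enumerate(ligand_files):
        cmd = ["taskset", "-c", str(core), "python", script, "--ligand", ligand]
        procs.append(subprocess.Popen(cmd, env=worker_env()))
    for p in procs:
        p.wait()
```

Pinning with `taskset` keeps concurrent instances from migrating onto each other's cores, which is one plausible cause of the sub-linear scaling reported above.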

eventfulsean commented 5 years ago

Thanks for the quick response! We have been trying to replicate your implied performance with the CPU version of TensorFlow. We initially ran into the issues you mentioned with competing TensorFlow instances on a single (multi-core) CPU. We haven't tested it yet, but we hypothesize that we can work around this by running DLSCORE inside Docker. Any thoughts on how to circumvent these TensorFlow limitations would be appreciated.

Also, when you say you have tested on "thousands of CPUs on a cluster", do you mean one DLSCORE instance running on each node? We observed similar behavior, but we ran 8 DLSCORE instances on a 16-core node and saw about 10 sec/instance/compound (10 networks). It might be quicker with a single DLSCORE instance per node, but our benchmarks suggest a better cost/return at a 1:2 ratio of instances to cores.

If at all possible, could you share with us how you're calling DLSCORE in the distributed job?

Thanks! Sean

hassanmohsin commented 5 years ago

Hi Sean, TensorFlow is designed to run as a single instance across a multi-core processor. Running in a Docker container would be a good idea, but I would suggest allocating at least 8 threads to each instance. Also keep an eye on the available memory when running many instances.

Yes, we ran one TensorFlow instance per node. We used the TACC launcher (https://github.com/TACC/launcher) for that. I would suggest consulting its test scripts (https://github.com/TACC/launcher/tree/master/tests) to get an idea of how to run multiple instances on a cluster. Feel free to comment if you run into any issues with that.
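For context, the TACC launcher consumes a plain job file with one shell command per line and farms those lines out across the allocated nodes. A minimal sketch of generating such a file follows; the `dlscore.py` command and its flags are hypothetical, and the SLURM-side wiring is shown only as comments.

```python
# Hypothetical sketch: build a TACC launcher job file, one DLSCORE
# command per line; launcher distributes the lines across cluster nodes.
from pathlib import Path

# Placeholder ligand inputs for illustration
ligands = [f"lig_{i:04d}.pdbqt" for i in range(1, 4)]

commands = [
    f"python dlscore.py --ligand {lig} --receptor receptor.pdbqt"
    for lig in ligands
]
Path("dlscore_jobs").write_text("\n".join(commands) + "\n")

# Inside the SLURM batch script, one would then point launcher at the file:
#   export LAUNCHER_WORKDIR=$PWD
#   export LAUNCHER_JOB_FILE=$PWD/dlscore_jobs
#   $LAUNCHER_DIR/paramrun
```

Because each line becomes an independent process on its own slot, this setup naturally satisfies the one-TensorFlow-session-per-CPU guideline mentioned earlier in the thread.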

Thanks, Hassan