ur-whitelab / hoomd-tf

A plugin that allows the use of TensorFlow in HOOMD-blue for GPU-accelerated ML+MD
https://hoomd-tf.readthedocs.io
MIT License

CG Learning Model Runs Very Slowly #315

Closed RainierBarrett closed 3 years ago

RainierBarrett commented 3 years ago

The example code I linked here runs on GPU, but extremely slowly (TPS ~40 on our HPC machine). I did some profiling with nvprof; output attached. As we can see from this output, about 50% of the time spent on CUDA API calls is in 'cudaLaunchKernel' and 'cuLaunchKernel'. This leads me to believe something might be launching too many kernel calls unnecessarily. I'd love some help digging into this!

NV_PROFILE_HTF_Online_Demo.txt

whitead commented 3 years ago

Hi @RainierBarrett! Took a quick look through this. I believe only the GPU calls at the top are relevant; the API call timings appear to include the time spent on the GPU computations themselves. You'll notice the sum of time in GPU calls is about the same as the time in cudaLaunchKernel and cuLaunchKernel.

It looks like the time spent on GPU is mostly on computing the neighbor lists (topk is only done for nlist). I believe the htf nlist computation is O(N^2) because it was assumed to be negligible in a mapped system where N would be small. One fix might be to turn on XLA to get operator fusion of the neighbor-list code so that it requires fewer operations. Another fix would be to ensure you only have the particles you need to train on in the system (maybe solvent or something is also there?). Lastly, you could cache the nlists (probably on disk, not in memory) and then you would only need to compute them once.
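For reference, a minimal sketch of turning on XLA auto-clustering with standard TensorFlow knobs; whether it actually fuses the nlist ops (or breaks training, as discussed below) will depend on the model.

```python
# Sketch: enable XLA auto-clustering. Both options are standard TF settings,
# not htf-specific; results may vary by model and TF version.
import os
os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=2"  # must be set before TF initializes

import tensorflow as tf
tf.config.optimizer.set_jit(True)  # enable XLA auto-clustering in-process
```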

Probably can be addressed as part of #279

RainierBarrett commented 3 years ago

Hmm, there's no solvent in this sim so that's not it. I'll try with the XLA command line flags and see if that helps any.

RainierBarrett commented 3 years ago

Alright, so the XLA flag --tf_xla_auto_jit=2 does make the training sim run faster, but it also causes learning to completely fail: all parameters come out as zeros every step. Have you guys run into anything like this? As for the nlist scaling, is there any possibility of dispatching to a separate HOOMD instance with the CG positions? I know they already have some fancy neighbor-list optimizations, so it would make sense not to reinvent the wheel if we can get away with it...

whitead commented 3 years ago

Would it be enough to cache the nlist, or can you not even make it through one loop? We could do that in HTF, but maybe we'll loop in Joshua on this, because it's quite a bit of heavy lifting to get a system to the state where you can compute nlists.

I have noticed XLA messing things up in the past; that's why we turned it off by default.

RainierBarrett commented 3 years ago

I'm not sure I follow what you mean for caching the neighbor list. Do you mean build one and use it for a while but check on it and rebuild every so often?

whitead commented 3 years ago

The features (model inputs) are the nlist and positions, and your labels are the forces. You could compute these all once before training and store them as a TF dataset or numpy arrays. Then, when training, you just load these. This would "cache" the nlist, and you would only need to compute it once for each frame. I'm happy to write a function that does this, since it's a common problem for @mehradans92's work too. However, my question was whether you can even do this once. Is it so slow that the nlist computation is impossible?
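A rough sketch of that caching workflow: precompute nlist/positions/forces per frame, write them to disk, and train from the cached arrays. The `htf.compute_nlist` argument order here is assumed (check the htf docs), and `cache_frames`/`load_dataset` are hypothetical helper names, not part of the library.

```python
# Sketch: "cache" the features once, then train from disk instead of
# recomputing the nlist every epoch.
import numpy as np
import tensorflow as tf
import hoomd.htf as htf

def cache_frames(frames, r_cut, NN, box, out_path="cached_features.npz"):
    nlists, positions, forces = [], [], []
    for pos, frc in frames:  # (positions, forces) per frame
        nl = htf.compute_nlist(pos, r_cut, NN, box)  # assumed call signature
        nlists.append(np.asarray(nl))
        positions.append(pos)
        forces.append(frc)
    np.savez(out_path, nlist=np.stack(nlists),
             positions=np.stack(positions), forces=np.stack(forces))

def load_dataset(path="cached_features.npz", batch_size=4):
    data = np.load(path)
    ds = tf.data.Dataset.from_tensor_slices(
        ((data["nlist"], data["positions"]), data["forces"]))
    return ds.shuffle(256).batch(batch_size)
```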

If this won't solve your problem, I do have a self-contained C program that efficiently computes neighbor lists in O(N). We could add it to the htf C++ module.
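For context, the O(N) approach is the usual cell-list trick: bin particles into cells with edge length >= r_cut, then compare each particle only against the 27 surrounding cells. A simplified numpy illustration of the idea (orthorhombic box, positions assumed wrapped into [0, box), at least three cells per dimension), not the C code mentioned above:

```python
# Sketch of a cell-list neighbor search, O(N) instead of O(N^2).
import numpy as np

def cell_list_pairs(positions, box, r_cut):
    # cells per dimension; each cell edge is >= r_cut
    ncell = np.maximum((box // r_cut).astype(int), 1)
    # assign each particle to a cell (wrap to be safe at the box edge)
    cell_of = (positions / box * ncell).astype(int) % ncell
    cells = {}
    for i, c in enumerate(map(tuple, cell_of)):
        cells.setdefault(c, []).append(i)
    pairs = []
    for i, c in enumerate(cell_of):
        # only the 27 surrounding cells need to be checked
        for off in np.ndindex(3, 3, 3):
            nb = tuple((c + np.array(off) - 1) % ncell)
            for j in cells.get(nb, []):
                if j <= i:
                    continue  # count each pair once
                d = positions[i] - positions[j]
                d -= box * np.round(d / box)  # minimum-image convention
                if d @ d < r_cut * r_cut:
                    pairs.append((i, j))
    return pairs
```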

RainierBarrett commented 3 years ago

Ah, I understand now. So that would mean moving to offline training instead of training in tandem with a simulation. Yes, I'm able to run the simulations, just slowly, so I suppose that is an option, but I'm not sure it would really speed anything up; it would just move the rate-limiting step somewhere else.

whitead commented 3 years ago

I'm not following: if you're doing training while simulating, you shouldn't be using the compute_nlist function. That's only there for when you're reading offline from a trajectory. Could there be a bug in the model so that it's calling that instead of using the hoomd-blue nlist?

whitead commented 3 years ago

Just to be clear, I believe compute_nlist is being called because you have this line:

                   12.99%  10.5472s     10000  1.0547ms  1.0037ms  1.1120ms  void tensorflow::impl::TopKKernel<double>(double const *, int, int, bool, tensorflow::impl::TopKKernel<double>*, int*)

which only appears in the compute_nlist code.
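To see why the TopKKernel time points at compute_nlist: building a neighbor list from all pairwise distances and then taking the k nearest is O(N^2) in time and memory. The names below are illustrative, not the actual htf implementation:

```python
# Sketch: naive top_k-based neighbor list (self-distances not excluded here).
import tensorflow as tf

def naive_nlist(positions, k):
    # positions: (N, 3); full pairwise distance matrix is (N, N)
    diff = positions[:, None, :] - positions[None, :, :]
    dist2 = tf.reduce_sum(diff * diff, axis=-1)
    # top_k on -dist2 picks the k smallest distances per particle,
    # which is where the TopKKernel time in the profile comes from
    _, idx = tf.math.top_k(-dist2, k=k)
    return idx
```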

whitead commented 3 years ago

Also, it occurs to me that you have doubles everywhere, which also slows performance. Probably not by a lot, but it can depend on the card.
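In other words, float64 inputs force every kernel to run in double precision, which is much slower on most GPUs; casting to single precision usually recovers the speed. A generic TF example, not htf-specific:

```python
# Sketch: keep model inputs in single precision for GPU speed.
import tensorflow as tf

positions64 = tf.random.uniform((1024, 3), dtype=tf.float64)
positions32 = tf.cast(positions64, tf.float32)  # float32 for faster kernels
```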

whitead commented 3 years ago

Switched to #317