isayevlab / AIMNet2

GPU Memory Problem #24

Closed cvsik closed 2 months ago

cvsik commented 2 months ago

Hi, thanks for the great contribution!

While trying to run the minimal example mol_single_opt.yml, I ran into the following out-of-memory error, apparently while AIMNet2 builds the neighbor (nb) list:

Traceback (most recent call last):
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/bin/aimnet2pysis", line 33, in <module>
    sys.exit(load_entry_point('aimnet2calc', 'console_scripts', 'aimnet2pysis')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/Projects/aimnet2calc/aimnet2calc/aimnet2pysis.py", line 66, in run_pysis
    run.run()
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/run.py", line 2022, in run
    run_result = run_from_dict(run_dict, **run_kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/run.py", line 1966, in run_from_dict
    run_result = main(run_dict, restart, cwd, scheduler)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/run.py", line 1489, in main
    opt_result = run_opt(
                 ^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/drivers/opt.py", line 143, in run_opt
    opt.run()
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/optimizers/Optimizer.py", line 725, in run
    step = self.optimize()
           ^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/optimizers/RFOptimizer.py", line 74, in optimize
    energy, gradient, H, big_eigvals, big_eigvecs, resetted = self.housekeeping()
                                                              ^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/optimizers/HessianOptimizer.py", line 457, in housekeeping
    gradient = self.geometry.gradient
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/Geometry.py", line 937, in gradient
    return -self.forces
            ^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/Geometry.py", line 903, in forces
    forces = self.cart_forces
             ^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/Geometry.py", line 883, in cart_forces
    results = self.calculator.get_forces(self.atoms, self._coords)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/Projects/aimnet2calc/aimnet2calc/aimnet2pysis.py", line 49, in get_forces
    res = self.model(_in, forces=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/Projects/aimnet2calc/aimnet2calc/calculator.py", line 59, in __call__
    return self.eval(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/Projects/aimnet2calc/aimnet2calc/calculator.py", line 79, in eval
    data = self.prepare_input(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/Projects/aimnet2calc/aimnet2calc/calculator.py", line 98, in prepare_input
    data = self.make_nbmat(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/Projects/aimnet2calc/aimnet2calc/calculator.py", line 161, in make_nbmat
    data['nbmat'] = nblist_torch_cluster(data['coord'], self.cutoff, data['mol_idx'], max_nb=128)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/Projects/aimnet2calc/aimnet2calc/nblist.py", line 35, in nblist_torch_cluster
    sparse_nb = radius_graph(coord, batch=mol_idx, r=cutoff, max_num_neighbors=max_nb).to(torch.int32)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/torch_cluster/radius.py", line 135, in radius_graph
    edge_index = radius(x, x, r, batch, batch,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/torch_cluster/radius.py", line 82, in radius
    return torch.ops.torch_cluster.radius(x, y, ptr_x, ptr_y, r,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/torch/_ops.py", line 854, in __call__
    return self_._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.00 GiB. GPU

I wouldn't have expected to run out of memory that quickly on an A100 40 GB device 😁 The earlier versions (I guess without these nb lists?) ran fine, also on smaller devices.

Do you have any advice on what to do to fix this? Run the nb list creation on CPU only?

Many thanks in advance!

Best, Max
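On the question of building the neighbor list on the CPU: for a system as small as the 22-atom test molecule, even a brute-force O(N^2) search is cheap. The following is a minimal pure-Python sketch of that idea. `brute_force_neighbors` is a hypothetical helper, not part of aimnet2calc, and it does not reproduce the `nbmat` format that `nblist_torch_cluster` returns.

```python
import math


def brute_force_neighbors(coords, cutoff, mol_idx=None):
    """Return, for each atom, the indices of all atoms closer than `cutoff`.

    coords  : list of (x, y, z) tuples
    cutoff  : distance threshold, in the same units as coords
    mol_idx : optional list of molecule ids; pairs from different
              molecules are never treated as neighbors
    """
    n = len(coords)
    if mol_idx is None:
        mol_idx = [0] * n  # treat everything as one molecule
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if mol_idx[i] != mol_idx[j]:
                continue
            if math.dist(coords[i], coords[j]) < cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


# Toy input (not the repo's mol_single.xyz): three collinear atoms, 1.0 apart.
toy = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
nb = brute_force_neighbors(toy, cutoff=1.5)
nb_split = brute_force_neighbors(toy, cutoff=1.5, mol_idx=[0, 0, 1])
print(nb)
print(nb_split)
```

The nested loop does N(N-1)/2 distance evaluations, so it only makes sense for small systems; it is meant as a cross-check, not as a replacement for the GPU path.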

isayev commented 2 months ago

@cvsik, how large is your molecular system? Please also check if another GPU process is consuming the GPU memory.
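One way to check for other GPU processes is to query `nvidia-smi` for the compute processes it currently sees. The sketch below wraps that call; `list_gpu_processes` is a hypothetical helper name, and it returns `None` when `nvidia-smi` is not installed or fails, so it is safe to run on a machine without a GPU.

```python
import shutil
import subprocess


def list_gpu_processes():
    """Return one CSV line per GPU compute process, or None if unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # no NVIDIA driver tools on this machine
    try:
        result = subprocess.run(
            [
                exe,
                "--query-compute-apps=pid,process_name,used_memory",
                "--format=csv,noheader",
            ],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        return None  # nvidia-smi exists but could not be queried
    # Each non-empty line looks like "pid, process_name, used_memory".
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


procs = list_gpu_processes()
if procs is None:
    print("nvidia-smi not available")
elif not procs:
    print("no compute processes on the GPU")
else:
    for line in procs:
        print(line)
```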

cvsik commented 2 months ago

Hi, it's the test system in the repo, mol_single.xyz, with 22 atoms. There are no other processes running on the GPU, just the aimnet2 one.
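For scale, a rough back-of-the-envelope estimate shows how far the 22.00 GiB request is from what a 22-atom neighbor list should need. The sketch assumes 4-byte elements (the traceback casts the result to `torch.int32`) and uses `max_nb=128` from the `make_nbmat` call; the actual internal buffers of torch_cluster may differ.

```python
N_ATOMS = 22          # atoms in mol_single.xyz, per the comment above
MAX_NB = 128          # max_nb passed to nblist_torch_cluster in the traceback
BYTES_PER_ELEM = 4    # assumption: int32 indices / float32 distances

# Dense N x N distance matrix.
dense_bytes = N_ATOMS * N_ATOMS * BYTES_PER_ELEM

# N x max_nb neighbor index matrix.
nbmat_bytes = N_ATOMS * MAX_NB * BYTES_PER_ELEM

# Size of the failed allocation reported by CUDA.
requested_bytes = 22 * 2**30

ratio = requested_bytes // nbmat_bytes

print(f"dense distance matrix: {dense_bytes} bytes")
print(f"neighbor index matrix: {nbmat_bytes} bytes")
print(f"requested allocation:  {requested_bytes} bytes ({ratio}x the index matrix)")
```

Under these assumptions both structures fit in a few kilobytes, so a request that is about two million times larger points at the arguments or the library call rather than at the size of the system.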

I'm using the following pkg versions if that's of any help:

torch                     2.3.0+cu118              pypi_0    pypi
torch-cluster             1.6.3+pt23cu118          pypi_0    pypi
torchaudio                2.3.0+cu118              pypi_0    pypi
torchvision               0.18.0+cu118             pypi_0    pypi

zubatyuk commented 2 months ago

@cvsik : please check if it is fixed now.

cvsik commented 2 months ago

> @cvsik : please check if it is fixed now.

You mean with this commit? No, that's a change I had also made locally just to get it to run at all...

zubatyuk commented 2 months ago

@cvsik : this commit should fix it. If not, could you please provide the arguments passed to the torch_cluster.radius_graph call here
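A small logging helper is one way to capture those arguments. The sketch below is an assumption about how this could be done, not code from the repository: `describe` and `log_call_args` are hypothetical names, and the helper only relies on `shape`, `dtype` and `device` attributes, so it works with torch tensors without importing torch. `FakeTensor` and the cutoff value `5.0` are stand-ins for the demo.

```python
def describe(name, value):
    """Summarize one argument: tensor-like objects by metadata, others by repr."""
    shape = getattr(value, "shape", None)
    if shape is not None:
        return (
            f"{name}: type={type(value).__name__} shape={tuple(shape)} "
            f"dtype={getattr(value, 'dtype', None)} "
            f"device={getattr(value, 'device', None)}"
        )
    return f"{name}: {value!r}"


def log_call_args(**kwargs):
    """Print and return a one-line summary for each keyword argument."""
    lines = [describe(name, value) for name, value in kwargs.items()]
    for line in lines:
        print(line)
    return lines


class FakeTensor:
    """Stand-in for a torch.Tensor so the demo runs without torch."""
    shape = (22, 3)
    dtype = "float32"
    device = "cuda:0"


# Demo with placeholder values; 5.0 is NOT the real cutoff.
lines = log_call_args(coord=FakeTensor(), r=5.0, max_num_neighbors=128)
```

In `nblist.py` this could be called just before `radius_graph`, for example as `log_call_args(coord=coord, batch=mol_idx, r=cutoff, max_num_neighbors=max_nb)`, using the variable names visible in the traceback.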

cvsik commented 2 months ago

Great, that seems to have fixed it. Thanks a lot for the quick response & fix! Much appreciated! 🚀