isayevlab / AIMNet2

MIT License
87 stars 24 forks source link

Hessian computation fails #25

Closed cvsik closed 3 months ago

cvsik commented 3 months ago

Hi again,

are there any special things to consider when requesting a Hessian?

I tried this with the mol_single_opt.yml example, requesting a Hessian computation after optimization:

calc:
  type: aimnet
  model: aimnet2_b973c

geom:
  type: dlc
  fn: mol_single.xyz

opt:
  type: rfo
  max_cycles: 50
  do_hess: True

After the optimization is done, I get the following traceback:

... started Hessian calculation
Traceback (most recent call last):
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/bin/aimnet2pysis", line 33, in <module>
    sys.exit(load_entry_point('aimnet2calc', 'console_scripts', 'aimnet2pysis')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/Projects/aimnet2calc/aimnet2calc/aimnet2pysis.py", line 66, in run_pysis
    run.run()
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/run.py", line 2022, in run
    run_result = run_from_dict(run_dict, **run_kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/run.py", line 1966, in run_from_dict
    run_result = main(run_dict, restart, cwd, scheduler)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/run.py", line 1489, in main
    opt_result = run_opt(
                 ^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/drivers/opt.py", line 213, in run_opt
    do_final_hessian(
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/helpers.py", line 465, in do_final_hessian
    hessian = geom.cart_hessian
              ^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/pysisyphus/Geometry.py", line 959, in cart_hessian
    results = self.calculator.get_hessian(self.atoms, self._coords)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/Projects/aimnet2calc/aimnet2calc/aimnet2pysis.py", line 56, in get_hessian
    res = self.model(_in, forces=True, hessian=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/Projects/aimnet2calc/aimnet2calc/calculator.py", line 59, in __call__
    return self.eval(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/Projects/aimnet2calc/aimnet2calc/calculator.py", line 85, in eval
    data = self.get_derivatives(data, forces=forces, stress=stress, hessian=hessian)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/Projects/aimnet2calc/aimnet2calc/calculator.py", line 225, in get_derivatives
    data['hessian'] = self.calculate_hessian(data['forces'], self._saved_for_grad['coord'])
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/Projects/aimnet2calc/aimnet2calc/calculator.py", line 233, in calculate_hessian
    torch.autograd.grad(_f, coord, retain_graph=True)[0]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/torch/autograd/__init__.py", line 412, in grad
    result = _engine_run_backward(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/aimnet2/lib/python3.12/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: the derivative for '_cdist_backward' is not implemented.

Any idea how to get around the missing deriv. impl. in pytorch? Thanks!

zubatyuk commented 3 months ago

What is your pytorch version? Do you use CUDA? It works for me both on CPU and GPU.

  pytorch                        2.3.1           py3.11_cuda12.1_cudnn8.9.2_0  pytorch    
  pytorch-cluster                1.6.3           py311_torch_2.3.0_cu121       pyg        
  pytorch-cuda                   12.1            ha16c6d3_5                    pytorch    
cvsik commented 3 months ago

I'm using these versions from pip (followed the torch installation instructions from their site):

torch                     2.3.1+cu118              pypi_0    pypi
torch-cluster             1.6.3+pt23cu118          pypi_0    pypi
torchaudio                2.3.1+cu118              pypi_0    pypi
torchvision               0.18.1+cu118             pypi_0    pypi

And yes, I'm running on CUDA, however, I get the same error on CPU. I'll investigate this further, maybe nuke my conda environment and start from scratch.

cvsik commented 3 months ago

I get the same error with these package sources:

pytorch                   2.3.1           py3.11_cuda12.1_cudnn8.9.2_0    pytorch
pytorch-cluster           1.6.3           py311_torch_2.3.0_cu121    pyg
pytorch-cuda              12.1                 ha16c6d3_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
torchtriton               2.3.1                     py311    pytorch
cvsik commented 3 months ago

The only spot where torch.cdist occurs (in aimnet2calc) is in nblists_torch_pbc. As far as I know, the computation failing above is not using PBC, so I'm wondering why torch.cdist even occurs in the graph. I'll investigate further 😁

zubatyuk commented 3 months ago

@cvsik I cannot reproduce your error yet. It works for with with pytorch 2.3 and 2.2. I'll investigate further.

cvsik commented 3 months ago

Maybe another piece of information: Here's the SHA256 of the model I'm using: 4d93cacf3e8517c5ddd79e9dae1a0482cb722c3d6d6f86f2b2082abbb3c436e4 assets/aimnet2/aimnet2_b973c_0.jpt

Are you using the same model file? Could that be a possible source for the failure?

zubatyuk commented 3 months ago

Just in case, can you delete this file. It will be re-downloaded on first call. Need to add a function to clean up model cache.

On Thu, Jul 11, 2024, 11:42 AM Maximilian Scheurer @.***> wrote:

Maybe another piece of information: Here's the SHA256 of the model I'm using: 4d93cacf3e8517c5ddd79e9dae1a0482cb722c3d6d6f86f2b2082abbb3c436e4 assets/aimnet2/aimnet2_b973c_0.jpt

Are you using the same model file? Could that be a possible source for the failure?

— Reply to this email directly, view it on GitHub https://github.com/isayevlab/AIMNet2/issues/25#issuecomment-2223521323, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA7FDQYQFQJSE5PSB6KI3VDZL27ZLAVCNFSM6AAAAABKSG2B6CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRTGUZDCMZSGM . You are receiving this because you commented.Message ID: @.***>

cvsik commented 3 months ago

Removed the file, but that didn't change anything :(

zubatyuk commented 3 months ago

@cvsik to investigate, please do: show output of command python -c "import torch; print(torch.__version__); print(torch.cuda.get_device_name())" try with these jpt files (download and provide full path in AIMNet2Caclculator or in model option in pysis yaml): https://drive.google.com/file/d/1zfVbAFl9hDtchgDcLfj5Bk-5j9JPhuZq/view?usp=sharing https://drive.google.com/file/d/16yefZW8OqkRV7xeRS5Qhk4-YhpAz_11j/view?usp=sharing

cvsik commented 3 months ago

@cvsik to investigate, please do: show output of command python -c "import torch; print(torch.__version__); print(torch.cuda.get_device_name())"

On the GPU device, I get:

2.3.1+cu118
NVIDIA A100-SXM4-40GB

try with these jpt files (download and provide full path in AIMNet2Caclculator or in model option in pysis yaml): https://drive.google.com/file/d/1zfVbAFl9hDtchgDcLfj5Bk-5j9JPhuZq/view?usp=sharing https://drive.google.com/file/d/16yefZW8OqkRV7xeRS5Qhk4-YhpAz_11j/view?usp=sharing

The file t_a.jpt fails to compupte a Hessian on both CPU and GPU. The file t_b.jpt succeeds to compute a Hessian on both CPU and GPU (the geometry optimization with that model takes a couple more iterations though).

zubatyuk commented 3 months ago

Updated models in aimnet-model-zoo repository. Clean your local cache of jpt files to re-download on first use. Please confirm that the solution works.

cvsik commented 3 months ago

Thanks! The new files seem to work.

However, I'm a bit worried about the optimization convergence now. With the old files, the optimization output was

Spent 0.0 s preparing the first cycle.
       cycle    Δ(energy) max(|force|)   rms(force)  max(|step|)    rms(step)      s/cycle
           0          nan     0.018289     0.004151     0.060453     0.020601         0.64
           1    -0.001423     0.011384     0.001937     0.086822     0.025547         0.12
           2    -0.000734     0.004380     0.001427     0.101807     0.031504         0.14
           3    -0.000525     0.006071     0.001648     0.172282     0.054327         0.13
           4    -0.000157     0.007260     0.001905     0.108513     0.026761         0.07
           5    -0.000476     0.005253     0.001164     0.028623     0.011377         0.16
           6    -0.000164     0.002196     0.000491     0.036787     0.011066         0.15
           7    -0.000055     0.000669     0.000231     0.026117     0.010106         0.15
           8    -0.000029     0.000477     0.000170     0.021280     0.007841         0.18
       Converged!

Final summary:
                max(forces, internal): 0.000477 hartree/(bohr,rad)
                rms(forces, internal): 0.000170 hartree/(bohr,rad)
                max(forces,cartesian): 0.001113 hartree/bohr
                rms(forces,cartesian): 0.000316 hartree/bohr
                energy: -591.29441587 hartree

and with the new files:

Spent 0.0 s preparing the first cycle.
       cycle    Δ(energy) max(|force|)   rms(force)  max(|step|)    rms(step)      s/cycle
           0          nan     0.017750     0.004166     0.086080     0.024756         1.02
           1    -0.001863     0.011136     0.002021     0.122529     0.038297         0.05
           2    -0.000282     0.017567     0.003144     0.042002     0.012152         0.02
           3    -0.000923     0.004899     0.001212     0.069663     0.016265         0.02
           4    -0.000374     0.006398     0.001196     0.070393     0.016137         0.02
           5    -0.000625     0.001789     0.000775     0.118025     0.038432         0.02
           6    -0.003792     0.020059     0.003927     0.107212     0.032400         0.02
       Unexpected energy increase (0.004194 au)! Trust radius: old=0.25, new=0.1
           7     0.004194     0.007814     0.002283     0.046027     0.012910         0.02
           8    -0.000811     0.002656     0.001077     0.054215     0.022201         0.02
           9    -0.000512     0.003491     0.001056     0.038985     0.012524         0.02
       -----------------------------------------------------------------------------------
          10    -0.003114     0.006746     0.001384     0.066427     0.019017         0.02
       Unexpected energy increase (0.002207 au)! Trust radius: old=0.2, new=0.1
          11     0.002207     0.003754     0.000945     0.029394     0.012864         0.02
       Unexpected energy increase (0.000547 au)! Trust radius: old=0.1, new=0.1
          12     0.000547     0.003048     0.000814     0.037817     0.012920         0.02
          13    -0.000121     0.002307     0.000780     0.047526     0.012897         0.02
          14    -0.000125     0.002052     0.000709     0.097302     0.026686         0.02
          15    -0.000547     0.002337     0.000502     0.091989     0.025495         0.02
          16    -0.000661     0.002088     0.000439     0.091971     0.025640         0.02
          17    -0.000267     0.001495     0.000536     0.096718     0.025820         0.02
       Unexpected energy increase (0.000685 au)! Trust radius: old=0.2, new=0.1
          18     0.000685     0.045257     0.009926     0.055442     0.012909         0.02
          19    -0.002688     0.021798     0.003942     0.055459     0.012919         0.02
       Unexpected energy increase (0.002269 au)! Trust radius: old=0.1, new=0.1
       -----------------------------------------------------------------------------------
          20     0.002269     0.005212     0.001000     0.044134     0.012881         0.02
          21    -0.000755     0.000827     0.000393     0.045115     0.012767         0.02
          22    -0.000202     0.001580     0.000425     0.045448     0.012801         0.02
          23    -0.000205     0.002629     0.000565     0.046364     0.013046         0.02
          24    -0.000217     0.002733     0.000620     0.046204     0.012788         0.02
          25    -0.000175     0.002225     0.000636     0.047212     0.012895         0.02
          26    -0.000222     0.001328     0.000580     0.047433     0.012845         0.02
          27    -0.000226     0.001398     0.000586     0.040439     0.012961         0.02
       Unexpected energy increase (0.000354 au)! Trust radius: old=0.1, new=0.1
          28     0.000354     0.003261     0.001228     0.042837     0.012915         0.02
          29    -0.000370     0.001378     0.000515     0.046827     0.012784         0.02
       -----------------------------------------------------------------------------------
          30    -0.000132     0.001411     0.000407     0.051432     0.012808         0.02
          31    -0.000237     0.001372     0.000354     0.052238     0.012887         0.02
          32    -0.000229     0.001312     0.000315     0.054556     0.012901         0.02
          33    -0.000218     0.001138     0.000288     0.052642     0.012907         0.02
          34    -0.000200     0.001119     0.000275     0.054589     0.012907         0.02
          35    -0.000181     0.000905     0.000260     0.052623     0.012908         0.02
          36    -0.000283     0.000856     0.000216     0.053275     0.012907         0.02
          37    -0.000132     0.000658     0.000169     0.054544     0.012907         0.02
       Unexpected energy increase (0.000107 au)! Trust radius: old=0.1, new=0.1
          38     0.000107     0.000354     0.000136     0.050748     0.012886         0.02
       Converged!

Final summary:
                max(forces, internal): 0.000354 hartree/(bohr,rad)
                rms(forces, internal): 0.000136 hartree/(bohr,rad)
                max(forces,cartesian): 0.000646 hartree/bohr
                rms(forces,cartesian): 0.000185 hartree/bohr
                energy: -591.31773987 hartree

Note that there is also an energy difference of approx. 0.023 hartree. Is this expected?

zubatyuk commented 3 months ago

fixed

cvsik commented 3 months ago

Fix confirmed! Awesome, thanks for the great support on this issue!