cvsik closed this issue 3 months ago.
What is your pytorch version? Do you use CUDA? It works for me both on CPU and GPU.
pytorch 2.3.1 py3.11_cuda12.1_cudnn8.9.2_0 pytorch
pytorch-cluster 1.6.3 py311_torch_2.3.0_cu121 pyg
pytorch-cuda 12.1 ha16c6d3_5 pytorch
I'm using these versions from pip (followed the torch installation instructions from their site):
torch 2.3.1+cu118 pypi_0 pypi
torch-cluster 1.6.3+pt23cu118 pypi_0 pypi
torchaudio 2.3.1+cu118 pypi_0 pypi
torchvision 0.18.1+cu118 pypi_0 pypi
And yes, I'm running on CUDA, however, I get the same error on CPU. I'll investigate this further, maybe nuke my conda environment and start from scratch.
I get the same error with these package sources:
pytorch 2.3.1 py3.11_cuda12.1_cudnn8.9.2_0 pytorch
pytorch-cluster 1.6.3 py311_torch_2.3.0_cu121 pyg
pytorch-cuda 12.1 ha16c6d3_5 pytorch
pytorch-mutex 1.0 cuda pytorch
torchtriton 2.3.1 py311 pytorch
The only spot where torch.cdist occurs (in aimnet2calc) is in nblists_torch_pbc. As far as I know, the computation failing above is not using PBC, so I'm wondering why torch.cdist even occurs in the graph. I'll investigate further 😁
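One way to confirm whether torch.cdist really lands in the compiled graph is to scan the TorchScript code of the loaded model and its submodules. A minimal sketch; the Toy module below is a stand-in so the snippet is self-contained, while in practice you would load the jpt file with torch.jit.load instead:

```python
import torch

class Toy(torch.nn.Module):
    """Stand-in for the actual jpt model; it deliberately uses cdist."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cdist(x, x)

# For the real model file, replace the next line with e.g.:
# model = torch.jit.load("aimnet2_b973c_0.jpt", map_location="cpu")
model = torch.jit.script(Toy())

# Every ScriptModule exposes its TorchScript source via .code;
# searching it across all submodules reveals whether cdist is in the graph.
hits = [name for name, m in model.named_modules() if "cdist" in m.code]
print(f"torch.cdist found in {len(hits)} module(s): {hits}")
```

Running this against the downloaded jpt file should show exactly which submodule pulls in cdist.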
@cvsik I cannot reproduce your error yet. It works for me with pytorch 2.3 and 2.2. I'll investigate further.
Maybe another piece of information: Here's the SHA256 of the model I'm using:
4d93cacf3e8517c5ddd79e9dae1a0482cb722c3d6d6f86f2b2082abbb3c436e4 assets/aimnet2/aimnet2_b973c_0.jpt
Are you using the same model file? Could that be a possible source for the failure?
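In case it helps others compare, a small helper to compute that digest locally (the path in the comment is just the one quoted above):

```python
import hashlib

def sha256sum(path: str) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the digest quoted above:
# print(sha256sum("assets/aimnet2/aimnet2_b973c_0.jpt"))
```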
Just in case, can you delete this file? It will be re-downloaded on the first call. I need to add a function to clean up the model cache.
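Until such a helper exists, a minimal cleanup sketch; the cache directory shown in the usage comment is a guess, so point it at wherever aimnet2calc actually stores its downloads:

```python
import os
from pathlib import Path

def clear_model_cache(cache_dir: str) -> list:
    """Delete all cached .jpt model files in cache_dir; return what was removed."""
    removed = []
    for path in Path(cache_dir).glob("*.jpt"):
        path.unlink()
        removed.append(str(path))
    return removed

# Example (hypothetical cache location, adjust to your setup):
# clear_model_cache(os.path.expanduser("~/.cache/aimnet2"))
```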
Removed the file, but that didn't change anything :(
@cvsik to investigate, please do:
1. show the output of this command:
python -c "import torch; print(torch.__version__); print(torch.cuda.get_device_name())"
2. try with these jpt files (download and provide the full path in AIMNet2Calculator or in the model option in the pysis yaml):
https://drive.google.com/file/d/1zfVbAFl9hDtchgDcLfj5Bk-5j9JPhuZq/view?usp=sharing
https://drive.google.com/file/d/16yefZW8OqkRV7xeRS5Qhk4-YhpAz_11j/view?usp=sharing
> @cvsik to investigate, please do: show output of command
> python -c "import torch; print(torch.__version__); print(torch.cuda.get_device_name())"
On the GPU device, I get:
2.3.1+cu118
NVIDIA A100-SXM4-40GB
> try with these jpt files (download and provide full path in AIMNet2Calculator or in the model option in pysis yaml): https://drive.google.com/file/d/1zfVbAFl9hDtchgDcLfj5Bk-5j9JPhuZq/view?usp=sharing https://drive.google.com/file/d/16yefZW8OqkRV7xeRS5Qhk4-YhpAz_11j/view?usp=sharing
The file t_a.jpt fails to compute a Hessian on both CPU and GPU.
The file t_b.jpt computes a Hessian successfully on both CPU and GPU (though the geometry optimization with that model takes a couple more iterations).
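For anyone hitting this: an autograd Hessian needs every op in the graph to support double backward, which is exactly the pattern that can fail for some ops. A toy illustration of that pattern (the quadratic-plus-product energy below is a stand-in, not the actual model):

```python
import torch

def energy(coords: torch.Tensor) -> torch.Tensor:
    # Stand-in for a model's energy; a real jpt model would be called here.
    return (coords ** 2).sum() + coords.prod()

coords = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# torch.autograd.functional.hessian differentiates the gradient a second
# time, so any op without a double-backward implementation raises here.
H = torch.autograd.functional.hessian(energy, coords)
print(H)  # 2*I from the quadratic term, plus mixed partials of the product
```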
Updated the models in the aimnet-model-zoo repository. Clean your local cache of jpt files so they are re-downloaded on first use. Please confirm that the solution works.
Thanks! The new files seem to work.
However, I'm a bit worried about the optimization convergence now. With the old files, the optimization output was:
Spent 0.0 s preparing the first cycle.
cycle Δ(energy) max(|force|) rms(force) max(|step|) rms(step) s/cycle
0 nan 0.018289 0.004151 0.060453 0.020601 0.64
1 -0.001423 0.011384 0.001937 0.086822 0.025547 0.12
2 -0.000734 0.004380 0.001427 0.101807 0.031504 0.14
3 -0.000525 0.006071 0.001648 0.172282 0.054327 0.13
4 -0.000157 0.007260 0.001905 0.108513 0.026761 0.07
5 -0.000476 0.005253 0.001164 0.028623 0.011377 0.16
6 -0.000164 0.002196 0.000491 0.036787 0.011066 0.15
7 -0.000055 0.000669 0.000231 0.026117 0.010106 0.15
8 -0.000029 0.000477 0.000170 0.021280 0.007841 0.18
Converged!
Final summary:
max(forces, internal): 0.000477 hartree/(bohr,rad)
rms(forces, internal): 0.000170 hartree/(bohr,rad)
max(forces,cartesian): 0.001113 hartree/bohr
rms(forces,cartesian): 0.000316 hartree/bohr
energy: -591.29441587 hartree
and with the new files:
Spent 0.0 s preparing the first cycle.
cycle Δ(energy) max(|force|) rms(force) max(|step|) rms(step) s/cycle
0 nan 0.017750 0.004166 0.086080 0.024756 1.02
1 -0.001863 0.011136 0.002021 0.122529 0.038297 0.05
2 -0.000282 0.017567 0.003144 0.042002 0.012152 0.02
3 -0.000923 0.004899 0.001212 0.069663 0.016265 0.02
4 -0.000374 0.006398 0.001196 0.070393 0.016137 0.02
5 -0.000625 0.001789 0.000775 0.118025 0.038432 0.02
6 -0.003792 0.020059 0.003927 0.107212 0.032400 0.02
Unexpected energy increase (0.004194 au)! Trust radius: old=0.25, new=0.1
7 0.004194 0.007814 0.002283 0.046027 0.012910 0.02
8 -0.000811 0.002656 0.001077 0.054215 0.022201 0.02
9 -0.000512 0.003491 0.001056 0.038985 0.012524 0.02
-----------------------------------------------------------------------------------
10 -0.003114 0.006746 0.001384 0.066427 0.019017 0.02
Unexpected energy increase (0.002207 au)! Trust radius: old=0.2, new=0.1
11 0.002207 0.003754 0.000945 0.029394 0.012864 0.02
Unexpected energy increase (0.000547 au)! Trust radius: old=0.1, new=0.1
12 0.000547 0.003048 0.000814 0.037817 0.012920 0.02
13 -0.000121 0.002307 0.000780 0.047526 0.012897 0.02
14 -0.000125 0.002052 0.000709 0.097302 0.026686 0.02
15 -0.000547 0.002337 0.000502 0.091989 0.025495 0.02
16 -0.000661 0.002088 0.000439 0.091971 0.025640 0.02
17 -0.000267 0.001495 0.000536 0.096718 0.025820 0.02
Unexpected energy increase (0.000685 au)! Trust radius: old=0.2, new=0.1
18 0.000685 0.045257 0.009926 0.055442 0.012909 0.02
19 -0.002688 0.021798 0.003942 0.055459 0.012919 0.02
Unexpected energy increase (0.002269 au)! Trust radius: old=0.1, new=0.1
-----------------------------------------------------------------------------------
20 0.002269 0.005212 0.001000 0.044134 0.012881 0.02
21 -0.000755 0.000827 0.000393 0.045115 0.012767 0.02
22 -0.000202 0.001580 0.000425 0.045448 0.012801 0.02
23 -0.000205 0.002629 0.000565 0.046364 0.013046 0.02
24 -0.000217 0.002733 0.000620 0.046204 0.012788 0.02
25 -0.000175 0.002225 0.000636 0.047212 0.012895 0.02
26 -0.000222 0.001328 0.000580 0.047433 0.012845 0.02
27 -0.000226 0.001398 0.000586 0.040439 0.012961 0.02
Unexpected energy increase (0.000354 au)! Trust radius: old=0.1, new=0.1
28 0.000354 0.003261 0.001228 0.042837 0.012915 0.02
29 -0.000370 0.001378 0.000515 0.046827 0.012784 0.02
-----------------------------------------------------------------------------------
30 -0.000132 0.001411 0.000407 0.051432 0.012808 0.02
31 -0.000237 0.001372 0.000354 0.052238 0.012887 0.02
32 -0.000229 0.001312 0.000315 0.054556 0.012901 0.02
33 -0.000218 0.001138 0.000288 0.052642 0.012907 0.02
34 -0.000200 0.001119 0.000275 0.054589 0.012907 0.02
35 -0.000181 0.000905 0.000260 0.052623 0.012908 0.02
36 -0.000283 0.000856 0.000216 0.053275 0.012907 0.02
37 -0.000132 0.000658 0.000169 0.054544 0.012907 0.02
Unexpected energy increase (0.000107 au)! Trust radius: old=0.1, new=0.1
38 0.000107 0.000354 0.000136 0.050748 0.012886 0.02
Converged!
Final summary:
max(forces, internal): 0.000354 hartree/(bohr,rad)
rms(forces, internal): 0.000136 hartree/(bohr,rad)
max(forces,cartesian): 0.000646 hartree/bohr
rms(forces,cartesian): 0.000185 hartree/bohr
energy: -591.31773987 hartree
Note that there is also an energy difference of approx. 0.023 hartree. Is this expected?
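For scale, the shift between the two final energies converts to roughly 14.6 kcal/mol using the standard 1 hartree ≈ 627.509 kcal/mol factor:

```python
HARTREE_TO_KCAL_MOL = 627.509  # standard hartree-to-kcal/mol conversion factor

# Difference between the two final energies reported above.
delta_e_hartree = 591.31773987 - 591.29441587
print(f"{delta_e_hartree:.5f} Ha = {delta_e_hartree * HARTREE_TO_KCAL_MOL:.2f} kcal/mol")
```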
fixed
Fix confirmed! Awesome, thanks for the great support on this issue!
Hi again,
are there any special things to consider when requesting a Hessian? I tried this with the mol_single_opt.yml example, requesting a Hessian computation after the optimization. After the optimization is done, I get the following traceback:
Any idea how to get around the missing derivative implementation in pytorch? Thanks!
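One generic workaround when an op lacks a second-derivative implementation is to build the Hessian by finite differences of the analytic gradients, which only needs single backward passes. A sketch under that assumption; num_hessian and the toy energy are illustrative, not part of the aimnet2calc API:

```python
import torch

def num_hessian(energy_fn, coords: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Central-difference Hessian built from first derivatives only.

    Only single backward passes are needed, so ops without a
    double-backward implementation still work.
    """
    flat = coords.detach().reshape(-1)
    n = flat.numel()
    H = torch.zeros(n, n, dtype=coords.dtype)

    def grad_at(x_flat: torch.Tensor) -> torch.Tensor:
        x = x_flat.clone().detach().requires_grad_(True)
        e = energy_fn(x)
        (g,) = torch.autograd.grad(e, x)
        return g.detach()

    for i in range(n):
        step = torch.zeros_like(flat)
        step[i] = eps
        H[i] = (grad_at(flat + step) - grad_at(flat - step)) / (2 * eps)
    return 0.5 * (H + H.T)  # symmetrize to reduce numerical noise

# Toy check with a quadratic energy, whose exact Hessian is 2*I:
# num_hessian(lambda x: (x ** 2).sum(), torch.tensor([1.0, 2.0]))
```

The trade-off is 2N extra gradient evaluations and some finite-difference noise, which is often acceptable for frequency analyses when the analytic second derivative is unavailable.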