grimme-lab / xtb

Semiempirical Extended Tight-Binding Program Package
https://xtb-docs.readthedocs.io/
GNU Lesser General Public License v3.0

NaN results in H2 geometry optimisation #621

Open · joegilkes opened 2 years ago

joegilkes commented 2 years ago

Describe the bug I'm writing some code that's doing automated exploration of chemical reaction space, using GFN2-xTB to optimise individual product molecules. On occasion, this involves running geometry optimisations on H2. Usually this is fine, but on maybe 1% of these occasions, xTB starts throwing NaN values all throughout it's output. I've not observed this with any other molecules running with the same settings, only H2, and the vast majority of the time the calculation runs fine.

The issue appears to stem from the exact geometry used, as a fresh H2 geometry with a bond length of 0.7 Ang will work normally while a specific geometry that has failed before will continue to fail on reruns. Furthermore, despite outputting mostly NaNs, the faulty calculation believes it has run successfully, such that it doesn't trigger any errors in my code until it's too late.
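
Since the run claims success, the energy value itself is the only red flag. A small guard along these lines would catch the NaNs before they propagate (a sketch only; check_xtb_energy is an illustrative helper, and the parsing assumes the usual xtbopt.xyz comment-line layout of energy: <value> gnorm: <value> ...):

import math

def check_xtb_energy(path='xtbopt.xyz'):
    """Read the final energy from an xtb-optimised xyz and raise on NaN."""
    with open(path) as f:
        f.readline()                   # atom count
        fields = f.readline().split()  # e.g. ['energy:', '-1.03...', 'gnorm:', ...]
    energy = float(fields[1])          # float('NaN') parses silently, so test explicitly
    if math.isnan(energy):
        raise ValueError(f'xtb produced a NaN energy in {path}')
    return energy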

To Reproduce Steps to reproduce the behaviour:

  1. happens with input xtbin.xyz (doesn't happen with input h2.xyz)
  2. run xtb --iterations 1000 --opt normal --grad xtbin.xyz
  3. observe output past the Final Singlepoint stage for point of failure

Inputs Sorry for not uploading these as attachments; GitHub won't accept .xyz files.

xtbin.xyz:

    2
  0.000000000000000E+000
H      -2.76516815      1.55822385    -15.39243995
H      -2.75749951      0.78983182    -15.27941415

h2.xyz:

2

H 0.7 0.0 0.0
H 0.0 0.0 0.0

Output See xtb.log past the Final Singlepoint stage.

Expected behaviour The calculation is expected to run as it does when the same command is run with h2.xyz (a rough geometry created by hand) instead of xtbin.xyz (a geometry created by code).

awvwgk commented 2 years ago

Cannot reproduce this behavior with xtb 6.4.1 release version (Intel 18, Linux, x86_64) or current head xtb 76c6af2 (Intel 2022 or GCC 11.2, Linux, x86_64).

joegilkes commented 2 years ago

Strange. My initial result was with my own compilation of xtb 6.4.1 on Intel 2019, so I downloaded the precompiled 6.4.1 release to give that a go, and it also works fine. Switching back to my own version, that now also runs both examples correctly (as opposed to yesterday, when I could run the calculation on xtbin.xyz on repeat and it would always exhibit the failing behaviour).

I've had multiple occurrences of this issue on my version now. I'll try running my code with the precompiled 6.4.1 binary and see if the issue occurs with that.

joegilkes commented 2 years ago

This is still occurring randomly with the precompiled release of xtb 6.4.1, although the error behaves slightly differently from my personally compiled version. xtb encounters the NaNs at the same point in the code (with completely different H2 geometries), but now actually fails soon after that point instead of continuing as if there were no issue.

Instead of looping through all available iterations in the final singlepoint calculation, xtb now consistently stops after 17 iterations (again, across different geometries) with the following error:

*** Error in `xtb': double free or corruption (!prev): 0x000000000781d130 ***
======= Backtrace: =========
[0x31feb2b]
[0x3203c72]
[0x320495d]
[0x2ff6af7]
[0xbb191c]
[0xbaccce]
[0x8c13ce]
[0x8e12d2]
[0x61c1c8]
[0x9235b9]
[0x71c02c]
[0x40c90d]
[0x42295b]
[0x4020ce]
[0x31d97c0]
[0x401fb7]
======= Memory map: ========
00400000-03689000 r-xp 00000000 00:84 81137997                           /home/chem/msupxs/Software/xtb-6.4.1/bin/xtb
03889000-03abc000 rw-p 03289000 00:84 81137997                           /home/chem/msupxs/Software/xtb-6.4.1/bin/xtb
03abc000-06a85000 rw-p 00000000 00:00 0 
07784000-07fdd000 rw-p 00000000 00:00 0                                  [heap]
2ba215cab000-2ba215efb000 rw-p 00000000 00:00 0 
2ba215efb000-2ba215efc000 ---p 00000000 00:00 0 
2ba215efc000-2ba2162fc000 rw-p 00000000 00:00 0 
2ba2162fc000-2ba2162fd000 ---p 00000000 00:00 0 
2ba2162fd000-2ba2166fd000 rw-p 00000000 00:00 0 
2ba2166fd000-2ba2166fe000 ---p 00000000 00:00 0 
2ba2166fe000-2ba217466000 rw-p 00000000 00:00 0 
2ba218000000-2ba218021000 rw-p 00000000 00:00 0 
2ba218021000-2ba21c000000 ---p 00000000 00:00 0 
2ba21c000000-2ba21c021000 rw-p 00000000 00:00 0 
2ba21c021000-2ba220000000 ---p 00000000 00:00 0 
2ba220000000-2ba220021000 rw-p 00000000 00:00 0 
2ba220021000-2ba224000000 ---p 00000000 00:00 0 
2ba224000000-2ba224037000 rw-p 00000000 00:00 0 
2ba224037000-2ba228000000 ---p 00000000 00:00 0 
7ffe1703e000-7ffe17070000 rw-p 00000000 00:00 0                          [stack]
7ffe171e2000-7ffe171e4000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Aborted

This calculation is run on the following geometry with the command xtb --iterations 1000 --opt normal --grad xtbin.xyz, and produces this log.

xtbin.xyz:

    2
  0.000000000000000E+000
H      11.81627003      5.56469934      3.50891916
H      11.47419374      6.19141815      3.81465917

joegilkes commented 2 years ago

I've also observed this error with slightly different behaviour, where the calculation fully freezes at the same point in the xTB output (iteration 17 of the final singlepoint calculation) and has to be manually terminated, resulting in the following:

*** Error in `xtb': munmap_chunk(): invalid pointer: 0x00000000033f17c8 ***
======= Backtrace: =========
[0x31feb2b]
[0x3203c72]
[0x2ff5fdb]
[0x2fa9fa3]
[0x2fbbf51]
[0x2fb83f4]
[0x3008d01]
[0x557d27]
[0x31d8320]
[0x31d715d]
[0x313f5a7]
[0x318e819]
[0x31919fa]
[0x30ebe63]
[0x313b639]
[0x31d4b14]
[0x3243b79]
======= Memory map: ========
00400000-03689000 r-xp 00000000 00:84 81137997                           /home/chem/msupxs/Software/xtb-6.4.1/bin/xtb
03889000-03abc000 rw-p 03289000 00:84 81137997                           /home/chem/msupxs/Software/xtb-6.4.1/bin/xtb
03abc000-06a85000 rw-p 00000000 00:00 0 
08593000-08dec000 rw-p 00000000 00:00 0                                  [heap]
2b2d44231000-2b2d44481000 rw-p 00000000 00:00 0 
2b2d44481000-2b2d44482000 ---p 00000000 00:00 0 
2b2d44482000-2b2d44882000 rw-p 00000000 00:00 0 
2b2d44882000-2b2d44883000 ---p 00000000 00:00 0 
2b2d44883000-2b2d44c83000 rw-p 00000000 00:00 0 
2b2d44c83000-2b2d44c84000 ---p 00000000 00:00 0 
2b2d44c84000-2b2d459ec000 rw-p 00000000 00:00 0 
2b2d48000000-2b2d48032000 rw-p 00000000 00:00 0 
2b2d48032000-2b2d4c000000 ---p 00000000 00:00 0 
2b2d50000000-2b2d50021000 rw-p 00000000 00:00 0 
2b2d50021000-2b2d54000000 ---p 00000000 00:00 0 
2b2d54000000-2b2d54021000 rw-p 00000000 00:00 0 
2b2d54021000-2b2d58000000 ---p 00000000 00:00 0 
7ffd39a63000-7ffd39a95000 rw-p 00000000 00:00 0                          [stack]
7ffd39ac8000-7ffd39aca000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

TyBalduf commented 2 years ago

Is this still an issue with the latest version of xTB? If so, we may need more information on the hardware you are running on to see if we can reproduce this error.

joegilkes commented 2 years ago

Just tested with Conda xTB 6.5.0 and 6.5.1, and with the standalone 6.5.1 release from GitHub. The issue persists with both Conda versions, although it took new H2 geometries to trigger it, as the ones I had from my last round of testing now optimise completely fine. The standalone version runs the same new geometry without problems, but since the geometries that trigger this seem fairly random, the bug might still occur given a different geometry.

I'm going to write up a quick Python script to generate new geometries and keep running xTB to try and replicate the conditions that the geometries are in as closely as possible without having to run the whole project that this is occurring in. I'll update with results soon.

joegilkes commented 2 years ago

Also worth mentioning that Conda xTB 6.5.1 sends the following to stderr when the usual NaN energy behaviour from the original comment happens:

normal termination of xtb
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_DENORMAL
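
Since that warning only appears on stderr, an automated workflow has to capture stderr separately to see it. A minimal sketch of such a check (the IEEE flag strings are taken verbatim from the message above; everything else is illustrative):

import subprocess

def run_xtb_checked(xyz='xtbin.xyz'):
    """Run an xtb optimisation and flag the floating-point exception warning."""
    proc = subprocess.run(['xtb', xyz, '--opt'],
                          capture_output=True, text=True)
    if 'IEEE_INVALID_FLAG' in proc.stderr or 'IEEE_DIVIDE_BY_ZERO' in proc.stderr:
        print('xtb signalled floating-point exceptions; treat this result as suspect')
    return proc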

joegilkes commented 2 years ago

After generating thousands of H2 geometry files that match the erroring ones as closely as possible (down to the exact whitespace placement and significant figures in the xyz file, and the initial H-H bond length used in the latest erroring geometry), I haven't been able to reproduce this bug once with the Python script (running on the same hardware as before). At this point I'm at a complete loss as to what causes this in the first place, since I don't see how it can be related to the geometry file anymore.

Script I'm using to attempt to find faulty geometries:

import numpy as np
import random
import subprocess
import os

# H-H bond length (Ang), taken from the latest faulty geometry
bl = 0.7766010889417894

if not os.path.exists('calc'):
    os.mkdir('calc')
# Files left behind by a previous xtb run, removed before each trial
rmfiles = ['.xtboptok', 'charges', 'wbo', 'xtb.out', 'xtbin.xyz', 'xtbopt.log', 'xtbopt.xyz', 'xtbrestart', 'xtbtopo.mol']
os.chdir('calc')

loop = True
counter = 0
while loop:

    counter += 1
    for rmfile in rmfiles:
        if os.path.isfile(rmfile):
            os.remove(rmfile)

    # Place two H atoms at random points in a 30 Ang box...
    p1 = np.zeros(3)
    p2 = np.zeros(3)
    for i in range(3):
        p1[i] = random.uniform(-15.0, 15.0)
        p2[i] = random.uniform(-15.0, 15.0)

    # ...then move the second atom along the H-H axis onto the target bond length
    uvec = (p2 - p1)/np.linalg.norm(p2 - p1)
    p2 = p1 + uvec * bl

    # Write the xyz in the exact format (whitespace and sig figs)
    # that my Fortran workflow produces
    with open('xtbin.xyz', 'w') as f:
        f.write('    2\n')
        f.write('  0.000000000000000E+000\n')
        f.write(f'H      {p1[0]:11.8}      {p1[1]:11.8}      {p1[2]:11.8}\n')
        f.write(f'H      {p2[0]:11.8}      {p2[1]:11.8}      {p2[2]:11.8}\n')

    with open('xtb.out', 'w') as f:
        subprocess.run(['xtb', 'xtbin.xyz', '--opt'], stdout=f)

    # The energy sits in the second field of the optimised xyz's comment line
    with open('xtbopt.xyz', 'r') as f:
        lines = f.readlines()
        cline = lines[1].split()
        if cline[1] == 'NaN':
            print('Energy: NaN')
            print(f'Found after {counter} trials.')
            loop = False
        else:
            print(f'Energy: {cline[1]}')

joegilkes commented 2 years ago

Turns out I was being too impatient: the above script actually does allow me to reproduce the bug on completely different hardware! It does take quite a few trials of randomised H2 xyzs to find faulty geometries (an average of 2900 trials over 10 runs for me, so usually 2-3 minutes on my laptop), but it manages to find a faulty H2 geometry every time.

To clarify what the script is doing: it takes the H-H bond length from the latest faulty H2 geometry generated in my actual workflow, then iteratively creates random H2 xyzs sharing this bond length, writes them out in the exact format my workflow's Fortran code uses, runs xTB, and checks the resulting optimised xyz for a NaN energy. If it finds one, the script stops iterating and reports the number of trials, leaving all the xTB run/log files in the calc directory. If not, it deletes these files and continues looping.

Since I can get this to happen on other hardware, hopefully it should also be reproducible on your end now.

awvwgk commented 2 years ago

Seems like an issue with the history of the wavefunction, which makes it very hard to reproduce. Try taking the final structure and running xtb xtbopt.xyz --norestart; the iterations will converge. Why the RF optimizer can produce a noisy wavefunction that explodes in the Broyden mixer is another question. Switching to LBFGS seems to fix the issue by avoiding unnecessary repeated calculations and restarts from the same wavefunction.
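
Scripted, both remedies might look like the sketch below (assuming the --norestart flag and the $opt engine=lbfgs detailed-input key behave as described in the xtb documentation; the helper names are illustrative):

import subprocess

# Remedy 1: rerun the final structure without the restart file,
# discarding the accumulated wavefunction history.
def rerun_norestart():
    subprocess.run(['xtb', 'xtbopt.xyz', '--norestart'])

# Remedy 2: request the LBFGS optimizer via a detailed input file.
def optimise_with_lbfgs(xyz='xtbin.xyz'):
    with open('opt.inp', 'w') as f:
        f.write('$opt\n   engine=lbfgs\n$end\n')
    subprocess.run(['xtb', '--input', 'opt.inp', xyz, '--opt'])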

joegilkes commented 2 years ago

Thanks for confirming. Switching to LBFGS seems to mitigate this like you said. Are there any performance drawbacks to using LBFGS with xTB or does it perform just as well as the default?

awvwgk commented 2 years ago

The RF based optimizer is the default for historical reasons. My LBFGS implementation should be competitive and is already the default for all systems larger than 500 atoms, so I am confident it can be a robust default.

Changing defaults is always something that should be well thought through, but this is up to the project maintainers to decide.

joegilkes commented 1 year ago

Sorry to report that this issue does still appear to be occurring with LBFGS as well, albeit much less frequently, so it's even harder to diagnose.

TyBalduf commented 2 months ago

If I run an opt on H2 with ALPB, I see something very similar with XTB v6.5.1 and v6.7.0. The initial SCC will run fine, but in the first optimization cycle it shows almost immediate problems, with NaN for the energy until it runs out of iterations.
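
(For reproduction purposes this presumably corresponds to a command along the lines of xtb h2.xyz --opt --alpb water, with the solvent choice assumed, since it is not stated above.)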

See the attached h2.txt for the v6.7.0 output.