openmm / openmm

OpenMM is a toolkit for molecular simulation using high performance GPU code.

Error while running OpenMM #4580

Open mljchen102 opened 3 months ago

mljchen102 commented 3 months ago

In the middle of running an OpenMM simulation, the following error shows up. How do I resolve this issue?

Traceback (most recent call last):
  File "/home/michael/cadherin/preprod/openmm_NPT.py", line 91, in <module>
    simulation.step(mdsteps)
  File "/home/michael/anaconda3/envs/openmm8/lib/python3.11/site-packages/openmm/app/simulation.py", line 141, in step
    self._simulate(endStep=self.currentStep+steps)
  File "/home/michael/anaconda3/envs/openmm8/lib/python3.11/site-packages/openmm/app/simulation.py", line 241, in _simulate
    self._generate_reports(wrapped, True)
  File "/home/michael/anaconda3/envs/openmm8/lib/python3.11/site-packages/openmm/app/simulation.py", line 263, in _generate_reports
    reporter.report(self, state)
  File "/home/michael/anaconda3/envs/openmm8/lib/python3.11/site-packages/openmm/app/dcdreporter.py", line 107, in report
    self._dcd.writeModel(state.getPositions(), periodicBoxVectors=state.getPeriodicBoxVectors())
  File "/home/michael/anaconda3/envs/openmm8/lib/python3.11/site-packages/openmm/app/dcdfile.py", line 170, in writeModel
    data = array.array('f', (10*x[i] for x in positions))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/michael/anaconda3/envs/openmm8/lib/python3.11/site-packages/openmm/app/dcdfile.py", line 170, in <genexpr>
    data = array.array('f', (10*x[i] for x in positions))
                            ~^^^
TypeError: 'function' object is not subscriptable

peastman commented 3 months ago

Can you provide a script and input files to reproduce this?

mljchen102 commented 3 months ago

@peastman my file size is too big to upload. Can I somehow send them to you directly?

peastman commented 3 months ago

Can you post it online somewhere (google drive, dropbox, etc.) and post the link?

mljchen102 commented 3 months ago

https://uofi.box.com/s/3b48kjpn71luvzkena2fzzhwsjiy73ff

peastman commented 3 months ago

When I go to that url, it wants me to log in with a box account. Is there a way to get it without having an account?

mljchen102 commented 3 months ago

Try this instead https://drive.google.com/file/d/1BEclZjySODQyh1GgvJwtUwsf0QRlxpFD/view?usp=drive_link

peastman commented 3 months ago

Got it, thanks. I'll take a look.

peastman commented 3 months ago

When you create your Simulation you specify state=load_state_path. But the variable load_state_path is never defined, and you didn't include a saved state with the files. Perhaps you left it out by mistake?

mljchen102 commented 3 months ago

I had gotten the same error when load_state_path was removed from the script file

peastman commented 3 months ago

I can't load your checkpoint file. Checkpoints are specific to the exact hardware they were created on. You need to use a State instead.
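
The difference between the two restart formats can be sketched as follows. This is a minimal illustration, assuming `simulation` is an `openmm.app.Simulation`; the helper names and file names are hypothetical. `saveState`/`loadState` write and read a portable XML State, while `saveCheckpoint` writes a binary checkpoint that is tied to the machine that created it.

```python
def save_restart_files(simulation, state_path="state.xml", checkpoint_path="restart.chk"):
    """Write both restart formats (helper and file names are hypothetical)."""
    # XML State: portable, so it can be shared or moved to another machine.
    simulation.saveState(state_path)
    # Binary checkpoint: only intended to be reloaded on the same hardware
    # and software setup that created it.
    simulation.saveCheckpoint(checkpoint_path)


def resume_from_state(simulation, state_path="state.xml"):
    """Restore positions, velocities, box vectors and parameters from an XML State."""
    simulation.loadState(state_path)
```

When sharing input files with someone else, the XML file written by `saveState` is the one to include.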

mljchen102 commented 2 months ago

I updated the files in the folder, see the link https://drive.google.com/file/d/1cIA6dKL0-tHMe20OUMWXp7yrPP8-Qu6w/view?usp=sharing

Can you help me with the equil/Production file instead? That is more pressing for now. I think something is messed up with OpenMM on my computer, but I'm not sure what the issue is. I've tried removing the conda environment and doing a fresh OpenMM 8.0 install with Python versions 3.8, 3.9, and 3.11, but I keep getting the same issue. I then tried going back to my OpenMM 7.7 version and got the same issue. I then tried OpenMM 8.1.1, with no luck either.

peastman commented 2 months ago

It runs fine on my computer. I agree it sounds like a problem with your installation. Is it possible you have a second copy of OpenMM on your computer, and you're actually using a different version than you think you are? It also could be mixing two versions together. For example, the Python module from one version could be linking to the native libraries for a different version, or it could be loading the plugins from a different version.

Start by checking your LD_LIBRARY_PATH to make sure you aren't including anything that would pull in a different copy of OpenMM.

If that looks ok, launch the Python interpreter and try running the following commands.

import os
print(os.getenv('OPENMM_PLUGIN_DIR'))
import openmm
print(openmm.version.openmm_library_path)
print(openmm.__file__)

Is everything coming from the correct copy?
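
The LD_LIBRARY_PATH check mentioned above can also be scripted. The following is a rough sketch using only the standard library; the function name is hypothetical, and it assumes the OpenMM native libraries have file names starting with `libOpenMM` (as they do on a typical Linux install). It lists every directory on the search path that contains such files, so a second copy stands out.

```python
import glob
import os


def find_openmm_libraries(search_path):
    """Return {directory: [library files]} for each entry of search_path
    that contains files whose names start with libOpenMM."""
    hits = {}
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        matches = sorted(glob.glob(os.path.join(directory, "libOpenMM*")))
        if matches:
            hits[directory] = matches
    return hits


if __name__ == "__main__":
    found = find_openmm_libraries(os.environ.get("LD_LIBRARY_PATH", ""))
    for directory, libs in found.items():
        print(directory)
        for lib in libs:
            print("   ", os.path.basename(lib))
    if len(found) > 1:
        print("More than one directory on LD_LIBRARY_PATH contains OpenMM libraries.")
```

An empty result only means nothing on LD_LIBRARY_PATH shadows the conda copy; it does not rule out a mismatch elsewhere.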

mljchen102 commented 2 months ago

It appears to be coming from the correct copy. One other thing: I recently reverted my Linux kernel to 5.15, before making the conda environment changes I described previously. Are there files I will need to remake? I'm still getting the same error even when I start over from the beginning with my minimization/NVT.

peastman commented 2 months ago

It's hard to see how that would cause this error. Then again, it's hard to see how anything would cause this error. It looks as if positions contains function objects, which absolutely should not be the case, since it was just retrieved from a State object. That's why it seems like data is getting messed up between incompatible versions of libraries.

Try editing dcdfile.py in your conda environment (not generally a good idea, but we're debugging an already broken environment) to print out the value of positions just before the loop where the error happens. What does it print?

mljchen102 commented 2 months ago

Here is a brief output [Vec3(x=2.79349946975708, y=10.828460693359375, z=9.947171211242676), Vec3(x=2.6831870079040527, y=10.824323654174805, z=9.93463134765625), Vec3(x=2.813645124435425, y=10.837196350097656, z=10.056079864501953), Vec3(x=2.8468518257141113, y=10.738536834716797, z=9.909615516662598), Vec3(x=2.8609540462493896, y=10.945727348327637, z=9.875526428222656), Vec3(x=2.9767136573791504, y=10.93915843963623, z=9.844376564025879), Vec3(x=2.789220094680786, y=11.046767234802246, z=9.851170539855957), Vec3(x=2.695344924926758, y=11.033442497253418, z=9.88199234008789), Vec3(x=2.823190450668335, y=11.16512680053711, z=9.782607078552246), Vec3(x=2.9189534187316895, y=11.164416313171387, z=9.732677459716797), Vec3(x=2.7347543239593506, y=11.201864242553711, z=9.659412384033203), Vec3(x=2.7964799404144287, y=11.284221649169922, z=9.617575645446777), Vec3(x=2.6102473735809326, y=11.257064819335938, z=9.703519821166992), Vec3(x=2.580277681350708, y=11.189674377441406, z=9.764970779418945), Vec3(x=2.712392568588257, y=11.080963134765625, z=9.562405586242676), Vec3(x=2.6784112453460693, y=11.105581283569336, z=9.459534645080566), Vec3(x=2.6294076442718506, y=11.016962051391602, z=9.599291801452637), Vec3(x=2.8063066005706787, y=11.021622657775879, z=9.560938835144043), Vec3(x=2.8174784183502197, y=11.276100158691406, z=9.896248817443848), ...

peastman commented 2 months ago

That looks normal at least. This line

data = array.array('f', (10*x[i] for x in positions))

is throwing the exception

TypeError: 'function' object is not subscriptable

Since x is the only thing that gets subscripted, I'm assuming that's the cause of the error. But maybe it's something inside array(), and for some reason it's getting omitted from the stack trace? Let's just print out everything we can to figure out where it's really coming from.

print((x for x in positions))
print((x[0] for x in positions))
print((10*x[0] for x in positions))
print(array.array('f', (10*x[i] for x in positions)))

One of those lines should throw an exception.
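
One caveat: printing a generator expression only prints the generator object and does not evaluate its elements, so the first three lines may not raise even if an element is bad. A minimal sketch of an eager version follows; the function name is hypothetical, and `positions` and `i` are assumed to be the variables already in scope inside dcdfile.py. It subscripts every element in turn and reports the first one that fails.

```python
def first_bad_element(items, index):
    """Subscript every item eagerly.

    Returns (position, item, exception) for the first item where
    item[index] raises, or None if every item can be subscripted.
    """
    for n, item in enumerate(items):
        try:
            item[index]
        except Exception as exc:
            return n, item, exc
    return None
```

Inside dcdfile.py this could be called as `print(first_bad_element(positions, i))` just before the failing line; a non-None result would show which element is not a Vec3 and what type it actually is.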

mljchen102 commented 2 months ago

This somehow started working again so I'm going to close it for now, thanks again.

mljchen102 commented 2 months ago

So I thought it was fixed, but it doesn't look like it. I added the commands mentioned above and still got the same error:

    data = array.array('f', (10*x[i] for x in positions))
                            ~^^^
TypeError: 'function' object is not subscriptable

peastman commented 2 months ago

Do you mean that none of those lines threw an exception? Even

print(array.array('f', (10*x[i] for x in positions)))

does not throw an exception, but

data = array.array('f', (10*x[i] for x in positions))

does? If so I'm totally confused about what could possibly be happening.

mljchen102 commented 2 months ago

Separate question. I'm trying out OpenMM version 7.7 with Python version 3.9 to see if it changes anything. I'm getting "segmentation fault (core dumped)" errors at random while the simulation is running. Is there a way to fix it?

I've also tried reinstalling my cuda driver, removing all openmm environments and reinstalling a new environment but these haven't worked either. I don't know if I should reinstall Linux or if there is anything else to try.

peastman commented 2 months ago

Segmentation fault means something is very messed up somewhere deep down. It generally indicates things like memory corruption, incompatible libraries, etc. :(

I really have no idea what's going on. You're seeing all these weird errors that haven't been reported by anyone else. Do you have any other problems like this on your computer?

mljchen102 commented 2 months ago

I think I'm only seeing this on my Linux computer. When I ran on another person's computer in their conda environment, everything worked fine. I did try copying this other person's conda environment to my machine, but it didn't work. I'm also very confused about what's going on. These errors pop up randomly during the simulation, sometimes a third of the way through or halfway through.

peastman commented 2 months ago

It's entirely possible this is a hardware problem. Random errors that appear midway through an intensive computation are common effects of overclocking, insufficient cooling, insufficient power supply, bad memory chips, and things like that. But I'm really just guessing wildly here.

mljchen102 commented 2 months ago

What is your advice for what I should try next?

peastman commented 2 months ago

Do you see similar problems when doing other things on the same computer, especially computationally intensive tasks? Or are the problems limited to OpenMM?

mljchen102 commented 2 months ago

Right now I'm only seeing these issues show up when running OpenMM. I haven't seen any problems when running other computationally intensive tasks

peastman commented 2 months ago

Can you describe your hardware? Is it a laptop, desktop, workstation, rack mounted server? What model of CPU and GPU do you have? (The exact model number, not just a branding name like "core i7".) What size power supply do you have? Is the computer overclocked? Can you monitor the CPU temperature while the simulation runs and see if it's staying in the recommended range?
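
For the temperature question, one possible way to take readings on Linux is sketched below. It assumes the kernel exposes thermal zones under `/sys/class/thermal`, where each `thermal_zone*/temp` file holds a value in millidegrees Celsius; the function name is hypothetical, and which zone corresponds to the CPU package varies between machines.

```python
import glob
import os


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"thermal_zoneN:type": temperature in degrees C} for each readable zone."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as f:
                name = f.read().strip()
            with open(os.path.join(zone, "temp")) as f:
                millidegrees = int(f.read().strip())
        except (OSError, ValueError):
            continue  # zone is unreadable or reports no valid value
        readings[f"{os.path.basename(zone)}:{name}"] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for zone, celsius in read_thermal_zones().items():
        print(f"{zone}: {celsius:.1f} C")
```

Running this every few seconds in a second terminal while the simulation is active gives a rough picture of how hot the machine gets under load.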

mljchen102 commented 2 months ago

I'm running on a desktop. The CPU is an Intel Core i9-14900KF (24 cores, 2 threads/core). The GPU is an Nvidia GeForce RTX 4090. The power supply is 1000W. I don't think it is overclocked, but how should I check? Yes, I can monitor the temperature.

One thing I noticed (not sure how much it impacts the issue): when I ran nvcc --version and compared it to the cudatoolkit version installed in the conda environment, the two were slightly different (11.5 vs 11.8).

peastman commented 2 months ago

Just checking in on this. Did you try monitoring the CPU temperature? What did you find?

It may be significant that your CPU is one of the models affected by a microcode bug that causes crashes and permanent damage. It's entirely possible that's the cause of your problems. If so, Intel will replace your CPU.

mljchen102 commented 2 months ago

I saw temps get up to 90-100C range quite a few times.

peastman commented 2 months ago

I think my advice would be to wait for Intel to release the patch for the bug. Until then, don't run any intensive computations on that computer to avoid risking damage to it.