The part about the extension is just a warning. You can ignore it.
Regarding the error, this looks relevant: https://community.khronos.org/t/errorirreducible-controlflow-detected/1986. Is that all we have in the log? Unfortunately, it doesn't give any indication about what kernel is causing the problem.
It's likely one of the Custom forces since the next WU did not have any of those and ran successfully.
Would we need to try to build and run a system with one Force at a time in order to debug? Is there a way to step through compiling kernel by kernel?
If we can add debugging code to the core, we could just have it print out the source of each kernel before compiling it.
We've seen the "Irreducible ControlFlow Detected" message before, though not frequently enough to identify a pattern. Is there a reasonable way to add in-line diagnostic information to that particular error? That just isn't enough information to answer John's question.
I searched through all the .cl and .cc files for `for` loops based on that thread and found one instance where the starting condition is specified outside the loop. It should be obvious to the compiler, but it is stylistically inconsistent with the rest of the code base.
https://github.com/openmm/openmm/blob/fce26088352b1d5e650b93a8d1e5a57587ae1f64/platforms/common/src/kernels/verlet.cc#L57-L58
I don't know what this `while` loop does, but it looks suspicious since `tbx` is unchanged.
https://github.com/openmm/openmm/blob/fce26088352b1d5e650b93a8d1e5a57587ae1f64/platforms/common/src/kernels/gbsaObc.cc#L206-L220
customGBValueN2.cc and nonbonded.cl have similar loops. The CPU version looks more likely to break out of the loop. https://github.com/openmm/openmm/blob/fce26088352b1d5e650b93a8d1e5a57587ae1f64/platforms/common/src/kernels/gbsaObc_cpu.cc#L212-L221
That loop is scanning through the `exclusionTiles` array to find a particular index. The exit condition isn't based on `tbx` changing. It's based on the values of the latest data that got loaded into `skipTiles`.
Did the system with the error involve implicit solvent? What integrator did it use? That will tell us whether the above code could be related.
This was FAH project 13438, for the COVID Moonshot, which involves a hybrid alchemical system with a good number of `Custom*Force` terms and `NonbondedForce` perturbation groups.
I've attached serialized XML files of the RUN that failed if that is of interest.
It doesn't have a GBSAOBCForce, and it uses a CustomIntegrator instead of a VerletIntegrator. So none of the loops mentioned above is involved.
I can't reproduce this on an AMD Navi GPU. The following script runs without problems.
from simtk.openmm import *
# Rebuild the failed RUN from the serialized XML files attached above
system = XmlSerializer.deserialize(open('system.xml').read())
integrator = XmlSerializer.deserialize(open('integrator.xml').read())
state = XmlSerializer.deserialize(open('state.xml').read())
# Creating the OpenCL Context and requesting forces triggers kernel compilation
context = Context(system, integrator, Platform.getPlatformByName('OpenCL'))
context.setState(state)
context.getState(getForces=True)
It's a different GPU of course, and also a different OS (Ubuntu 20.04). Cape Verde is a pretty old GPU, released in 2012.
Thanks for trying this out!
Is there any instrumentation we can add to the core to bring back more information?
Failing that, we will keep trying to find someone experiencing this issue.
It's unclear to me whether Cape Verde refers to the first release in 2012 or the architecture, which has been in production for many years.
Hm, I might be misreading the info about which GPUs featured Cape Verde: https://www.techpowerup.com/gpu-specs/amd-cape-verde.g100
Cape Verde was a specific GPU. It was based on the GCN 1.0 architecture.
If you can find someone who is experiencing the problem, we definitely could create an instrumented core that would provide more information.
I have a Cape Verde card and am currently experiencing this issue on Windows.
I wasn't getting it last week. I think my last GPU work unit completed Friday night, and nothing has changed on my end since then as far as I know. I have both Windows updates and my graphics driver updates configured to notify me of available updates but not install anything automatically, so I feel confident that neither has been updated.
I'd be happy to run a modified version of FAH to gather more information about this issue. Given that this started without any software changes, it's possible the issue is related to the work units being issued by the server, or something else that could change and 'fix' itself on its own, so we'll have to cross our fingers that it keeps happening long enough to test.
Great! @jchodera are you set up for building cores? Here are the lines where it compiles kernels:
Immediately before those lines, add the line
cout<<src.str()<<endl;
That will make it print the source for each kernel to the console (which I believe gets redirected to one of the logs?) just before attempting to compile it. Then we can see which kernel it was attempting to compile when it failed.
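For reference, here's a rough sketch of what that instrumented spot might look like. The surrounding code is paraphrased rather than copied from the linked lines, and it assumes (as the suggestion above does) that the assembled kernel source is held in a stringstream named `src` immediately before the compile call:

```cpp
#include <iostream>

// ... inside OpenCLContext::createProgram(), once the full kernel source
// has been assembled into the stringstream `src`:
std::cout << src.str() << std::endl;        // dump the kernel source before compiling
cl::Program program(context, src.str());    // existing compile call (paraphrased)
// If the subsequent build step fails, the last source block printed to the
// log identifies the offending kernel.
```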
I haven't done any sort of development for the project, so I'm not sure if I'm set up for building cores. Based on the log messages, it seems like the Folding@home client may be doing the building for me. If you can point me to the common location for the core code, I'd be happy to modify it and see what I get in my logs.
I just realized that I have the file path from the logs, so that solves that. However, that's a temp directory and it no longer exists for me. It seems like the Folding@home client downloads the kernel code and cleans it up rather quickly, so I'm not sure of the best way to jump in and intervene. Any suggestions?
Apologies we haven't been able to make progress on this yet. We're still working on automating core22 builds with @dotsdl but hope to have something soon we can use to help debug this.
This seems to be specific to `Custom*Force`s, since it's only appearing with my COVID Moonshot alchemical free energy calculations.
This issue may be related?
I am still seeing some of these errors on the Radeon HD 7770 under Windows 10 on project 13446.
22:20:16:WU00:FS00:0x22:*************************** Core22 Folding@home Core ***************************
22:20:16:WU00:FS00:0x22: Core: Core22
22:20:16:WU00:FS00:0x22: Type: 0x22
22:20:16:WU00:FS00:0x22: Version: 0.0.13
(....)
22:20:16:WU00:FS00:0x22:************************************ OpenMM ************************************
22:20:16:WU00:FS00:0x22: Revision: 189320d0
22:20:16:WU00:FS00:0x22:********************************************************************************
22:20:16:WU00:FS00:0x22:Project: 13446 (Run 6351, Clone 17, Gen 0)
22:20:16:WU00:FS00:0x22:Unit: 0x00000000000000000000000000000000
22:20:16:WU00:FS00:0x22:Reading tar file core.xml
22:20:16:WU00:FS00:0x22:Reading tar file integrator.xml.bz2
22:20:16:WU00:FS00:0x22:Reading tar file state.xml.bz2
22:20:16:WU00:FS00:0x22:Reading tar file system.xml.bz2
22:20:16:WU00:FS00:0x22:Digital signatures verified
22:20:16:WU00:FS00:0x22:Folding@home GPU Core22 Folding@home Core
22:20:16:WU00:FS00:0x22:Version 0.0.13
22:20:17:WU00:FS00:0x22: Checkpoint write interval: 50000 steps (5%) [20 total]
22:20:17:WU00:FS00:0x22: JSON viewer frame write interval: 10000 steps (1%) [100 total]
22:20:17:WU00:FS00:0x22: XTC frame write interval: 250000 steps (25%) [4 total]
22:20:17:WU00:FS00:0x22: Global context and integrator variables write interval: 25000 steps (2.5%) [40 total]
22:20:17:WU00:FS00:0x22:There are 3 platforms available.
22:20:17:WU00:FS00:0x22:Platform 0: Reference
22:20:17:WU00:FS00:0x22:Platform 1: CPU
22:20:17:WU00:FS00:0x22:Platform 2: OpenCL
22:20:17:WU00:FS00:0x22: opencl-device 0 specified
22:20:34:WU00:FS00:0x22:Attempting to create OpenCL context:
22:20:34:WU00:FS00:0x22: Configuring platform OpenCL
22:20:42:WU00:FS00:0x22:Failed to create OpenCL context:
22:20:42:WU00:FS00:0x22:Error compiling kernel: "C:\Users\admin\AppData\Local\Temp\OCL6916T24.cl", line 21: warning: OpenCL
22:20:42:WU00:FS00:0x22: extension is now part of core
22:20:42:WU00:FS00:0x22: #pragma OPENCL EXTENSION cl_khr_fp64 : enable
22:20:42:WU00:FS00:0x22: ^
22:20:42:WU00:FS00:0x22:
22:20:42:WU00:FS00:0x22:Error:E010:Irreducible ControlFlow Detected
22:20:42:WU00:FS00:0x22:ERROR:125: Failed to create a GPU-enabled OpenMM Context.
22:20:42:WU00:FS00:0x22:Saving result file ..\logfile_01.txt
22:20:42:WU00:FS00:0x22:Saving result file science.log
22:20:42:WU00:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
22:20:42:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
22:20:42:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:13446 run:6351 clone:17 gen:0 core:0x22 unit:0x000000110000000000003486000018cf
22:20:42:WU00:FS00:Uploading 2.82KiB to 54.157.202.86
22:20:42:WU00:FS00:Connecting to 54.157.202.86:8080
22:20:43:WU00:FS00:Upload complete
22:20:43:WU00:FS00:Server responded WORK_ACK (400)
22:20:43:WU00:FS00:Cleaning up
@peastman Did we ever figure out where this is coming from? I'm still seeing a ton of this on Folding@home.
Not that I know of. I gave some suggestions above on how we could begin tracking it down.
Double precision FP was an extension to OpenCL 1.0 and 1.1. It became an optional part of OpenCL 1.2, but the extension was kept for backwards compatibility. Alternatively, `clGetDeviceInfo` can be used to check that `CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE` is greater than 0 to confirm a device supports double precision FP. Listing `cl_khr_fp64` in `CL_DEVICE_EXTENSIONS` is still required in OpenCL 3.0 (pg. 77), so it will continue to be valid as a check for double precision.
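For illustration, here is a minimal standalone sketch of that host-side check (a hypothetical helper, not OpenMM code; it assumes a valid `cl_device_id` named `device`):

```cpp
#include <CL/cl.h>
#include <string>
#include <vector>

// Returns true if the device advertises double-precision support.
bool deviceSupportsFp64(cl_device_id device) {
    // A non-zero native double vector width indicates fp64 hardware support.
    cl_uint doubleWidth = 0;
    clGetDeviceInfo(device, CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE,
                    sizeof(doubleWidth), &doubleWidth, NULL);

    // cl_khr_fp64 must still be listed in CL_DEVICE_EXTENSIONS, even on 3.0.
    size_t extSize = 0;
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, NULL, &extSize);
    std::vector<char> ext(extSize);
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, extSize, ext.data(), NULL);
    std::string extensions(ext.begin(), ext.end());

    return doubleWidth > 0 && extensions.find("cl_khr_fp64") != std::string::npos;
}
```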
An overzealous driver was probably to blame for throwing a warning when `cl_khr_fp64` was explicitly enabled on OpenCL 1.2+.
https://github.com/openmm/openmm/blob/76520ce48ffcc667eec088f5d292ef6ca238353e/platforms/opencl/src/OpenCLContext.cpp#L606-L607
Wrapping this pragma inside an OpenCL version check may avoid having the issue reappear.
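For example, a minimal sketch of such a guard in the kernel source (relying on the `__OPENCL_C_VERSION__` macro, which is defined for OpenCL C 1.1 and later):

```c
// Only enable the extension pragma on pre-1.2 OpenCL C, where fp64 is not yet
// an optional core feature; newer drivers then have nothing to warn about.
#if !defined(__OPENCL_C_VERSION__) || __OPENCL_C_VERSION__ < 120
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#endif
```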
A PR removing the pragma would be welcome! There's no need for a version check. We don't support versions earlier than 1.2 anymore.
Any idea what might cause an error like this (on the Folding@home version, core22 0.0.14)?
The configuration is:
cc: https://foldingforum.org/viewtopic.php?p=348173#p348173