openmm / openmm

OpenMM is a toolkit for molecular simulation using high performance GPU code.

OpenCL error: Irreducible ControlFlow Detected #2986

Closed jchodera closed 1 year ago

jchodera commented 3 years ago

Any idea what might cause an error like this (on the Folding@home version, core22 0.0.14)?

Failed to create OpenCL context:
Error compiling kernel: "C:\Users\Owner\AppData\Local\Temp\OCL5264T24.cl", line 21: warning: OpenCL
          extension is now part of core
  #pragma OPENCL EXTENSION cl_khr_fp64 : enable
                           ^

Error:E010:Irreducible ControlFlow Detected

The configuration is:

************************************ System ************************************
        CPU: Intel(R) Pentium(R) CPU G840 @ 2.80GHz
     CPU ID: GenuineIntel Family 6 Model 42 Stepping 7
       CPUs: 2
     Memory: 15.98GiB
Free Memory: 11.53GiB
    Threads: WINDOWS_THREADS
 OS Version: 6.2
Has Battery: false
 On Battery: false
 UTC Offset: -5
        PID: 5264
        CWD: C:\ProgramData\FAHClient\work
************************************ OpenMM ************************************
   Revision: 189320d0
********************************************************************************
  -- 0 --
  PROFILE = FULL_PROFILE
  VERSION = OpenCL 2.1 AMD-APP (3188.4)
  NAME = AMD Accelerated Parallel Processing
  VENDOR = Advanced Micro Devices, Inc.

(1) device(s) found on platform 0:
  -- 0 --
  DEVICE_NAME = Capeverde
  DEVICE_VENDOR = Advanced Micro Devices, Inc.
  DEVICE_VERSION = OpenCL 1.2 AMD-APP (3188.4)
  DRIVER_VERSION = 3188.4

cc: https://foldingforum.org/viewtopic.php?p=348173#p348173

peastman commented 3 years ago

The part about the extension is just a warning. You can ignore it.

Regarding the error, this looks relevant: https://community.khronos.org/t/errorirreducible-controlflow-detected/1986. Is that all we have in the log? Unfortunately, it doesn't give any indication about what kernel is causing the problem.

jchodera commented 3 years ago

It's likely one of the Custom forces since the next WU did not have any of those and ran successfully.

Would we need to try to build and run a system with one Force at a time in order to debug? Is there a way to step through compiling kernel by kernel?
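
One way to approach the "one Force at a time" idea without rebuilding the core is sketched below. It is only an illustration and was not run in this thread. `find_failing_forces` and `make_openmm_tester` are hypothetical helper names, and the sketch assumes an OpenMM installation importable as `openmm` (or `simtk.openmm` in older releases) and a serialized `system.xml`. It also swaps in a plain `VerletIntegrator`, so it can only catch failures that a single force triggers at Context creation.

```python
from typing import Callable, List


def find_failing_forces(num_forces: int, try_force: Callable[[int], None]) -> List[int]:
    """Return the indices of forces for which try_force raises an exception."""
    failing = []
    for index in range(num_forces):
        try:
            try_force(index)
        except Exception as err:  # a failed kernel compile surfaces as an exception
            print(f"force {index} failed: {err}")
            failing.append(index)
    return failing


def make_openmm_tester(system_xml_path: str, platform_name: str = "OpenCL"):
    """Hypothetical helper: return (number of forces, try_force callable)."""
    try:  # OpenMM is imported lazily so the helper above works without it
        import openmm as mm
    except ImportError:
        from simtk import openmm as mm

    with open(system_xml_path) as handle:
        full_system = mm.XmlSerializer.deserialize(handle.read())

    def try_force(index: int) -> None:
        # Copy only the particles and box, then add a copy of the force under test.
        system = mm.System()
        for i in range(full_system.getNumParticles()):
            system.addParticle(full_system.getParticleMass(i))
        system.setDefaultPeriodicBoxVectors(*full_system.getDefaultPeriodicBoxVectors())
        force_xml = mm.XmlSerializer.serialize(full_system.getForce(index))
        system.addForce(mm.XmlSerializer.deserialize(force_xml))
        # Assumption: a simple integrator stands in for the original CustomIntegrator.
        integrator = mm.VerletIntegrator(0.001)
        platform = mm.Platform.getPlatformByName(platform_name)
        mm.Context(system, integrator, platform)  # may raise if a kernel fails to compile

    return full_system.getNumForces(), try_force
```

Usage would look like `n, tester = make_openmm_tester('system.xml')` followed by `find_failing_forces(n, tester)`. If the error only appears when several forces are combined, this isolation test would miss it.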

peastman commented 3 years ago

If we can add debugging code to the core, we could just have it print out the source of each kernel before compiling it.

bb30994 commented 3 years ago

We've seen the "Irreducible ControlFlow Detected" message before, though not frequently enough to identify a pattern. Is there a reasonable way to add in-line diagnostic information to that particular error? That just isn't enough information to answer John's question.

bdenhollander commented 3 years ago

> Regarding the error, this looks relevant: https://community.khronos.org/t/errorirreducible-controlflow-detected/1986. Is that all we have in the log? Unfortunately, it doesn't give any indication about what kernel is causing the problem.

I searched through all the .cl and .cc files for loops, based on that thread, and found one instance where the starting condition is specified outside the loop. It should be obvious to the compiler, but it is stylistically inconsistent with the rest of the code base. https://github.com/openmm/openmm/blob/fce26088352b1d5e650b93a8d1e5a57587ae1f64/platforms/common/src/kernels/verlet.cc#L57-L58

I don't know what this while loop does but it looks suspicious since tbx is unchanged. https://github.com/openmm/openmm/blob/fce26088352b1d5e650b93a8d1e5a57587ae1f64/platforms/common/src/kernels/gbsaObc.cc#L206-L220

customGBValueN2.cc and nonbonded.cl have similar loops. The CPU version looks more likely to break out of the loop. https://github.com/openmm/openmm/blob/fce26088352b1d5e650b93a8d1e5a57587ae1f64/platforms/common/src/kernels/gbsaObc_cpu.cc#L212-L221

peastman commented 3 years ago

That loop is scanning through the exclusionTiles array to find a particular index. The exit condition isn't based on tbx changing. It's based on the values of the latest data that got loaded into skipTiles.

Did the system with the error involve implicit solvent? What integrator did it use? That will tell us whether the above code could be related.

jchodera commented 3 years ago

This was FAH project 13438, for the COVID Moonshot, which involves a hybrid alchemical system with a good number of Custom*Force terms and NonbondedForce perturbation groups.

I've attached serialized XML files of the RUN that failed if that is of interest.

PROJ13438-RUN12681.zip

peastman commented 3 years ago

It doesn't have a GBSAOBCForce, and it uses a CustomIntegrator instead of a VerletIntegrator. So none of the loops mentioned above is involved.
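
A quick way to make this kind of check on the attached files, without loading OpenMM, is to count the force types in the serialized XML. The sketch below is an illustration only. It assumes that the serialized System stores each force as a `Force` element with a `type` attribute, and `force_types` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def force_types(system_xml: str) -> Counter:
    """Count the `type` attribute of every <Force> element in a serialized System."""
    root = ET.fromstring(system_xml)
    # iter() walks all descendants, so forces nested inside other forces are counted too.
    return Counter(elem.get("type", "unknown") for elem in root.iter("Force"))
```

For example, `force_types(open('system.xml').read())` returns a counter in which a missing entry such as `GBSAOBCForce` reads as zero.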

peastman commented 3 years ago

I can't reproduce this on an AMD Navi GPU. The following script runs without problems.

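# Reproduction attempt using the serialized files from the attached work unit.
from simtk.openmm import *

# Load the serialized System, Integrator, and State.
system = XmlSerializer.deserialize(open('system.xml').read())
integrator = XmlSerializer.deserialize(open('integrator.xml').read())
state = XmlSerializer.deserialize(open('state.xml').read())

# Context creation is the step that failed with the kernel compilation error in the log.
context = Context(system, integrator, Platform.getPlatformByName('OpenCL'))
context.setState(state)

# Request forces so the force kernels actually execute.
context.getState(getForces=True)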

It's a different GPU of course, and also a different OS (Ubuntu 20.04). Cape Verde is a pretty old GPU, released in 2012.

jchodera commented 3 years ago

Thanks for trying this out!

Is there any instrumentation we can add to the core to bring back more information?

Failing that, we will keep trying to find someone experiencing this issue.

It's unclear to me whether Cape Verde refers to the first release in 2012 or the architecture, which has been in production for many years.

jchodera commented 3 years ago

Hm, I might be misreading the info about which GPUs featured Cape Verde: https://www.techpowerup.com/gpu-specs/amd-cape-verde.g100

peastman commented 3 years ago

Cape Verde was a specific GPU. It was based on the GCN 1.0 architecture.

If you can find someone who is experiencing the problem, we definitely could create an instrumented core that would provide more information.

weisspe commented 3 years ago

I have a cape verde card and am currently experiencing this issue on Windows.

I wasn't getting it last week. I think my last GPU work unit was completed Friday night, and nothing has changed since then as far as I know. I have both Windows updates and my graphics driver updates configured to notify me of available updates but not install anything automatically, so I feel confident about that.

I'd be happy to run a modified version of FAH to gather more information about this issue. Given that this started without any software changes, it is possible the issue is related to the work units being issued by the server, or to something else that could change and 'fix' itself on its own, so we'll have to cross our fingers that it continues long enough to test.

peastman commented 3 years ago

Great! @jchodera are you set up for building cores? Here are the lines where it compiles kernels:

https://github.com/openmm/openmm/blob/9008050c9e6ccb477442d0db1fee6e53782f4cbf/platforms/opencl/src/OpenCLContext.cpp#L616-L622

Immediately before those lines, add the line

cout<<src.str()<<endl;

That will make it print the source for each kernel to the console (which I believe gets redirected to one of the logs?) just before attempting to compile it. Then we can see which kernel it last attempted to compile.
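
The same idea in miniature, as a Python sketch rather than the C++ in `OpenCLContext.cpp`: write each kernel source to the log immediately before handing it to the compiler, so the last source in the log belongs to the kernel that failed. `compile_with_trace` and the `compile_fn` callback are hypothetical names used only for illustration.

```python
import sys
from typing import Callable, Iterable, TextIO


def compile_with_trace(sources: Iterable[str],
                       compile_fn: Callable[[str], object],
                       log: TextIO = sys.stdout) -> list:
    """Compile each source in order, writing it to the log before compiling."""
    programs = []
    for number, source in enumerate(sources):
        log.write(f"=== kernel source {number} ===\n{source}\n")
        log.flush()  # push the source out before compile_fn has a chance to fail
        programs.append(compile_fn(source))
    return programs
```

Flushing before each compile matters: if the compiler aborts, the source that triggered the failure is already in the log.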

weisspe commented 3 years ago

I haven't done any sort of development for the project, so I'm not sure if I'm set up for building cores. Based on the log messages, it seems like the Folding@home client may be doing the building for me. If you can point me to the common location for the core code, I'd be happy to modify it and see what I get in my logs.

weisspe commented 3 years ago

I just realized that I have the file path from the logs, so that solves that. However, that's a temp directory and it no longer exists for me. It seems like the Folding@home client is downloading the kernel code and cleaning it up rather quickly, so I'm not sure of the best way to jump in and interfere. Any suggestions?

jchodera commented 3 years ago

Apologies we haven't been able to make progress on this yet. We're still working on automating core22 builds with @dotsdl but hope to have something soon we can use to help debug this.

jchodera commented 3 years ago

This seems to be specific to Custom*Forces, since it's only appearing with my COVID Moonshot alchemical free energy calculations.

This issue may be related?

gunnarre commented 3 years ago

I am still seeing some of these errors on the Radeon 7770 HD under Windows 10 on project 13446.

22:20:16:WU00:FS00:0x22:*************************** Core22 Folding@home Core ***************************
22:20:16:WU00:FS00:0x22:       Core: Core22
22:20:16:WU00:FS00:0x22:       Type: 0x22
22:20:16:WU00:FS00:0x22:    Version: 0.0.13
(....)
22:20:16:WU00:FS00:0x22:************************************ OpenMM ************************************
22:20:16:WU00:FS00:0x22:   Revision: 189320d0
22:20:16:WU00:FS00:0x22:********************************************************************************
22:20:16:WU00:FS00:0x22:Project: 13446 (Run 6351, Clone 17, Gen 0)
22:20:16:WU00:FS00:0x22:Unit: 0x00000000000000000000000000000000
22:20:16:WU00:FS00:0x22:Reading tar file core.xml
22:20:16:WU00:FS00:0x22:Reading tar file integrator.xml.bz2
22:20:16:WU00:FS00:0x22:Reading tar file state.xml.bz2
22:20:16:WU00:FS00:0x22:Reading tar file system.xml.bz2
22:20:16:WU00:FS00:0x22:Digital signatures verified
22:20:16:WU00:FS00:0x22:Folding@home GPU Core22 Folding@home Core
22:20:16:WU00:FS00:0x22:Version 0.0.13
22:20:17:WU00:FS00:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
22:20:17:WU00:FS00:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
22:20:17:WU00:FS00:0x22:  XTC frame write interval: 250000 steps (25%) [4 total]
22:20:17:WU00:FS00:0x22:  Global context and integrator variables write interval: 25000 steps (2.5%) [40 total]
22:20:17:WU00:FS00:0x22:There are 3 platforms available.
22:20:17:WU00:FS00:0x22:Platform 0: Reference
22:20:17:WU00:FS00:0x22:Platform 1: CPU
22:20:17:WU00:FS00:0x22:Platform 2: OpenCL
22:20:17:WU00:FS00:0x22:  opencl-device 0 specified
22:20:34:WU00:FS00:0x22:Attempting to create OpenCL context:
22:20:34:WU00:FS00:0x22:  Configuring platform OpenCL
22:20:42:WU00:FS00:0x22:Failed to create OpenCL context:
22:20:42:WU00:FS00:0x22:Error compiling kernel: "C:\Users\admin\AppData\Local\Temp\OCL6916T24.cl", line 21: warning: OpenCL
22:20:42:WU00:FS00:0x22:          extension is now part of core
22:20:42:WU00:FS00:0x22:  #pragma OPENCL EXTENSION cl_khr_fp64 : enable
22:20:42:WU00:FS00:0x22:                           ^
22:20:42:WU00:FS00:0x22:
22:20:42:WU00:FS00:0x22:Error:E010:Irreducible ControlFlow Detected
22:20:42:WU00:FS00:0x22:ERROR:125: Failed to create a GPU-enabled OpenMM Context.
22:20:42:WU00:FS00:0x22:Saving result file ..\logfile_01.txt
22:20:42:WU00:FS00:0x22:Saving result file science.log
22:20:42:WU00:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
22:20:42:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
22:20:42:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:13446 run:6351 clone:17 gen:0 core:0x22 unit:0x000000110000000000003486000018cf
22:20:42:WU00:FS00:Uploading 2.82KiB to 54.157.202.86
22:20:42:WU00:FS00:Connecting to 54.157.202.86:8080
22:20:43:WU00:FS00:Upload complete
22:20:43:WU00:FS00:Server responded WORK_ACK (400)
22:20:43:WU00:FS00:Cleaning up

jchodera commented 3 years ago

@peastman Did we ever figure out where this is coming from? I'm still seeing a ton of this on Folding@home.

peastman commented 3 years ago

Not that I know of. I gave some suggestions above on how we could begin tracking it down.

bdenhollander commented 1 year ago

Double precision FP was an extension to OpenCL 1.0 and 1.1. It became an optional part of OpenCL 1.2, but the extension was kept for backwards compatibility. Alternatively, clGetDeviceInfo can be used to check that CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE is greater than 0 to confirm a device supports double precision FP. Listing cl_khr_fp64 in CL_DEVICE_EXTENSIONS is still required in OpenCL 3.0 (pg. 77) so it will continue to be valid as a check for double precision.
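
As a rough sketch of the two checks described above, the function below takes the values that clGetDeviceInfo would return for CL_DEVICE_EXTENSIONS and CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE and reports whether either one indicates double precision support. `supports_fp64` is a hypothetical helper name. How the two values are obtained (for example through a Python OpenCL binding) is left as an assumption and is not exercised here.

```python
def supports_fp64(extensions: str, native_vector_width_double: int) -> bool:
    """Return True if either check indicates double precision support.

    extensions: space-separated CL_DEVICE_EXTENSIONS string.
    native_vector_width_double: CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE value.
    """
    # Compare whole extension names so a longer name cannot match by accident.
    has_extension = "cl_khr_fp64" in extensions.split()
    return has_extension or native_vector_width_double > 0
```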

An overzealous driver was probably to blame for throwing a warning when cl_khr_fp64 was explicitly enabled on OpenCL 1.2+. https://github.com/openmm/openmm/blob/76520ce48ffcc667eec088f5d292ef6ca238353e/platforms/opencl/src/OpenCLContext.cpp#L606-L607 Wrapping this pragma inside an OpenCL version check may avoid having the issue reappear.
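
The version check suggested here could look roughly like the following. This is a Python sketch of the logic only, not the C++ in `OpenCLContext.cpp`. `parse_opencl_version` and `fp64_preamble` are hypothetical names, and the sketch relies on CL_DEVICE_VERSION strings starting with `OpenCL <major>.<minor>`, as in the log at the top of this issue.

```python
import re

FP64_PRAGMA = "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"


def parse_opencl_version(device_version: str) -> tuple:
    """Extract (major, minor) from a string such as 'OpenCL 1.2 AMD-APP (3188.4)'."""
    match = re.match(r"OpenCL (\d+)\.(\d+)", device_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {device_version!r}")
    return int(match.group(1)), int(match.group(2))


def fp64_preamble(device_version: str) -> str:
    """Return the fp64 pragma only for devices older than OpenCL 1.2."""
    return FP64_PRAGMA if parse_opencl_version(device_version) < (1, 2) else ""
```

For the Cape Verde device reported above (`OpenCL 1.2 AMD-APP (3188.4)`), this would omit the pragma and with it the warning.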

peastman commented 1 year ago

A PR removing the pragma would be welcome! There's no need for a version check. We don't support versions earlier than 1.2 anymore.