inducer / sumpy

Symbolic code generators for multipole and local expansions and translations

Loads of issues resulting from use of VkFFT #129

Open inducer opened 2 years ago

inducer commented 2 years ago

Issues showed up after #114:

cc @isuruf

inducer commented 2 years ago

Pytential's Mac CI on main is failing now, too: https://github.com/inducer/pytential/runs/7694158164?check_suite_focus=true#step:3:1057

A number of @alexfikl's PRs had picked up similar failures, and this establishes that the issue is not specific to them.

I conjecture that this is connected to the #114 merge. To help back up that conjecture, here's a run without that merge: https://github.com/inducer/pytential/pull/175. If there are no failures there, then that is further evidence that #114 is causing problems.

The CI runs don't necessarily back up the slowness claim.

At least the overall CI times are relatively similar. @alexfikl, could you provide a way to reproduce your Stokes run that got substantially slower?

The missing barrier is something I encountered in work with @rckirby. I still have to work out what's happening; I will report back here.

inducer commented 2 years ago

The first run of https://github.com/inducer/pytential/pull/175 passed. I've just started another run as an additional data point.

inducer commented 2 years ago

Second run passed as well.

alexfikl commented 2 years ago

Just to add more information here from my side.

First, the slowdown: I noticed it on the 3D Stokes operator from inducer/pytential#29 (slightly modified and merged with main). That uses the Laplace formulation from https://github.com/isuruf/pytential/blob/feec35e928767dd6add0cba55883cd15bed93445/pytential/symbolic/stokes.py#L578-L639. I'll try to reproduce something and come back with some numbers.

Then, for the warnings on sumpy, they were

/mnt/data/code/inducer/loopy/loopy/kernel/creation.py:1909: LoopyWarning: in kernel m2l_generate_translation_classes_dependent_data: The single-writer dependency heuristic added dependencies on instruction ID(s) 'set_d' to instruction ID 'm2l__insn_2' after kernel creation is complete. This is deprecated and may stop working in the future. To fix this, ensure that instruction dependencies are added/resolved as soon as possible, ideally at kernel creation time. (add 'single_writer_after_creation' to silenced_warnings kernel attribute to disable)

with different m2l__insn_XX in there. From what I can tell, that comes from somewhere around here: https://github.com/inducer/sumpy/blob/4ace8ea35283db5f9ba8f038ba12f763086e44e2/sumpy/e2e.py#L560-L565, but the dependencies at least seem to be declared correctly.

alexfikl commented 2 years ago

For a reproducer of my slowness claim, this should work: https://gist.github.com/alexfikl/d16750b2407ccbc064151976c5473c53. It just evaluates the 3D Stokeslet using the Laplace kernel (Tornberg).

EDIT: Just started running in pre-114 and it seems to be equally slow. I'll post the numbers once it finishes :cry:

EDIT2: I take that back, something definitely seems off. Pre-114 it was giving

elapsed: 2094.80s wall 0.91x CPU
elapsed: 1917.34s wall 1.01x CPU

so about 30min to compile and evaluate the Stokeslet. But now it's over an hour. I'm running this on dunkel with everything installed through conda.

EDIT3: Ok, something is very fishy, just ran that script on main and

elapsed: 23082.66s wall 1.00x CPU

which is somewhere around 6h! Let me know if you get a chance to take a look or try it out, maybe there's something silly in there :\
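
For reference, the timings quoted above can be put on a common scale with a short script. This is only a sketch: the labels and the `parse_elapsed` and `format_duration` helpers are names made up for illustration, and the numbers are copied from the log lines in this comment.

```python
import re

# Wall-clock log lines copied from the comment above; the labels are illustrative.
LOG_LINES = {
    "pre-114 (run 1)": "elapsed: 2094.80s wall 0.91x CPU",
    "pre-114 (run 2)": "elapsed: 1917.34s wall 1.01x CPU",
    "main": "elapsed: 23082.66s wall 1.00x CPU",
}

_ELAPSED_RE = re.compile(r"elapsed:\s*([0-9.]+)s wall")


def parse_elapsed(line: str) -> float:
    """Return the wall time in seconds from an 'elapsed: ...s wall' line."""
    match = _ELAPSED_RE.search(line)
    if match is None:
        raise ValueError(f"no elapsed time found in {line!r}")
    return float(match.group(1))


def format_duration(seconds: float) -> str:
    """Format a duration in seconds as hours and minutes, rounded to the minute."""
    minutes = round(seconds / 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes:02d}min"


if __name__ == "__main__":
    for label, line in LOG_LINES.items():
        print(f"{label}: {format_duration(parse_elapsed(line))}")
```

Dividing the parsed `main` time by either pre-114 time gives the slowdown factor directly, which may be easier to compare across machines than raw seconds.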

isuruf commented 2 years ago

Warnings about write races are false positives. See https://github.com/inducer/loopy/issues/564

isuruf commented 2 years ago

The macOS failure is a segfault in pocl.

inducer commented 2 years ago

Are you able to get a backtrace? Is it in the pocl runtime or in a kernel?

isuruf commented 2 years ago

(lldb) bt
* thread #4, stop reason = EXC_BAD_ACCESS (code=2, address=0x70000ff87240)
  * frame #0: 0x000000014f69227e VkFFT_main.so`_pocl_kernel_VkFFT_main_workgroup + 7886
    frame #1: 0x0000000136ce7c47 libpocl-devices-pthread.so`work_group_scheduler + 727
    frame #2: 0x0000000136ce723e libpocl-devices-pthread.so`pocl_pthread_driver_thread + 334
    frame #3: 0x00007fff5137c661 libsystem_pthread.dylib`_pthread_body + 340
    frame #4: 0x00007fff5137c50d libsystem_pthread.dylib`_pthread_start + 377
    frame #5: 0x00007fff5137bbf9 libsystem_pthread.dylib`thread_start + 13

inducer commented 2 years ago

So does pocl miscompile vkfft?

inducer commented 2 years ago

Is there an easy way to reproduce this? Will it reproduce on the CEESD M1? Should we start thinking about reverting #114 for now?

inducer commented 2 years ago

Or maybe turn off vkfft on Mac?

isuruf commented 2 years ago

I can reproduce on appletini.

isuruf commented 2 years ago

Running the test with oclgrind passes and shows no out of bounds accesses.

isuruf commented 2 years ago

VkFFT_main.so in the backtrace is the OpenCL kernel compiled to a binary by pocl.

isuruf commented 2 years ago

There's a warning just before the segfault (the bad access), but I don't think they are related.

POCL: in fn clEnqueueNDRangeKernel at line 282:
  |   WARNING |  readonly buffer used as kernel arg, but arg type is not const

isuruf commented 2 years ago

Using an unoptimized loopy kernel instead of vkfft makes the test pass. See https://github.com/inducer/sumpy/pull/130

inducer commented 2 years ago

Thanks! Does this reproduce when calling pyvkfft on its own? I am asking with an eye towards potentially reporting this upstream to pocl.

isuruf commented 2 years ago

Yes, running the example https://github.com/vincefn/pyvkfft/blob/master/examples/opencl-test.py segfaults.

isuruf commented 2 years ago

~Running the example as a script doesn't reproduce the issue, but opening up a jupyter terminal and running the code does. (No idea why)~

inducer commented 2 years ago

@hirish99 reported another VkFFT issue occurring on a Mac: https://gist.github.com/hirish99/16d2888092595283b0f698bf5d8106c0

I do not know whether this is Apple silicon or not.

hirish99 commented 2 years ago

This happens when I try to run helmholtz-dirichlet.py on my Mac. System details (I don't know how much of this matters): macOS Catalina Version 10.15.7, Processor: 2.6 GHz 6-Core Intel Core i7, Memory: 16 GB 2400 MHz DDR4, Graphics: Radeon Pro 560X 4 GB and Intel UHD Graphics 630 1536 MB. Gist: https://gist.github.com/hirish99/16d2888092595283b0f698bf5d8106c0

inducer commented 2 years ago

Quite possibly this is another symptom of the presumed miscompilation described in https://github.com/pocl/pocl/issues/1084.

isuruf commented 2 years ago

@hirish99, can you try running https://gist.github.com/isuruf/17f6b210cf4cf8c8b103c18e155e00d6?

hirish99 commented 2 years ago

Yes, https://gist.github.com/hirish99/622813be95fcc1aeee65ce1fac22e101

isuruf commented 2 years ago

Thanks. What do you get when you run ./a.out; echo $?
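
The `echo $?` above prints the exit status of the test program, which is how a crash can be told apart from a clean run. A rough Python equivalent is sketched below, assuming a POSIX system. `./a.out` is the binary built from the gist, while `describe_exit` and `run_and_describe` are helper names made up here.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Describe a subprocess return code the way a shell user would read `$?`."""
    if returncode == 0:
        return "exited cleanly"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative number.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_describe(argv: list[str]) -> str:
    """Run a command and return a description of how it ended."""
    completed = subprocess.run(argv, check=False)
    return describe_exit(completed.returncode)


if __name__ == "__main__":
    # Hypothetical usage: pass the compiled test program, defaulting to ./a.out
    print(run_and_describe(sys.argv[1:] or ["./a.out"]))
```

A shell reports the same kind of crash as 128 plus the signal number, so a segfault typically shows up as 139 in `$?`.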

hirish99 commented 2 years ago

I have not modified sumpy yet, by the way. I updated the gist to show the output of ./a.out; echo $?

isuruf commented 2 years ago

Thanks. Can you try the updated program at https://gist.github.com/isuruf/17f6b210cf4cf8c8b103c18e155e00d6?

hirish99 commented 2 years ago

https://gist.github.com/hirish99/44f74cdb800b385f811523c881aed088

inducer commented 2 years ago

@hirish99 #130 just landed. Could you check whether this resolves the issue you ran into?

inducer commented 2 years ago

Pytential picked up a failure as well on Gitlab CI: https://gitlab.tiker.net/inducer/sumpy/-/jobs/443459

Edit: That seems to be intermittent. https://gitlab.tiker.net/inducer/sumpy/-/pipelines/324267

inducer commented 2 years ago

@alexfikl #132 avoids at least one instance of lengthy checks during Loopy "scheduling" for me. How does it do on the slowness you report?

alexfikl commented 2 years ago

> @alexfikl #132 avoids at least one instance of lengthy checks during Loopy "scheduling" for me. How does it do on the slowness you report?

Can confirm that that works around it nicely for that Stokeslet case! Just ran #132 compared to fd355eb (before pyvkfft) with the script from https://github.com/inducer/sumpy/issues/129#issuecomment-1207257972 and got

fd355eb: elapsed: 3351.24s wall 1.04x CPU
132:     elapsed: 728.44s wall 1.02x CPU

So that seems to be even faster than before.
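
As a quick sanity check on the two numbers above, the speedup can be computed directly. This is a minimal sketch: the dictionary keys and the `speedup` helper are illustrative names, and the wall times are copied from the log lines in this comment.

```python
# Wall times in seconds, copied from the two log lines above.
WALL_SECONDS = {
    "fd355eb": 3351.24,
    "132": 728.44,
}


def speedup(baseline: float, candidate: float) -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    if candidate <= 0:
        raise ValueError("candidate time must be positive")
    return baseline / candidate


if __name__ == "__main__":
    factor = speedup(WALL_SECONDS["fd355eb"], WALL_SECONDS["132"])
    print(f"#132 is {factor:.1f}x faster than fd355eb")
```

With these inputs the factor comes out at roughly 4.6.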

inducer commented 2 years ago

#133 could help further.

FWIW, #132 and #133 are workarounds more than solutions. Instead, we should fix the code generation to be in the style of #131.

alexfikl commented 2 years ago

> #133 could help further.

Yep, ran the same benchmark script and got

133:    elapsed: 427.10s wall 0.98x CPU

inducer commented 2 years ago

Between #132 and #131, I think we've more or less dealt with the slowness. I've checked off that issue up top.

inducer commented 2 years ago

That leaves the intermittent VkFFT errors. I'm still a bit lost there. @isuruf mentioned he also can't seem to reproduce them.

isuruf commented 2 years ago

macOS errors are gone because we don't use VkFFT for macOS.

As to the VkFFT OpenCL compile failure in https://gitlab.tiker.net/inducer/sumpy/-/jobs/443459, I've tried many times and failed to reproduce. Only explanation I can think of is that the machine was particularly busy and the OS killed the pocl process compiling the OpenCL kernel.

inducer commented 2 years ago

Let's see how often they recur. I feel like I've seen them more frequently than could be explained by heavy machine load.

alexfikl commented 2 years ago

That pipeline was running on dunkel from what I can tell. I've been running my Stokes stuff there too with the latest main and haven't seen any crashes at all, so I have no idea what could be causing them.