inducer opened this issue 2 years ago
Pytential's Mac CI on main is failing now, too: https://github.com/inducer/pytential/runs/7694158164?check_suite_focus=true#step:3:1057
A number of @alexfikl's PRs had picked up similar failures, and this establishes that the issue is not specific to them.
I conjecture that this is connected to the #114 merge. To help back up that conjecture, here's a run without that merge: https://github.com/inducer/pytential/pull/175. If there are no failures there, then that is further evidence that #114 is causing problems.
The CI runs don't necessarily back up the slowness claim.
At least the overall CI times are relatively similar. @alexfikl, could you provide a way to reproduce your Stokes run that got substantially slower?
The missing barrier is something I encountered in work with @rckirby. I still have to work out what's happening; I will report back here.
The first run of https://github.com/inducer/pytential/pull/175 passed. I've just started another run as an additional data point.
Second run passed as well.
Just to add more information here from my side.
First, regarding the slowdown: I noticed it on the 3D Stokes operator from inducer/pytential#29 (slightly modified and merged with main). That uses the Laplace formulation from
https://github.com/isuruf/pytential/blob/feec35e928767dd6add0cba55883cd15bed93445/pytential/symbolic/stokes.py#L578-L639
I'll try to reproduce something and come back with some numbers.
Then, for the warnings on sumpy, they were
/mnt/data/code/inducer/loopy/loopy/kernel/creation.py:1909: LoopyWarning: in kernel m2l_generate_translation_classes_dependent_data: The single-writer dependency heuristic added dependencies on instruction ID(s) 'set_d' to instruction ID 'm2l__insn_2' after kernel creation is complete. This is deprecated and may stop working in the future. To fix this, ensure that instruction dependencies are added/resolved as soon as possible, ideally at kernel creation time. (add 'single_writer_after_creation' to silenced_warnings kernel attribute to disable)
with different m2l__insn_XX in there. From what I can tell, that comes from somewhere around
https://github.com/inducer/sumpy/blob/4ace8ea35283db5f9ba8f038ba12f763086e44e2/sumpy/e2e.py#L560-L565
but the dependencies seem correctly declared, at least.
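As an aside, the warning text's own suggestion is to add 'single_writer_after_creation' to the kernel's silenced_warnings attribute. A minimal sketch of suppressing such a warning category with the stdlib warnings machinery (using a stand-in class, since loopy's real LoopyWarning lives in loopy.diagnostic and isn't imported here):

```python
import warnings

class LoopyWarning(UserWarning):
    """Stand-in for loopy.diagnostic.LoopyWarning (assumption: loopy not imported here)."""

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # The loopy-native route would be to pass
    # silenced_warnings=["single_writer_after_creation"] when building the
    # kernel (as the warning message itself suggests). Generically, the
    # category can be filtered like this:
    warnings.filterwarnings("ignore", category=LoopyWarning)
    warnings.warn("single-writer dependency heuristic fired", LoopyWarning)
    warnings.warn("unrelated warning", UserWarning)

# Only the unrelated UserWarning survives the filter.
print([str(w.message) for w in caught])
```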
For a reproducer of my slowness claim, this should work: https://gist.github.com/alexfikl/d16750b2407ccbc064151976c5473c53. It just evaluates the 3D Stokeslet using the Laplace kernel (Tornberg).
EDIT: Just started running in pre-114 and it seems to be equally slow. I'll post the numbers once it finishes :cry:
EDIT2: I take that back, something definitely seems off. Pre-114 it was giving
elapsed: 2094.80s wall 0.91x CPU
elapsed: 1917.34s wall 1.01x CPU
so about 30 minutes to compile and evaluate the Stokeslet, but now it's over an hour. I'm running this on dunkel with everything installed through conda.
EDIT3: Ok, something is very fishy; I just ran that script on main and got
elapsed: 23082.66s wall 1.00x CPU
which is somewhere around 6.4 hours! Let me know if you get a chance to take a look or try it out; maybe there's something silly in there :\
Warnings about write races are false positives. See https://github.com/inducer/loopy/issues/564
The macOS failure is a segfault in pocl.
Are you able to get a backtrace? Is it in the pocl runtime or in a kernel?
(lldb) bt
* thread #4, stop reason = EXC_BAD_ACCESS (code=2, address=0x70000ff87240)
* frame #0: 0x000000014f69227e VkFFT_main.so`_pocl_kernel_VkFFT_main_workgroup + 7886
frame #1: 0x0000000136ce7c47 libpocl-devices-pthread.so`work_group_scheduler + 727
frame #2: 0x0000000136ce723e libpocl-devices-pthread.so`pocl_pthread_driver_thread + 334
frame #3: 0x00007fff5137c661 libsystem_pthread.dylib`_pthread_body + 340
frame #4: 0x00007fff5137c50d libsystem_pthread.dylib`_pthread_start + 377
frame #5: 0x00007fff5137bbf9 libsystem_pthread.dylib`thread_start + 13
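On the Python side, the stdlib faulthandler module can complement a native lldb backtrace like the one above: it dumps each thread's Python-level traceback when the process receives SIGSEGV. (It won't show frames inside the pocl-compiled kernel, only where the Python code was when the crash hit.)

```python
import faulthandler
import sys

# On SIGSEGV/SIGBUS/SIGFPE/SIGABRT, print the Python-level traceback of
# every thread to stderr. Complements, but does not replace, a native
# debugger backtrace, since the crash here occurs in pocl-compiled code.
faulthandler.enable(file=sys.stderr, all_threads=True)

# A traceback of all threads can also be dumped on demand:
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```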
So does pocl miscompile vkfft?
Is there an easy way to reproduce this? Will it reproduce on the CEESD M1? Should we start thinking about reverting #114 for now?
Or maybe turn off vkfft on Mac?
I can reproduce on appletini.
Running the test with oclgrind passes and shows no out-of-bounds accesses.
VkFFT_main.so in the backtrace is the OpenCL kernel compiled to a binary by pocl.
There's a warning just before the segfault (which is due to a bad access), but I don't think the two are related.
POCL: in fn clEnqueueNDRangeKernel at line 282:
| WARNING | readonly buffer used as kernel arg, but arg type is not const
Using an unoptimized loopy kernel instead of vkfft makes the test pass. See https://github.com/inducer/sumpy/pull/130
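Generically, the workaround in #130 amounts to a guarded backend choice. A sketch of that pattern follows; the "vkfft"/"loopy" tags and the pick_fft_backend helper are hypothetical illustrations, not sumpy's actual API:

```python
import platform

def pick_fft_backend(prefer_vkfft=True):
    """Choose an FFT backend tag, avoiding VkFFT on macOS.

    Hypothetical helper: it only sketches the guarded-fallback idea
    (skip VkFFT on Darwin, fall back if pyvkfft is absent).
    """
    if prefer_vkfft and platform.system() != "Darwin":
        try:
            import pyvkfft  # noqa: F401  # may not be installed
            return "vkfft"
        except ImportError:
            pass
    return "loopy"

print(pick_fft_backend())
```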
Thanks! Does this reproduce when calling pyvkfft on its own? I am asking with an eye towards potentially reporting this upstream to pocl.
Yes, running the example https://github.com/vincefn/pyvkfft/blob/master/examples/opencl-test.py segfaults.
~~Running the example as a script doesn't reproduce the issue, but opening up a jupyter terminal and running the code does. (No idea why)~~
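When scripting a check for a segfaulting reproducer like this, note that subprocess reports a child killed by a signal as a negative returncode (-signum). A minimal sketch; the crash here is induced artificially by raising SIGSEGV in the child, not via pyvkfft:

```python
import signal
import subprocess
import sys

# Child that kills itself with SIGSEGV, standing in for a segfaulting
# reproducer script (hypothetical; not the pyvkfft example itself).
crasher = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

proc = subprocess.run(
    [sys.executable, "-c", crasher],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
# On POSIX this is typically -11, i.e. -signal.SIGSEGV.
print(proc.returncode)
```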
@hirish99 reported another VkFFT issue occurring on a Mac: https://gist.github.com/hirish99/16d2888092595283b0f698bf5d8106c0
I do not know whether this is Apple silicon or not.
When I try to run helmholtz-dirichlet.py on my Mac, I get the errors in this gist: https://gist.github.com/hirish99/16d2888092595283b0f698bf5d8106c0. In case it matters, the machine runs macOS Catalina 10.15.7 with a 2.6 GHz 6-core Intel Core i7, 16 GB 2400 MHz DDR4 RAM, and Radeon Pro 560X 4 GB / Intel UHD Graphics 630 1536 MB graphics.
Quite possibly this is another symptom of the presumed miscompilation described in https://github.com/pocl/pocl/issues/1084.
@hirish99, can you try running https://gist.github.com/isuruf/17f6b210cf4cf8c8b103c18e155e00d6?
Thanks. What do you get when you run `./a.out; echo $?`
I have not modified sumpy yet, btw. I updated the gist to show the output of `./a.out; echo $?`
Thanks. Can you try the updated program at https://gist.github.com/isuruf/17f6b210cf4cf8c8b103c18e155e00d6?
@hirish99 #130 just landed. Could you check whether this resolves the issue you ran into?
Pytential picked up a failure as well on GitLab CI: https://gitlab.tiker.net/inducer/sumpy/-/jobs/443459
Edit: That seems to be intermittent. https://gitlab.tiker.net/inducer/sumpy/-/pipelines/324267
@alexfikl #132 avoids at least one instance of lengthy checks during Loopy "scheduling" for me. How does it do on the slowness you report?
Can confirm that that works around it nicely for that Stokeslet case! Just ran #132 compared to fd355eb (before pyvkfft) with the script from https://github.com/inducer/sumpy/issues/129#issuecomment-1207257972 and got
fd355eb: elapsed: 3351.24s wall 1.04x CPU
#132: elapsed: 728.44s wall 1.02x CPU
So that seems to be even faster than before.
FWIW, #132 and #133 are workarounds more than solutions. Instead, we should fix the code generation to be in the style of #131.
#133 could help further.
Yep, ran the same benchmark script and got
#133: elapsed: 427.10s wall 0.98x CPU
Between #132 and #131, I think we've more or less dealt with the slowness. I've checked off that issue up top.
That leaves the intermittent VkFFT errors. I'm still a bit lost there. @isuruf mentioned he also can't seem to reproduce them.
The macOS errors are gone because we don't use VkFFT on macOS.
As to the VkFFT OpenCL compile failure in https://gitlab.tiker.net/inducer/sumpy/-/jobs/443459, I've tried many times and failed to reproduce it. The only explanation I can think of is that the machine was particularly busy and the OS killed the pocl process compiling the OpenCL kernel.
Let's see how often they recur. I feel like I've seen them more frequently than could be explained by heavy machine load.
That pipeline was running on dunkel from what I can tell. I've been running my Stokes stuff there too with the latest main and haven't seen any crashes at all, so no idea what could be causing them.
Issues showed up after #114:
cc @isuruf