I just ran that test over and over on dunkel (with 5525398), while simulating a resource-constrained environment:
while true; do rm-vis; POCL_MAX_PTHREAD_COUNT=4 PYOPENCL_TEST=port numactl -N 0,1 python ./laplace-dirichlet-3d.py || break ; done
It ran around 100 times without ever encountering the issue.
Ran the same command on a conda installation (which should be the same as what the GitHub CIs run?). It ran for about seven hours without crashing once.
In #61, I'm trying to use the (very slick-looking, actually) tmate, via their GitHub Action, to get a shell on the runner and see what's happening.
I'm now running
i=0; while true; do i=$((i+1)); echo "RUN NUMBER $i";rm -f *vtu *vts; PYOPENCL_TEST=port python ./laplace-dirichlet-3d.py || break ; done
interactively on the Azure machine behind GitHub Actions, and the test has run successfully ~10 times so far (which doesn't match the stats otherwise). Maybe this is somehow dependent on the processor type? This one is an Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz.
ci-support will now log the CPU type: https://github.com/inducer/ci-support/commit/610dcc4376ce290fb17f851bc21eba464ce27205
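For reference, a rough Python equivalent of that logging (illustrative only; the actual ci-support change presumably does this in shell):

# Print the CPU model string on Linux, to correlate failures with machine type.
with open("/proc/cpuinfo") as f:
    print(next(line.strip() for line in f if line.startswith("model name")))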
OK, so much for the theory that this is system-dependent.
I think I've found a way to reproduce the problem somewhat reliably. Run the examples build on #61 and then
bash # start a new shell so that the tmux window doesn't close when an error occurs
(export PYOPENCL_TEST=port:pthread; source .miniforge3/bin/activate testing ; source ci-support.sh ; run_examples)
Interesting: doing the same thing multiple times gives different results:
diff --git a/examples/laplace-dirichlet-3d.py b/examples/laplace-dirichlet-3d.py
index 48a737a..865b1c3 100644
--- a/examples/laplace-dirichlet-3d.py
+++ b/examples/laplace-dirichlet-3d.py
@@ -130,9 +130,20 @@ def main(mesh_name="torus", visualize=False):
 
     bvp_rhs = bind(places, sqrt_w*sym.var("bc"))(actx, bc=bc)
 
+    spop = bound_op.scipy_op(actx, "sigma", dtype=np.float64)
+
+    from meshmode.dof_array import flat_norm
+    for i in range(10):
+        res = spop.matvec(bvp_rhs)
+        print(i, actx.np.linalg.norm(res))
+    1/0
+
+
+
+
     from pytential.solve import gmres
     gmres_result = gmres(
-            bound_op.scipy_op(actx, "sigma", dtype=np.float64),
+            spop,
             bvp_rhs, tol=1e-14, progress=True,
             stall_iterations=0,
             hard_failure=True)
gives
0 18271.189240367086
1 18266.885665550613
2 18284.471850638434
3 18289.173559412557
4 18289.142404135426
5 18284.471850638434
6 18284.471850638434
7 18306.65272297942
8 18306.65272297942
9 18290.04129860244
For context: this prints the norm of what should be one and the same vector.
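To replicate this kind of check outside the example, here is a generic sketch; op stands in for bound_op.scipy_op(...) and rhs for bvp_rhs, both assumed to be numpy-compatible here:

import numpy as np

def check_matvec_determinism(op, rhs, nruns=10):
    # Apply the same operator to the same vector repeatedly; any mismatch
    # between runs indicates nondeterministic evaluation.
    ref = np.asarray(op.matvec(rhs))
    for i in range(1, nruns):
        res = np.asarray(op.matvec(rhs))
        if not np.array_equal(ref, res):
            print(f"run {i}: max abs deviation {np.max(np.abs(ref - res))}")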
This diff still yields different matvec results each time the FMM is run:
diff --git a/examples/laplace-dirichlet-3d.py b/examples/laplace-dirichlet-3d.py
index 48a737a..3148440 100644
--- a/examples/laplace-dirichlet-3d.py
+++ b/examples/laplace-dirichlet-3d.py
@@ -130,9 +130,20 @@ def main(mesh_name="torus", visualize=False):
 
     bvp_rhs = bind(places, sqrt_w*sym.var("bc"))(actx, bc=bc)
 
+    spop = bound_op.scipy_op(actx, "sigma", dtype=np.float64)
+
+    from meshmode.dof_array import flat_norm
+    for i in range(10):
+        res = spop.matvec(bvp_rhs)
+        print(i, actx.np.linalg.norm(bvp_rhs), actx.np.linalg.norm(res))
+    1/0
+
+
+
+
     from pytential.solve import gmres
     gmres_result = gmres(
-            bound_op.scipy_op(actx, "sigma", dtype=np.float64),
+            spop,
             bvp_rhs, tol=1e-14, progress=True,
             stall_iterations=0,
             hard_failure=True)
diff --git a/pytential/qbx/fmm.py b/pytential/qbx/fmm.py
index a58524b..4118a22 100644
--- a/pytential/qbx/fmm.py
+++ b/pytential/qbx/fmm.py
@@ -543,23 +543,23 @@ def drive_fmm(expansion_wrangler, src_weight_vecs, timing_data=None,
     recorder.add("translate_box_multipoles_to_qbx_local", timing_future)
 
-    qbx_expansions = qbx_expansions + local_result
+    qbx_expansions = local_result
 
     local_result, timing_future = (
             wrangler.translate_box_local_to_qbx_local(local_exps))
     recorder.add("translate_box_local_to_qbx_local", timing_future)
 
-    qbx_expansions = qbx_expansions + local_result
+    #qbx_expansions = local_result
 
     qbx_potentials, timing_future = wrangler.eval_qbx_expansions(qbx_expansions)
     recorder.add("eval_qbx_expansions", timing_future)
 
-    ts_result, timing_future = \
-            wrangler.eval_target_specific_qbx_locals(src_weight_vecs)
+    #ts_result, timing_future = \
+    #        wrangler.eval_target_specific_qbx_locals(src_weight_vecs)
 
-    qbx_potentials = qbx_potentials + ts_result
+    #qbx_potentials = qbx_potentials + ts_result
 
     recorder.add("eval_target_specific_qbx_locals", timing_future)
This suggests that m2qbxl may have a role to play in this. The results of the remainder of the FMM don't seem to "wobble".
It's the sumpy backend that has the issue. Both the single- and double-layer kernels exhibit the problem.
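For anyone retracing this: the backend switch is just a keyword argument on the layer potential source. A sketch (constructor arguments abbreviated and illustrative, lifted loosely from the example script; only fmm_backend matters here):

from pytential.qbx import QBXLayerPotentialSource

# Per the above, the issue appears with fmm_backend="sumpy" (the default);
# switching to "fmmlib" is the natural cross-check.
qbx = QBXLayerPotentialSource(
        pre_density_discr, fine_order=4*target_order,
        qbx_order=qbx_order, fmm_order=qbx_order + 3,
        fmm_backend="fmmlib")

The diff below additionally comments out the double-layer term, so that the single-layer kernel can be observed in isolation: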
diff --git a/examples/laplace-dirichlet-3d.py b/examples/laplace-dirichlet-3d.py
index 48a737a..1b57a80 100644
--- a/examples/laplace-dirichlet-3d.py
+++ b/examples/laplace-dirichlet-3d.py
@@ -102,7 +102,7 @@ def main(mesh_name="torus", visualize=False):
     bdry_op_sym = (loc_sign*0.5*sigma_sym
             + sqrt_w*(
                 sym.S(kernel, inv_sqrt_w_sigma, qbx_forced_limit=+1)
-                + sym.D(kernel, inv_sqrt_w_sigma, qbx_forced_limit="avg")
+                #sym.D(kernel, inv_sqrt_w_sigma, qbx_forced_limit="avg")
                 ))
 
     # }}}
@@ -130,9 +130,20 @@ def main(mesh_name="torus", visualize=False):
 
     bvp_rhs = bind(places, sqrt_w*sym.var("bc"))(actx, bc=bc)
 
+    spop = bound_op.scipy_op(actx, "sigma", dtype=np.float64)
+
+    from meshmode.dof_array import flat_norm
+    for i in range(10):
+        res = spop.matvec(bvp_rhs)
+        print(i, actx.np.linalg.norm(bvp_rhs), actx.np.linalg.norm(res))
+    1/0
+
+
+
+
     from pytential.solve import gmres
     gmres_result = gmres(
-            bound_op.scipy_op(actx, "sigma", dtype=np.float64),
+            spop,
             bvp_rhs, tol=1e-14, progress=True,
             stall_iterations=0,
             hard_failure=True)
diff --git a/pytential/qbx/fmm.py b/pytential/qbx/fmm.py
index a58524b..63967a8 100644
--- a/pytential/qbx/fmm.py
+++ b/pytential/qbx/fmm.py
@@ -284,6 +284,7 @@ non_qbx_box_target_lists`),
 
             events.append(evt)
             wait_for = [evt]
             assert qbx_expansions_res is qbx_expansions
+            print(isrc_level, np.sum(qbx_expansions.get()))
 
             qbx_expansions.add_event(evt)
This, along with the changes above restricting qbx_expansions to just the M2QBXL contribution, gives:
0 0.0
1 0.0
2 0.0
3 250517.75028195637
4 2515898.6674050163
0 15.543459679529555 12101.463403771833
0 0.0
1 0.0
2 0.0
3 250517.75028195637
4 2492680.9811040373
1 15.543459679529555 12073.09081650059
0 0.0
1 0.0
2 0.0
3 250517.75028195637
4 2515898.6674050163
2 15.543459679529555 12183.17744099083
0 0.0
1 0.0
2 0.0
3 250517.75028195637
4 2492680.9811040373
3 15.543459679529555 12073.09081650059
Two observations:

- qbx_expansions appears stable at level 3; level 4 "wobbles".
- Even for identical qbx_expansions, the norm of the result vector can still have additional "wobbles".

Found something! Turns out the generated code for m2qbxl contains this gem:
for (int icenter_inner = 0; icenter_inner <= (-16 + ncenters + -16 * gid(0) >= 0 ? 15 : -1 + ncenters + -16 * gid(0)); ++icenter_inner)
{
if (icontaining_tgt_box != -1)
{
acc_isrc_box = 0.0;
acc_isrc_box_12 = 0.0;
acc_isrc_box_4 = 0.0;
acc_isrc_box_0 = 0.0;
}
icontaining_tgt_box = qbx_center_to_target_box_source_level[16 * gid(0) + icenter_inner];
if (icontaining_tgt_box != -1)
{
acc_isrc_box_3 = 0.0;
acc_isrc_box_9 = 0.0;
acc_isrc_box_21 = 0.0;
isrc_stop = src_box_starts[icontaining_tgt_box + 1];
tgt_rscale = qbx_expansion_radii[16 * gid(0) + icenter_inner];
isrc_start = src_box_starts[icontaining_tgt_box];
acc_isrc_box_19 = 0.0;
Observe how icontaining_tgt_box is used in the first conditional before it has even been set :facepalm:.
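To spell out the failure mode, here is a tiny Python rendering of the loop structure above (purely illustrative, not generated code):

def miscompiled_loop(qbx_center_to_target_box_source_level):
    icontaining_tgt_box = -12345  # stand-in for uninitialized C storage
    for icenter_inner in range(16):
        if icontaining_tgt_box != -1:  # BUG: stale/uninitialized read
            acc_isrc_box = 0.0  # accumulator reset guarded by a stale value
        icontaining_tgt_box = \
            qbx_center_to_target_box_source_level[icenter_inner]
        if icontaining_tgt_box != -1:
            pass  # rest of the body correctly sees the fresh value

On the first trip through the loop, the accumulator reset is guarded by a value that was never assigned; on later trips, by the previous iteration's value.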
It's almost hard to understand how this code worked for anyone, anywhere... In addition to fixing this code generation, we also need to make it harder to use loopy in a way that ends up with broken code like this.
https://github.com/inducer/loopy/pull/235 should address this.
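For illustration, this is the kind of thing explicit instruction IDs and dependencies in loopy are for. A minimal, made-up sketch (not the actual m2qbxl kernel) in which the {id=fetch}/{dep=fetch} annotations keep the scheduler from moving the use ahead of the fetch:

import numpy as np
import loopy as lp

knl = lp.make_kernel(
    "{[i]: 0 <= i < n}",
    """
    <> tgt_box = box_lookup[i]    {id=fetch}
    out[i] = data[tgt_box] + 1    {dep=fetch}
    """,
    [
        lp.GlobalArg("box_lookup", np.int32, shape=("n",)),
        lp.GlobalArg("data", np.float64, shape=("nboxes",)),
        lp.GlobalArg("out", np.float64, shape=("n",)),
        lp.ValueArg("n", np.int32),
        lp.ValueArg("nboxes", np.int32),
    ])

# Inspect the generated OpenCL to confirm the ordering.
print(lp.generate_code_v2(knl).device_code())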
I've only observed this on GitHub, not GitLab. This may indicate that it's less likely to happen when more cores are available, i.e. possibly a synchronization issue.
Here's a run that exhibits this: https://github.com/inducer/pytential/runs/1832875842?check_suite_focus=true
As far as I remember, this has been happening for a while. I think I have an email from Jan 16 where this occurred, potentially even earlier. Delinquently, I've often just rerun the CI job, which would typically resolve the failure, until it reoccurred on a subsequent run.
The recent failures under discussion in #56 are distinct IMO.