@aleaverfay Any thoughts? This seems like a stream/sync issue between torch and numba. Did you look into the stream allocation logic at all?
Interesting. Why exactly do you believe this is a stream/sync issue?
I can't reproduce it here.
A tentative solution would be to ask torch for the current stream, and then synchronize it before launching the numba kernel. If you don't provide a stream to numba when launching a kernel, then its kernel executions are synchronous.
if not inplace:
    f1f2s_kintree_ordering = f1f2s_kintree_ordering.clone()
natoms = len(f1f2s_kintree_ordering)
assert natoms == len(self.dsi2ki)
torch.cuda.current_stream().synchronize()  # *** new code here ***
derivsum_jit.F1F2Scan.segscan_by_generation(
    64,
    as_cuda_array(f1f2s_kintree_ordering),
    self.dsi2ki_d,
    self.is_leaf_d,
    self.nonpath_children_d,
    self.atom_range_for_depth_d,
)
What would be better would be to create a numba stream from the Torch stream, and then launch the refold operation in the torch stream. This would help us hide numba's high kernel-launch overhead by launching the numba kernel before the preceding torch operations had completed. There's certainly no documented way to do that. I'll poke around the pytorch source and see if anything jumps out.
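For the record, a rough sketch of what that wrapping could look like, assuming a numba API for adopting an externally created stream handle is available (numba.cuda.external_stream exists in newer numba releases; whether our pinned numba has it is an open question):

import numba.cuda
import torch

# Sketch only: wrap torch's current CUDA stream handle as a numba stream, so
# numba kernel launches queue behind pending torch work on that stream.
torch_stream = torch.cuda.current_stream()
nb_stream = numba.cuda.external_stream(torch_stream.cuda_stream)

# A kernel would then be launched on that stream via the launch configuration,
# e.g. (some_kernel is a hypothetical placeholder):
#   some_kernel[blocks, threads_per_block, nb_stream](...)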
Interestingly, the code runs if you just pass torch's current_stream into the numba kernel launch. Can you give it a try on your machine and see if the intermittent failure goes away?
diff --git a/tmol/kinematics/gpu_operations/scan_jit.py b/tmol/kinematics/gpu_operations/scan_jit.py
index 6fcf154..b1ddde5 100644
--- a/tmol/kinematics/gpu_operations/scan_jit.py
+++ b/tmol/kinematics/gpu_operations/scan_jit.py
@@ -1,7 +1,7 @@
 import math
 import numba
 import numba.cuda as cuda
-
+import torch
 
 class GenerationalSegmentedScan:
     """Factory class for cuda-based generational segmented scan operations.
@@ -83,7 +83,8 @@ class GenerationalSegmentedScan:
         non_path_inputs, #[n, max_num_inputs]
         generation_ranges, #[g, 2]
     ):
-        cls.get_kernel(threads_per_block)[1, threads_per_block](
+        cls.get_kernel(threads_per_block)[1, threads_per_block,
+                                          torch.cuda.current_stream().cuda_stream](
             src_vals,
             scan_to_src_ordering,
             is_path_root,
Unfortunately, I don't think this is a stream ordering issue. If you look through the nvprof trace of the repro it appears that everything is launching on the same stream (7).
I've just about worked out a minimal repro as a test. It appears that the result of the derivsum pass is inconsistent over repeated invocations. I'll push the test with evidence for this shortly.
I've narrowed down the issue a bit more in the two commits above.
The derivsum scan appears to return inconsistent results in an input-data and warp-count dependent manner: the issue only occurs when multiple warps are present (i.e. >32 threads per block), and only for some inputs.
This test failure may be a result of my misunderstanding of the f1f2 derivative calculation. The error occurs when the dsc_dx values for summation are randomly initialized over the range [-1, 1); it does not occur when values are initialized to ones or to a "modded arange" over [0, 100).
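For reference, the three dsc_dx initialization schemes exercised here (matching the random / arange_mod / ones test ids below) look roughly like the sketch that follows; the function name, shapes, and device handling are illustrative assumptions, not the actual test fixture.

import torch

def make_dsc_dx(scheme, natoms, device="cuda"):
    # Illustrative sketch of the three initialization schemes; not the real
    # test fixture.
    if scheme == "random":
        # uniform over [-1, 1) -- the case that triggers the inconsistency
        return torch.rand((natoms, 3), device=device) * 2.0 - 1.0
    elif scheme == "arange_mod":
        # "modded arange" over [0, 100) -- does not trigger it
        vals = torch.arange(natoms * 3, device=device, dtype=torch.float32)
        return (vals % 100.0).reshape(natoms, 3)
    elif scheme == "ones":
        return torch.ones((natoms, 3), device=device)
    raise ValueError(scheme)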
The failures now reproduce consistently in the buildkite run: https://buildkite.com/uw-ipd/tmol/builds/371#af3694a4-4f4d-452e-8f77-4e626c46cb63/291-487
@aleaverfay Any comments on the validity of testing with wacky derivatives of that type? This seems like it could be linked to warp ordering in the scan?
dsc_dx can take any value, so it should be well behaved there.
Can you give me access to the machine where you're able to see this bug? I can't reproduce it here.
This is running on mako, which is in the digs. Do you have access?
Those new tests are all passing in your env? Can you run .buildkite/bin/testing on your box and post the full log here? I'm really curious about why this could be environment specific.
d529563bafbb789c1c0ed1d157b668231d44073c.psgcluster.testing.log
Seeing no failures on the V100 at psgcluster:
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[32-random] PASSED [ 8%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[32-arange_mod] PASSED [ 9%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[32-ones] PASSED [ 9%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[64-random] PASSED [ 10%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[64-arange_mod] PASSED [ 10%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[64-ones] PASSED [ 11%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[256-random] PASSED [ 11%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[256-arange_mod] PASSED [ 12%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[256-ones] PASSED [ 13%]
tmol/tests/kinematics/test_gpu_operations.py::test_refold_consistency[32] PASSED [ 13%]
tmol/tests/kinematics/test_gpu_operations.py::test_refold_consistency[64] PASSED [ 14%]
tmol/tests/kinematics/test_gpu_operations.py::test_refold_consistency[256] PASSED [ 14%]
--- nvidia-smi
Wed Jul 18 13:59:08 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26 Driver Version: 396.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:05:00.0 Off | 0 |
| N/A 32C P0 24W / 250W | 4MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... On | 00000000:06:00.0 Off | 0 |
| N/A 32C P0 25W / 250W | 0MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-PCIE... On | 00000000:84:00.0 Off | 0 |
| N/A 29C P0 24W / 250W | 0MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-PCIE... On | 00000000:85:00.0 Off | 0 |
| N/A 31C P0 24W / 250W | 0MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
I'm pretty much convinced that this issue is, somehow, linked to the 1050 hardware that we're using in the current dev box. I've rerun on a panel of different cards and haven't been able to reproduce the failure.
@aleaverfay I think we can probably let this one rest for the time being. We'll plan on moving the CI server off the 1050s ASAP. We are currently looking into pulling a few 1080 Tis out of the digs queue to dedicate to this as a dev/test box.
Tested platforms:
hsw_v100_16gb
hsw_p100
ivb_k20
ivb_k40
ivb_k80
ivb_m40
ivb_m60
(Reposted from slack) @asford OK -- so I have 100% narrowed down the problem to a particular write/read to shared memory, and if I rearrange how that is done or add synchronization surrounding that write/read, I fix it. The test case is working. What I don't have, however, is an understanding as to why the synchronization is necessary. Here's the git diff:
diff --git a/tmol/kinematics/gpu_operations/scan_jit.py b/tmol/kinematics/gpu_operations/scan_jit.py
index 6fcf154..caeecfa 100644
--- a/tmol/kinematics/gpu_operations/scan_jit.py
+++ b/tmol/kinematics/gpu_operations/scan_jit.py
@@ -163,7 +163,9 @@ class GenerationalSegmentedScan:
 
             ### Read node values from global into shared
             my_val = load(src_vals, ii_src)
-            shared_is_root[pos] = is_path_root[ii_ind]
+            ### version 1: my_root = is_path_root[ii_ind]
+            ### version 1: shared_is_root[pos] = my_root
+            shared_is_root[pos] = is_path_root[ii_ind] #broken
 
             ### Sum incoming scan value from parent into node
             # parent only set if node is root of scan
@@ -177,8 +179,12 @@ class GenerationalSegmentedScan:
                 ), my_val
             )
 
+            ### version 2: cuda.syncthreads()
+            ### version 2: if ii_ind < end:
+
             ### Sum carry value from previous block if node 0 is non-root.
-            my_root = shared_is_root[pos]
+            my_root = shared_is_root[pos] #broken
+            ## version 2: my_root = shared_is_root[pos]
             if pos == 0 and not my_root:
                 my_val = add(carry_val, my_val)
                 my_root |= carry_is_root
If you comment out the two lines marked broken, and then uncomment either version 1 or version 2, you will get working code.
So here's the thing I don't get, and that is worrisome: it doesn't feel like writing to shared_is_root[pos] and then later reading from shared_is_root[pos] without a synchronization event in between should be an issue.
The only thing that is a tiny bit weird is that the code (the next line after the git diff) writes to shared_is_root[0] instead of shared_is_root[pos], but that should not be problematic because the if statement has just assured us that pos == 0.
Furthermore (and this is getting deep into the weeds), I am 99% certain that the carry_is_root variable is unnecessary and that the final write to shared_is_root[0] is also unnecessary -- that these values cannot have an impact on the behavior of the scan. I added them because it felt like I might need to carry these results forward, but I'm pretty sure I can axe them (this is something I can test).
... (then I went and tested it)...
Right -- neither carry_is_root nor the second write to shared_is_root[0] is necessary. After thread 0 has accumulated the carry_val, it is done, since pos >= offset is false for all values of offset when pos == 0.
But removing the second write to shared_is_root[0] still does not fix the synchronization bug.
I'm looking around for advice such as "don't have one thread read from shared memory while another thread is writing to shared memory, even if they're reading/writing to different positions" -- advice I've never encountered when learning about shared memory.
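For what it's worth, the rule I have seen documented is about cross-thread visibility: if one thread writes a shared slot and a different thread later reads it, you need a cuda.syncthreads() between the write and the read, even when the indices differ. A minimal, self-contained illustration of that rule (not our kernel; the 64-thread block and the neighbor-shift operation are just for the example):

import numpy as np
from numba import cuda, float32

@cuda.jit
def neighbor_shift(src, dst):
    # Each thread publishes its value to shared memory, then reads the value
    # written by the previous thread. The syncthreads() between the write and
    # the cross-thread read is what makes the read well defined.
    pos = cuda.threadIdx.x
    shared = cuda.shared.array(64, dtype=float32)
    shared[pos] = src[pos]
    cuda.syncthreads()  # required: another thread reads shared[pos]
    if pos > 0:
        dst[pos] = shared[pos - 1]
    else:
        dst[pos] = shared[0]

src = np.arange(64, dtype=np.float32)
dst = np.zeros_like(src)
neighbor_shift[1, 64](src, dst)

That rule does not obviously explain the failure above, since the broken read is by the same thread at the same index as the write.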
PS: 1) This feels like a compiler bug or, as you seem to be suspecting, a hardware bug.
2) Perhaps also worth saying: version 1 above minimizes the number of shared memory transactions over the lines I've commented with #broken.
:thinking: I was able to most easily grok the changes in (version 1), but I'll definitely defer to whatever makes the most sense to you.
Should we take the changes currently in the "investigate_90" branch, add the v1 fix and open a PR?
I like version 1 even if it doesn't fix a bug. I think loading data into local memory and then into shared memory makes a bit more sense. (Version 2 as I described it was incomplete/incorrect, fwiw -- you need to keep the first of the two lines marked "broken", loading shared memory with the contents of global memory.)
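A minimal standalone sketch of what that version 1 staging pattern looks like in isolation (illustration only, not the committed kernel; the uint8 flags and 64-thread block size are assumptions):

import numpy as np
from numba import cuda, uint8

@cuda.jit
def stage_then_publish(is_path_root, out):
    # Version 1 style: read the global flag into a register once, publish the
    # register to shared memory for other threads, and keep using the register
    # locally instead of re-reading the shared slot.
    pos = cuda.threadIdx.x
    shared_is_root = cuda.shared.array(64, dtype=uint8)
    my_root = is_path_root[pos]    # global -> register
    shared_is_root[pos] = my_root  # register -> shared
    cuda.syncthreads()
    # ... the scan body would use my_root / shared_is_root here ...
    out[pos] = my_root

roots = np.zeros(64, dtype=np.uint8)
result = np.zeros(64, dtype=np.uint8)
stage_then_publish[1, 64](roots, result)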
Fixed in #91
Observed a test failure in the test suite for the gpu refold kernel introduced in #75, in a build run for #88: https://buildkite.com/uw-ipd/tmol/builds/364#24cf7a7c-fb05-44a9-8567-9b87734c67aa.
This appears to be intermittent but can be reproduced on mako via: