uw-ipd / tmol


Intermittent failure of derivsum operations. #90

Closed asford closed 6 years ago

asford commented 6 years ago

Observed a test failure in the test suite for the gpu refold kernel introduced in #75, in a build run for #88: https://buildkite.com/uw-ipd/tmol/builds/364#24cf7a7c-fb05-44a9-8567-9b87734c67aa.

This appears to be intermittent but can be reproduced on mako via:

watch -e '(set -o pipefail && (pytest -v tmol/tests/kinematics/test_gpu_operations.py | tee testlog.txt))'
tmol/tests/kinematics/test_gpu_operations.py: test_parallel_and_iterative_derivsum
Message
AssertionError: 
Arrays are not almost equal to 6 decimals
(mismatch 20.553520553520556%)
x: array([ -3.463094, -54.177893, -23.248532, ...,   0.41905 ,   0.229492,
-0.928028])
y: array([-34.565543, -16.692453, -22.334696, ...,   0.41905 ,   0.229492,
-0.928028])

Details
benchmark = <pytest_benchmark.fixture.BenchmarkFixture object at 0x7fff76ee1d30>
target_system = PackedResidueSystem(block_size=8, residues=[Residue(residue_type=ResidueType(name='MET', name3='MET', atoms=(Atom(name...from_residue_index', '<i8'), ('from_connection_name', 'O'), ('to_residue_index', '<i8'), ('to_connection_name', 'O')]))
target_kintree = KinTree(id=tensor([   -1,     0,     1,  ...,  1454,  1455,  1467]), doftype=tensor([ 0,  1,  2,  ...,  2,  2,  2], dt...or([    0,     1,     1,  ...,  1220,  1220,  1221]), frame_z=tensor([    0,     3,     3,  ...,  1215,  1215,  1217]))
target_device = 'cuda'
@pytest.mark.benchmark(group="kinematic_op_micro")
def test_parallel_and_iterative_derivsum(
        benchmark, target_system, target_kintree, target_device
):
    coords = torch.tensor(target_system.coords[target_kintree.id]
                          ).to(device=target_device)
    kintree = target_kintree.to(device=target_device)

    torch.manual_seed(1663)
    dsc_dx = (torch.rand_like(coords) * 2) - 1

    f1s = torch.cross(coords, coords - dsc_dx)
    f2s = dsc_dx.clone()  # clone input buffer before aggregation

    f1f2s = torch.cat((f1s, f2s), 1)

    ### deriv summation should be equivalent in both iterative and parallel modes

    iterative_f1f2_sums = cpu_operations.iterative_f1f2_summation(
        f1f2s.cpu(), kintree.parent, inplace=False
    )

    # Load and cache ordering for benchmark
    derivsum_ordering = GPUKinTreeReordering.for_kintree(
        kintree
    ).derivsum_ordering

    @benchmark
    def parallel_f1f2_sums():
        result = derivsum_ordering.segscan_f1f2s(f1f2s, inplace=False)
        numba.cuda.synchronize()
        return result

    numpy.testing.assert_array_almost_equal(
        iterative_f1f2_sums, parallel_f1f2_sums
    )

    ### inplace operations should produce same result as non-inplace
    iterative_f1f2_sums_inplace = f1f2s.clone().cpu()
    cpu_operations.iterative_f1f2_summation(
        iterative_f1f2_sums_inplace, kintree.parent, inplace=True
    )
    numpy.testing.assert_array_almost_equal(
        iterative_f1f2_sums, iterative_f1f2_sums_inplace
    )

    parallel_f1f2_sums_inplace = f1f2s.clone()
    (
        GPUKinTreeReordering.for_kintree(kintree)
        .derivsum_ordering.segscan_f1f2s(
            parallel_f1f2_sums_inplace, inplace=True
        )
    )
    numpy.testing.assert_array_almost_equal(
        parallel_f1f2_sums,
>       parallel_f1f2_sums_inplace,
    )
E       AssertionError:
E       Arrays are not almost equal to 6 decimals
E

E       (mismatch 20.553520553520556%)
E        x: array([ -3.463094, -54.177893, -23.248532, ...,   0.41905 ,   0.229492,
E               -0.928028])
E        y: array([-34.565543, -16.692453, -22.334696, ...,   0.41905 ,   0.229492,
E               -0.928028])

tmol/tests/kinematics/test_gpu_operations.py:225: AssertionError
asford commented 6 years ago

@aleaverfay Any thoughts? This seems like a stream/sync issue between torch and numba. Did you look into the stream allocation logic at all?

aleaverfay commented 6 years ago

Interesting. Why exactly do you believe this is a stream/sync issue?

aleaverfay commented 6 years ago

I can't reproduce it here.

A tentative solution would be to ask torch for the current stream, and then synchronize it before launching the numba kernel. If you don't provide a stream to numba when launching a kernel, then its kernel executions are synchronous.

        if not inplace:
            f1f2s_kintree_ordering = f1f2s_kintree_ordering.clone()

        natoms = len(f1f2s_kintree_ordering)
        assert natoms == len(self.dsi2ki)

        torch.cuda.current_stream().synchronize() # *** new code here ***
        derivsum_jit.F1F2Scan.segscan_by_generation(
            64,
            as_cuda_array(f1f2s_kintree_ordering),
            self.dsi2ki_d,
            self.is_leaf_d,
            self.nonpath_children_d,
            self.atom_range_for_depth_d,
        )

What would be better would be to create a numba stream from the torch stream, and then launch the refold operation on the torch stream. This would help us hide numba's high kernel-launch overhead by launching the numba kernel before the preceding torch operations had completed. There's certainly no documented way to do that. I'll poke around the pytorch source and see if anything jumps out.
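
For reference, more recent numba releases expose numba.cuda.external_stream, which would let you wrap torch's raw stream pointer directly. A minimal sketch, assuming a numba version new enough to provide it (it may not exist in the version pinned here):

import numba.cuda
import torch

# Assumption: numba.cuda.external_stream() is available (added in later
# numba releases); it wraps a raw cudaStream_t pointer in a numba stream.
torch_stream = torch.cuda.current_stream()
numba_stream = numba.cuda.external_stream(torch_stream.cuda_stream)

# A numba kernel launched on this stream is then ordered after any torch
# work already queued on the same stream, e.g.
#   some_kernel[blocks, threads_per_block, numba_stream](args)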

aleaverfay commented 6 years ago

Here's the way the current_stream() is implemented. I don't know anything about the datatype returned by torch._C._cuda_getCurrentStream(). It would be nice if there were some way to create a numba stream from it. This gist has the ctor for the numba NDArray taking a torch stream, though.

aleaverfay commented 6 years ago

Interestingly, the code runs if you just pass torch's current_stream into the numba kernel launch. Can you give it a try on your machine and see if the intermittent failure goes away?

diff --git a/tmol/kinematics/gpu_operations/scan_jit.py b/tmol/kinematics/gpu_operations/scan_jit.py
index 6fcf154..b1ddde5 100644
--- a/tmol/kinematics/gpu_operations/scan_jit.py
+++ b/tmol/kinematics/gpu_operations/scan_jit.py
@@ -1,7 +1,7 @@
 import math
 import numba
 import numba.cuda as cuda
-
+import torch

 class GenerationalSegmentedScan:
     """Factory class for cuda-based generational segmented scan operations.
@@ -83,7 +83,8 @@ class GenerationalSegmentedScan:
             non_path_inputs,  #[n, max_num_inputs]
             generation_ranges,  #[g, 2]
     ):
-        cls.get_kernel(threads_per_block)[1, threads_per_block](
+        cls.get_kernel(threads_per_block)[1, threads_per_block,
+                                          torch.cuda.current_stream().cuda_stream](
             src_vals,
             scan_to_src_ordering,
             is_path_root,
asford commented 6 years ago

Unfortunately, I don't think this is a stream ordering issue. If you look through the nvprof trace of the repro it appears that everything is launching on the same stream (7).

I've just about worked out a minimal repro as a test. It appears that the result of the derivsum pass is inconsistent over repeated invocations. I'll push the test with evidence for this shortly.
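
Roughly, the check I have in mind looks like the sketch below, reusing the names from the failing test above (GPUKinTreeReordering, segscan_f1f2s, kintree, f1f2s); the loop structure is illustrative, not the committed test:

import numpy
import numba.cuda

# Hypothetical repeat-invocation check: the same segscan over identical
# inputs should produce identical results every time. All tmol names here
# are taken from the test shown in the traceback above.
ordering = GPUKinTreeReordering.for_kintree(kintree).derivsum_ordering

expected = ordering.segscan_f1f2s(f1f2s, inplace=False)
numba.cuda.synchronize()
expected = expected.cpu().numpy()

for _ in range(10):
    repeat = ordering.segscan_f1f2s(f1f2s, inplace=False)
    numba.cuda.synchronize()
    numpy.testing.assert_array_almost_equal(expected, repeat.cpu().numpy())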

asford commented 6 years ago

I've narrowed down the issue a bit more in the two commits above.

The derivsum scan appears to return inconsistent results in an input-data- and warp-count-dependent manner. The issue only occurs when multiple warps are present (i.e. >32 threads per block), and whether it occurs depends on the input data.

This test failure may be a result of my misunderstanding of the f1f2 derivative calculation. The error occurs when the dsc_dx values for summation are randomly initialized over the range [-1, 1). It does not occur when the values are initialized to ones or to a "modded arange" over [0, 100):

https://github.com/uw-ipd/tmol/blob/d529563bafbb789c1c0ed1d157b668231d44073c/tmol/tests/kinematics/test_gpu_operations.py#L262-L270
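
Roughly, the three input regimes look like this (a sketch; the linked test above is authoritative, and the helper name make_dsc_dx here is made up):

import torch

# Sketch of the three dsc_dx regimes described above; make_dsc_dx and the
# exact constructions are illustrative, the linked test is the real source.
def make_dsc_dx(coords: torch.Tensor, mode: str) -> torch.Tensor:
    if mode == "random":
        # uniform over [-1, 1); the case that triggers the failure
        return (torch.rand_like(coords) * 2) - 1
    elif mode == "ones":
        return torch.ones_like(coords)
    elif mode == "arange_mod":
        # "modded arange" over [0, 100); does not trigger the failure
        flat = torch.arange(
            coords.numel(), dtype=coords.dtype, device=coords.device
        )
        return (flat % 100).reshape(coords.shape)
    raise ValueError(mode)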

The failures now reproduce consistently in the buildkite run: https://buildkite.com/uw-ipd/tmol/builds/371#af3694a4-4f4d-452e-8f77-4e626c46cb63/291-487

asford commented 6 years ago

@aleaverfay Any comments on the validity of testing with wacky derivatives of that type? This seems like it could be linked to warp ordering in the scan?

fdimaio commented 6 years ago

> @aleaverfay Any comments on the validity of testing with wacky derivatives of that type? This seems like it could be linked to warp ordering in the scan?

dsc_dx can take any value so it should be well behaved there.

aleaverfay commented 6 years ago

Can you give me access to the machine where you're able to see this bug? I can't reproduce it here.

asford commented 6 years ago

This is running on mako, which is in the digs. Do you have access?

Are those new tests all passing in your env? Can you run .buildkite/bin/testing on your box and post the full log here? I'm really curious why this could be environment-specific.


aleaverfay commented 6 years ago

d529563bafbb789c1c0ed1d157b668231d44073c.vcl.testing.log

asford commented 6 years ago

d529563bafbb789c1c0ed1d157b668231d44073c.psgcluster.testing.log

Seeing no failures on the V100 at psgcluster:

tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[32-random] PASSED [  8%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[32-arange_mod] PASSED [  9%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[32-ones] PASSED [  9%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[64-random] PASSED [ 10%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[64-arange_mod] PASSED [ 10%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[64-ones] PASSED [ 11%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[256-random] PASSED [ 11%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[256-arange_mod] PASSED [ 12%]
tmol/tests/kinematics/test_gpu_operations.py::test_derivsum_consistency[256-ones] PASSED [ 13%]
tmol/tests/kinematics/test_gpu_operations.py::test_refold_consistency[32] PASSED [ 13%]
tmol/tests/kinematics/test_gpu_operations.py::test_refold_consistency[64] PASSED [ 14%]
tmol/tests/kinematics/test_gpu_operations.py::test_refold_consistency[256] PASSED [ 14%]

--- nvidia-smi
Wed Jul 18 13:59:08 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   32C    P0    24W / 250W |      4MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   32C    P0    25W / 250W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:84:00.0 Off |                    0 |
| N/A   29C    P0    24W / 250W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   31C    P0    24W / 250W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
asford commented 6 years ago

I'm pretty much convinced that this issue is, somehow, linked to the 1050 hardware that we're using in the current dev box. I've rerun on a panel of different cards and haven't been able to reproduce the failure.

@aleaverfay I think we can probably let this one rest for the time being. We'll plan on moving the CI server off the 1050s ASAP. We are currently looking into pulling a few 1080 Tis out of the digs queue to dedicate to this as a dev/test box.

Tested platforms:

hsw_v100_16gb
hsw_p100
ivb_k20
ivb_k40
ivb_k80
ivb_m40
ivb_m60
aleaverfay commented 6 years ago

(Reposted from Slack) @asford OK -- so I have 100% narrowed down the problem to a particular write/read to shared memory, and if I rearrange how that is done or add synchronization surrounding that write/read, I fix it. The test case is working. What I don't have, however, is an understanding as to why the synchronization is necessary. Here's the git diff:

diff --git a/tmol/kinematics/gpu_operations/scan_jit.py b/tmol/kinematics/gpu_operations/scan_jit.py
index 6fcf154..caeecfa 100644
--- a/tmol/kinematics/gpu_operations/scan_jit.py
+++ b/tmol/kinematics/gpu_operations/scan_jit.py
@@ -163,7 +163,9 @@ class GenerationalSegmentedScan:

                         ### Read node values from global into shared
                         my_val = load(src_vals, ii_src)
-                        shared_is_root[pos] = is_path_root[ii_ind]
+                        ### version 1: my_root = is_path_root[ii_ind]
+                        ### version 1: shared_is_root[pos] = my_root
+                        shared_is_root[pos] = is_path_root[ii_ind] #broken

                         ### Sum incoming scan value from parent into node
                         # parent only set if node is root of scan
@@ -177,8 +179,12 @@ class GenerationalSegmentedScan:
                                     ), my_val
                                 )

+                    ### version 2: cuda.syncthreads()
+                    ### version 2: if ii_ind < end:
+
                         ### Sum carry value from previous block if node 0 is non-root.
-                        my_root = shared_is_root[pos]
+                        my_root = shared_is_root[pos] #broken
+                        ## version 2: my_root = shared_is_root[pos]
                         if pos == 0 and not my_root:
                             my_val = add(carry_val, my_val)
                             my_root |= carry_is_root

If you comment out the two lines marked broken and then uncomment either version 1 or version 2, you will get working code.

So here's the thing I don't get, and that is worrisome: it doesn't feel like writing to shared_is_root[pos] and then later reading from shared_is_root[pos] without a synchronization event in between should be an issue. The only thing that is a tiny bit weird is that the code (the next line after the git diff) writes to shared_is_root[0] instead of shared_is_root[pos], but that should not be problematic because the if statement has just assured us that pos == 0.

Furthermore (and this is getting deep into the weeds), I am 99% certain that the carry_is_root variable is unnecessary and that the final write to shared_is_root[0] is also unnecessary -- that these values cannot have an impact on the behavior of the scan. I added them because it felt like I might need to carry these results forward, but I'm pretty sure I can axe them (this is something I can test). ... (then I went and tested it) ... Right: neither carry_is_root nor the second write to shared_is_root[0] is necessary. After thread 0 has accumulated the carry_val, it is done, since pos >= offset is false for all values of offset when pos == 0. But removing the second write to shared_is_root[0] still does not fix the synchronization bug.

I'm looking around for advice such as "don't have one thread read from shared memory while another thread is writing to shared memory, even if they're reading/writing to different positions" -- advice I've never encountered when learning about shared memory.
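
For reference, the standard rule is: whenever a thread reads a shared-memory slot written by a different thread (in particular one in another warp of the same block), the block needs cuda.syncthreads() between the write and the read. A minimal, self-contained sketch of that rule in numba -- a plain Hillis-Steele block scan, not the tmol generational scan:

import numpy as np
from numba import cuda, float64

TPB = 128  # threads per block; shared array extent must be a compile-time constant

@cuda.jit
def block_inclusive_scan(vals):
    # Illustration of the shared-memory rule only, not the tmol kernel.
    shared = cuda.shared.array(TPB, dtype=float64)
    pos = cuda.threadIdx.x

    shared[pos] = vals[pos]
    cuda.syncthreads()      # publish every thread's write to the whole block

    offset = 1
    while offset < TPB:
        if pos >= offset:
            other = shared[pos - offset]   # slot written by a different thread
        else:
            other = 0.0
        cuda.syncthreads()  # all reads finish before anyone overwrites a slot
        if pos >= offset:
            shared[pos] += other
        cuda.syncthreads()  # all writes land before the next round reads
        offset *= 2

    vals[pos] = shared[pos]

x = cuda.to_device(np.ones(TPB))
block_inclusive_scan[1, TPB](x)
print(x.copy_to_host())  # [1., 2., 3., ..., 128.]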

aleaverfay commented 6 years ago

PS: 1) This feels like a compiler bug or, as you seem to be suspecting, a hardware bug.

2) Perhaps also worth saying: version 1 above minimizes the number of shared memory transactions over the lines I've commented with #broken.

asford commented 6 years ago

:thinking: I was able to most easily grok the changes in (version 1), but I'll definitely defer to whatever makes the most sense to you.

Should we take the changes currently in the "investigate_90" branch, add the v1 fix and open a PR?

aleaverfay commented 6 years ago

I like version 1 even if it doesn't fix a bug. I think loading data into local memory and then into shared memory makes a bit more sense. (Version 2 as I described it was incomplete/incorrect, fwiw -- you need to keep the first of the two lines marked "broken", loading shared memory with the contents of global memory.)

aleaverfay commented 6 years ago

Fixed in #91