nv-legate / cunumeric

An Aspiring Drop-In Replacement for NumPy at Scale
https://docs.nvidia.com/cunumeric/24.06/
Apache License 2.0
620 stars 70 forks

linspace silently failed when using GPU without any runtime error #17

Open piyueh opened 3 years ago

piyueh commented 3 years ago

Problem

When running on a GPU, numpy.linspace does not work correctly: it returns an all-zero array without raising any runtime errors or warnings.

To reproduce

Option 1: using legate.numpy's own tests

  1. Go to legate.numpy's source code folder.
  2. Run ./test.py --use cuda --gpus 1

Option 2:

  1. Create test.py with the following contents:
    from legate import numpy
    a = numpy.linspace(0.0, 4.0, 501)
    print(a.mean(), a.sum())
    print(a)
  2. Run the script with legate --gpus 1 ./test.py -lg:numpy:test

Results

When using legate.numpy's own test suite, the results show that the linspace test failed.

When using the test script in option 2, the result shows an all-zero array.

Expected results

I noticed that linspace is not listed in the Legate NumPy API reference, so I assume it belongs to the group of functions that are not yet implemented. In that case, the function should raise NotImplementedError instead of silently returning an all-zero array.
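
A minimal sketch of the expected failure mode (illustrative only; legate.numpy's actual fallback/dispatch logic may differ):

from legate import numpy

try:
    a = numpy.linspace(0.0, 4.0, 501)
except NotImplementedError:
    # the expected loud failure for an unimplemented function
    print("linspace is not implemented for this target")
else:
    # current behavior: no error is raised and the all-zero result is printed
    print(a.mean(), a.sum())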

piyueh commented 3 years ago

Update:

Interestingly, the test.py from option 2 suddenly worked on my workstation (an old K40 GPU with CUDA 10.2) but still failed on a DGX-1 (V100 GPU with CUDA 11.2.1). Apart from the CUDA version and GPU architecture, legate.core and legate.numpy were built with the same flags on both machines. There are still some other differences (gcc version, etc.); I'm not sure whether they are relevant. If more information is needed, please let me know.

Nevertheless, legate.numpy's own test suite (option 1) still fails on both the K40 and the V100, and the linspace call in my real application code is also still failing on both GPUs.

lightsighter commented 3 years ago

@piyueh I've been unable to reproduce this issue. I tried CUDA 11.2 on a V100 and CUDA 11.1 on a GTX 1060. In both cases, repeated execution of both the test suite and your program above failed to reproduce the issue.

Does the issue recur deterministically? Can you report your driver versions for both GPUs?

lightsighter commented 3 years ago

@piyueh Can you also try to reproduce this on the A100 machine we gave you access to?

piyueh commented 3 years ago

I cannot reproduce the issue with my current setup on the A100 either (epyc_a100x4 partition; A100-PCIE-40GB; driver: 460.73.01; CUDA driver API: 11.2; CUDA runtime: 11.2.1 and 11.3 both tried). Also, all unit tests passed! I'm not sure what happened. I frequently removed the virtual environments and re-created new ones. I already lost the virtual environment used when creating this issue.

However, the test failed again on my K40, even though it once worked (see my previous comment). My current K40 environment: K40c; driver: 465.24.20; CUDA driver API: 11.3; CUDA runtime: 11.3. But I think this may be a minor issue, because the K40 no longer seems to be officially supported as of CUDA 10 or 11.

I haven't been able to test again on the particular V100 used for this issue; there's a queue on that machine, so I have to wait.

I'm not sure if this is relevant, but I checked the linked runtime libraries (with ldd librealm.so liblegion.so liblgcore.so liblgnumpy.so), and they all seem to link against the correct CUDA-related libraries (libcuda, libcudart, libcublas, libcublasLt).

I'll do some more tests and let you know the results.

This is my test code:

import time
import numpy as truenumpy
from legate import numpy
results = []
truenumpy.random.seed(int(time.time()))
for i in range(10000):
    N = truenumpy.random.randint(100, 500)
    bg = truenumpy.random.random()
    ed = truenumpy.random.random()
    a = numpy.linspace(bg, ed, N)
    # the sum of N evenly spaced points from bg to ed is N*(bg + ed)/2
    results.append(abs(a.sum()-(bg+ed)*N/2) < 1e-10)
print(all(results))

lightsighter commented 3 years ago

I tried running your test code above as well, and so far I've had no luck reproducing the issue on my side.

I frequently removed the virtual environments and re-created new ones.

Every time you do this I think you need to rebuild the Legate Core. We capture environment information about your Python installation when you build the core and we expect to be running Legate in the same environment. Can you try fresh builds of the core with your virtual environment(s)?

piyueh commented 3 years ago

Alright, I think I accidentally found the culprit! I reproduced the issue on the V100 when legate.core was built with --arch 80 only... I guess that's also what happened when I created this issue. Once I built legate.core with --arch 70,80, the test worked on the V100. That's my bad. The CUDA runtime version I used is 11.2.1. The V100 is using driver 450.119.04 and CUDA driver API 11.0. Is this the correct behavior?

Every time you do this I think you need to rebuild the Legate Core. We capture environment information about your Python installation when you build the core and we expect to be running Legate in the same environment.

I always build and install Legate to the prefix of virtual environments, so each environment has its own Legate. When removing an environment, the Legate for that environment is also gone. And when creating a new environment, I have to re-build and install legate.core and legate.numpy. Is what I'm doing safe?

lightsighter commented 3 years ago

Alright, I think I accidentally found the culprit! I reproduced the issue on the V100 when legate.core was built with --arch 80 only

I'm surprised that this does not raise a CUDA driver error. Somewhere we must have a CUDA call whose error code we are not checking: the driver is presumably returning a non-zero CUDA error for a kernel launch with the wrong architecture type, but we're not catching it. @streichler: with Realm's CUDA hijack off, does Realm check the error codes on all the cuCtxSynchronize calls that it does at the end of tasks (as well as any other CUDA calls)?
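
For illustration, here is a rough sketch of the kind of error check that would surface the failure (this is not the Realm/Legate code path; it assumes the cuda-python bindings, which return the CUresult as the first element of every call's result):

from cuda import cuda

def check(result):
    """Raise if a CUDA driver-API call did not return CUDA_SUCCESS."""
    err, *values = result
    if err != cuda.CUresult.CUDA_SUCCESS:
        _, name = cuda.cuGetErrorName(err)
        raise RuntimeError(f"CUDA driver error: {name.decode()}")
    return values

check(cuda.cuInit(0))
(device,) = check(cuda.cuDeviceGet(0))
(context,) = check(cuda.cuCtxCreate(0, device))

# A kernel built for the wrong SM architecture typically reports
# CUDA_ERROR_NO_BINARY_FOR_GPU at launch or on the next synchronize;
# dropping this return value is what makes the failure silent.
check(cuda.cuCtxSynchronize())
check(cuda.cuCtxDestroy(context))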

And when creating a new environment, I have to re-build and install legate.core and legate.numpy. Is what I'm doing safe?

I'm not sure, because I don't use virtual environments myself; it might be safe. I know for certain that you cannot switch which libpython.so you are pointing at and still have things work: librealm.so links against a specific Python version and configuration, and if that changes, nothing will work. If you change virtual environments but your Python interpreter stays the same, it should in theory still work. Go ahead and try it out and let us know whether switching virtual environments works.

streichler commented 3 years ago

with Realm's CUDA hijack off, does Realm check the error codes on all the cuCtxSynchronize calls that it does at the end of tasks (as well as any other CUDA calls)?

Yes.

https://github.com/StanfordLegion/legion/blob/master/runtime/realm/cuda/cuda_module.cc#L1370-L1380

lightsighter commented 3 years ago

I've confirmed that building Legate for the wrong GPU architecture leads to these silent errors. I've created a change to the Realm runtime that prevents such failures from being silent in the future (https://gitlab.com/StanfordLegion/legion/-/merge_requests/388). We'll leave this issue open until we get that merged into the legate_stable branch.

From now on though, build against the right GPU architecture and everything should be good. :)
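
As a quick sanity check before building, the GPU's compute capability can be queried to pick the right --arch value; here is a sketch using the cuda-python bindings (not part of Legate's build tooling):

from cuda import cuda

def check(result):
    # raise if a driver-API call did not return CUDA_SUCCESS
    err, *values = result
    assert err == cuda.CUresult.CUDA_SUCCESS, err
    return values

check(cuda.cuInit(0))
(device,) = check(cuda.cuDeviceGet(0))
attr = cuda.CUdevice_attribute
(major,) = check(cuda.cuDeviceGetAttribute(attr.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, device))
(minor,) = check(cuda.cuDeviceGetAttribute(attr.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, device))

# e.g. a V100 reports 7.0 (--arch 70) and an A100 reports 8.0 (--arch 80)
print(f"compute capability {major}.{minor}")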