piyueh opened this issue 3 years ago
Update:
An interesting thing happened. The `test.py` from option 2 suddenly worked on my workstation (an old K40 GPU with CUDA 10.2) but still failed on the DGX-1 (V100 GPU, CUDA 11.2.1). Apart from the CUDA version and GPU architecture, `legate.core` and `legate.numpy` were built with the same flags on both machines. There are still other differences, such as gcc versions; I'm not sure whether they are relevant. If more information is needed, please let me know.
Nevertheless, the test suite from `legate.numpy` (option 1) still failed on both the K40 and the V100, and the `linspace` call in my real application code is also still failing on both GPUs.
@piyueh I've been unable to reproduce this issue. I tried CUDA 11.2 on a V100 and CUDA 11.1 on a GTX 1060. In both cases, repeated execution of both the test suite and your program above failed to reproduce the issue.
Does the issue recur deterministically? Can you report your driver versions for both GPUs?
@piyueh Can you also try to reproduce this on the A100 machine we gave you access to?
I cannot reproduce the issue with my current setup on the A100, either (`epyc_a100x4` partition; A100-PCIE-40GB; driver 460.73.01; CUDA driver API 11.2; CUDA runtime 11.2.1 and 11.3 both tried). Also, all unit tests passed! I'm not sure what happened. I frequently removed the virtual environments and re-created new ones, and I have already lost the virtual environment used when creating this issue.
However, the test failed again on my K40, even though it once worked (see my previous comment). My current K40 environment: K40c; driver 465.24.20; CUDA driver API 11.3; CUDA runtime 11.3. I think this may be a minor issue, though, because the K40 seems not to be officially supported since CUDA 11 (or possibly CUDA 10).
I haven't been able to test it again on the particular V100 used for this issue; there's a queue for it, so I have to wait.
Not sure if this is relevant, but I checked the linked runtime libraries (with `ldd librealm.so liblegion.so liblgcore.so liblgnumpy.so`). They all seem to link to the correct CUDA-related libraries (libcuda, libcudart, libcublas, libcublasLt).
I'll do some more tests and let you know the results.
This is my test code:
```python
import time

import numpy as truenumpy
from legate import numpy

results = []
truenumpy.random.seed(int(time.time()))
for i in range(10000):
    N = truenumpy.random.randint(100, 500)
    bg = truenumpy.random.random()
    ed = truenumpy.random.random()
    a = numpy.linspace(bg, ed, N)
    results.append(abs(a.sum() - (bg + ed) * N / 2) < 1e-10)
print(all(results))
```
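The pass/fail check in the loop relies on the arithmetic-series identity: `linspace(bg, ed, N)` produces `N` evenly spaced terms with mean `(bg + ed) / 2`, so its sum should be `(bg + ed) * N / 2`. A minimal sketch confirming that invariant with plain NumPy alone (no Legate involved), using arbitrary example values:

```python
import numpy as np

# Sum of an arithmetic series: linspace(bg, ed, N) has N terms whose
# mean is (bg + ed) / 2, so the total is (bg + ed) * N / 2.
bg, ed, N = 0.25, 0.75, 101
a = np.linspace(bg, ed, N)
assert abs(a.sum() - (bg + ed) * N / 2) < 1e-10
print("arithmetic-series identity holds")
```

An all-zero result from `linspace` makes this check fail for any non-degenerate `bg`, `ed`, which is what the test script above detects.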
I tried running your test code above as well and so far no luck reproducing the issue on my side yet.
I frequently removed the virtual environments and re-created new ones.
Every time you do this I think you need to rebuild the Legate Core. We capture environment information about your Python installation when you build the core and we expect to be running Legate in the same environment. Can you try fresh builds of the core with your virtual environment(s)?
Alright, I think I accidentally found the culprit! I reproduced the result on the V100 when `legate.core` was built with `--arch 80`... I guess that's also what happened when I created this issue. Once I built `legate.core` with `--arch 70,80`, the test worked on the V100. That's my bad. The CUDA runtime version I used is 11.2.1. The V100 is using driver 450.119.04 and CUDA driver API 11.0. Is this the correct behavior?
Every time you do this I think you need to rebuild the Legate Core. We capture environment information about your Python installation when you build the core and we expect to be running Legate in the same environment.
I always build and install Legate into the prefix of the virtual environment, so each environment has its own Legate. When I remove an environment, that environment's Legate is also gone, and when I create a new one, I have to re-build and install `legate.core` and `legate.numpy`. Is what I'm doing safe?
Alright, I think I accidentally found the culprit! I reproduced the result using V100 when legate.core was built with --arch 80
I'm surprised that that does not raise a CUDA driver error. Somewhere we must have a CUDA call whose error code we are not checking; the driver is returning a non-zero CUDA error for a kernel launch with the wrong architecture, but we're not catching it. @streichler: with Realm's CUDA hijack off, does Realm check the error codes on all the `cuCtxSynchronize` calls that it does at the end of tasks (as well as any other CUDA calls)?
And when creating a new environment, I have to re-build and install legate.core and legate.numpy. Is what I'm doing safe?
I'm not sure, because I don't use virtual environments myself. It might be safe. I know for certain that you cannot switch which `libpython.so` you are pointing at and have things still work: `librealm.so` links against a specific Python version and configuration, and if that changes, nothing will work. If you change virtual environments but your Python interpreter is still the same, it should in theory still work. Go ahead and try it out and let us know whether switching virtual environments works.
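To check the condition described above (the same underlying interpreter across environments), you can compare what each environment reports as its base interpreter. A minimal sketch, not part of Legate itself:

```python
import sys

# A virtual environment redirects sys.prefix, but sys.base_prefix still
# points at the interpreter the environment was created from. Two
# environments that print the same base prefix and version are built on
# the same Python, so a librealm.so linked against that interpreter's
# libpython.so should (in theory) keep working across them.
print("env prefix:      ", sys.prefix)
print("base interpreter:", sys.base_prefix)
print("python version:  ", ".".join(map(str, sys.version_info[:3])))
```

Run this inside each virtual environment; matching `base_prefix` and version suggests they share the same `libpython.so`.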
with Realm's CUDA hijack off, does Realm check the error codes on all the `cuCtxSynchronize` calls that it does at the end of tasks (as well as any other CUDA calls)?
Yes.
https://github.com/StanfordLegion/legion/blob/master/runtime/realm/cuda/cuda_module.cc#L1370-L1380
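For illustration, the guard the linked Realm code applies — treating any non-zero `CUresult` as fatal instead of ignoring it — can be sketched in Python. The `fake_cu_ctx_synchronize` stub below is hypothetical (it stands in for the real driver call so the sketch runs without a GPU); the error value 209 is, to my understanding, `CUDA_ERROR_NO_BINARY_FOR_GPU`, the kind of error an architecture mismatch produces:

```python
CUDA_SUCCESS = 0  # CUresult 0 means success in the CUDA driver API

def check_cu(result, call_name):
    """Raise instead of silently discarding a failed driver call's status."""
    if result != CUDA_SUCCESS:
        raise RuntimeError(f"{call_name} returned CUresult {result}")

# Hypothetical stand-in for cuCtxSynchronize: returns 0 on success, or a
# non-zero CUresult when the kernel was built for the wrong architecture.
def fake_cu_ctx_synchronize(arch_matches):
    return CUDA_SUCCESS if arch_matches else 209

check_cu(fake_cu_ctx_synchronize(arch_matches=True), "cuCtxSynchronize")
try:
    check_cu(fake_cu_ctx_synchronize(arch_matches=False), "cuCtxSynchronize")
except RuntimeError as exc:
    print("caught:", exc)
```

Without such a check after every driver call, a launch built with `--arch 80` on a V100 can fail and still leave the output buffer untouched (all zeros), which matches the behavior reported here.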
I've confirmed that building Legate against the wrong GPU architecture leads to these silent errors. I've created a change to the Realm runtime that prevents them from failing silently in the future.
https://gitlab.com/StanfordLegion/legion/-/merge_requests/388
We'll leave this issue open until we get that merged into the `legate_stable` branch.
From now on though, build against the right GPU architecture and everything should be good. :)
Problem
`numpy.linspace` does not work correctly: it returns an all-zero array without throwing any runtime errors or warnings.

To reproduce
Option 1: using `legate.numpy`'s own tests
- Go to `legate.numpy`'s source code folder.
- Run `./test.py --use cuda --gpus 1`.

Option 2: using a standalone `test.py`
- Run `legate --gpus 1 ./test.py -lg:numpy:test`.

Results
When using `legate.numpy`'s own test suite, the results show that the test for `linspace` failed. When using the test script in option 2, the result is an all-zero array.

Expected results
I noticed `linspace` is not listed in the Legate NumPy API reference, so I think `linspace` belongs to the group that is not implemented yet? In that case, the function should raise `NotImplementedError` instead of just returning an all-zero array.