mitsuba-renderer / drjit

Dr.Jit — A Just-In-Time-Compiler for Differentiable Rendering
BSD 3-Clause "New" or "Revised" License

Crash with drjit.set_device() in some settings #119

Closed · yutoe05 closed this 1 year ago

yutoe05 commented 1 year ago

Thanks for this great tool! I would like to use mitsuba3 with pytorch on multiple GPUs, but switching devices with drjit.set_device(device_idx) (device_idx > 0) causes a crash in some settings/systems and shows this error:

Critical Dr.Jit compiler failure: cuda_check(): API error 0001 (CUDA_ERROR_INVALID_VALUE): "invalid argument" in ../ext/drjit/ext/drjit-core/src/registry.cpp:322.

or

Critical Dr.Jit compiler failure: cuda_check(): API error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in ../ext/drjit/ext/drjit-core/src/eval.cpp:395.

These are examples of settings where the crash occurs.

common information:

example 1

example 2

example 3

I run this simple code:

import sys
import drjit as dr
import mitsuba as mi

device_idx = 4
mi.set_variant("cuda_ad_rgb")
dr.set_device(device_idx)

scene = mi.load_dict(mi.cornell_box()) # CUDA_ERROR_INVALID_VALUE occurs here
img = mi.render(scene)

img = img.torch() # CUDA_ERROR_ILLEGAL_ADDRESS occurs here (img.numpy() and plt.imshow(img) show same results)

The error doesn't occur when I run with CUDA_VISIBLE_DEVICES=device_idx and call drjit.set_device(0).
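For reference, the reason drjit.set_device(0) is the right call in that workaround is that CUDA renumbers the surviving GPUs from 0, in the order they appear in CUDA_VISIBLE_DEVICES. A small helper (hypothetical, not part of Dr.Jit or CUDA) that computes the renumbered index:

```python
import os

def visible_index(physical_idx: int) -> int:
    """Return the index a CUDA library sees for a physical GPU once
    CUDA_VISIBLE_DEVICES is set. Devices listed in the variable are
    renumbered from 0 in listing order; unlisted devices are hidden."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return physical_idx  # no restriction: indices are unchanged
    order = [int(v) for v in visible.split(",")]
    return order.index(physical_idx)  # raises ValueError if hidden

os.environ["CUDA_VISIBLE_DEVICES"] = "4"
print(visible_index(4))  # physical GPU 4 becomes device 0
```

With CUDA_VISIBLE_DEVICES=4, physical GPU 4 is the only visible device and is numbered 0, matching the working configuration above.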

I also found the following two cases.

So, these errors may be due to my environment.

Do you have any advice? Thanks in advance!

njroussel commented 1 year ago

Hi @yutoe05

  • If a GPU works with device_idx > 0, it also works when shifted to a different non-zero device_idx via CUDA_VISIBLE_DEVICES.
  • If a GPU does not work with device_idx > 0, it also fails when shifted to a different non-zero device_idx via CUDA_VISIBLE_DEVICES.

This does make it sound more like an issue with your environment than with Dr.Jit, especially in examples 1 and 3, where all GPUs are identical.

You could try some other software/library to see if you're getting similar issues. Please report back if you still believe that this is a bug in our implementation.

yutoe05 commented 1 year ago

Thanks for your answer, @njroussel.

I tried cupy, numba and pytorch on multiple GPUs. I checked them in some settings/systems where mitsuba3 fails to render, and they all work well.

environments:

I run these codes:

import cupy
import numpy as np

device_idx = 1
with cupy.cuda.Device(device_idx):
    cx = cupy.arange(20)
    cx = cx + 1
# .all() verifies every element (.any() would pass if just one matched)
print((np.arange(1, 21) == cupy.asnumpy(cx)).all())
breakpoint()
from numba import cuda
import numpy as np

@cuda.jit
def multiplication(input, times, result):
    x, y = cuda.grid(2)
    result[x, y] = input[x, y]
    for i in range(times - 1):
        result[x, y] += input[x, y]

device_idx = 1
cuda.select_device(device_idx)
order = 7
input = np.random.randn(2**order, 2**order)
input_d = cuda.to_device(input)
result = np.zeros_like(input)
result_d = cuda.to_device(result)
times = 10
multiplication[(2 ** (order - 4), 2 ** (order - 4)), (2**4, 2**4)](input_d, times, result_d)
gt = input * times
result = result_d.copy_to_host()
# .all() verifies every element (.any() would pass if just one matched)
print((np.abs((result - gt) / gt) < 1e-3).all())
breakpoint()
import sys

import torch

device_idx = 1
torch.cuda.set_device(device_idx)
cx = torch.randn(20).cuda()
cx = cx + 1
breakpoint()
sys.exit()

Using breakpoint(), I checked with nvidia-smi that the specified device was actually being used.

Certainly the behavior of mitsuba3 seems environment-dependent, but since I don't see similar issues with the other libraries, it may be a bug.

njroussel commented 1 year ago

@yutoe05

Thank you for this, it does seem to point to an issue on our end. We don't have any multi-GPU setups, so it's a bit hard for us to debug this further.

I understand that it's not the most elegant, but is there anything stopping you from always using the CUDA_VISIBLE_DEVICES workaround?

yutoe05 commented 1 year ago

Thanks for your response.

I would like to use mitsuba3 while training a pytorch network with data parallelism on multiple GPUs. CUDA_VISIBLE_DEVICES affects not only mitsuba3 but also pytorch, so I can't use it.

I'll look for any workaround. Thank you very much.

njroussel commented 1 year ago

Thank you for the update. Indeed I hadn't thought of that :sweat_smile:. Technically, I think you could just do something like os.environ['CUDA_VISIBLE_DEVICES'] = ...; import drjit as dr; del os.environ['CUDA_VISIBLE_DEVICES']. Granted, this is more of a temporary workaround if you're truly stuck.
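Spelled out, the suggested import-time workaround might look like the sketch below (the try/except is only there so the snippet can run on machines without Dr.Jit installed; it is not part of the suggestion):

```python
import os

device_idx = 4  # the physical GPU you want Dr.Jit to use

# Restrict visibility *before* Dr.Jit first initializes CUDA, so the
# chosen GPU is the only device it can see (and is numbered 0).
os.environ["CUDA_VISIBLE_DEVICES"] = str(device_idx)
try:
    import drjit as dr  # CUDA is initialized during this import
except ImportError:
    dr = None  # Dr.Jit not installed; snippet shown for illustration
finally:
    # Undo the restriction so later imports (e.g. torch) are unaffected.
    del os.environ["CUDA_VISIBLE_DEVICES"]
```

As the next comment in the thread shows, this only helps if no other library has already initialized CUDA earlier in the same process.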


I might have an idea. Could you try re-ordering your imports and setup as follows:

import drjit as dr
device_idx = 4
dr.set_device(device_idx)

import mitsuba as mi
mi.set_variant("cuda_ad_rgb")

My best guess is that some global device memory is allocated on the default device (device 0) when importing mitsuba or drjit, and it is not moved after the set_device call. If the suggestion above works, then it's something allocated in mitsuba, which seems plausible. (Your initial example imports mitsuba and sets the variant before setting the device.)

yutoe05 commented 1 year ago

Thank you for great ideas!

First, I tried the following:

import os

import torch

device_idx = 4  # as in the original example
os.environ["CUDA_VISIBLE_DEVICES"] = str(device_idx)
import drjit as dr
import mitsuba as mi
del os.environ["CUDA_VISIBLE_DEVICES"]
print("num_gpus:", torch.cuda.device_count())

The result is num_gpus: 1. I think CUDA_VISIBLE_DEVICES is only read the first time CUDA is initialized, and the CUDA settings are then shared among libraries in the same process. For example,

import os

import torch

print("num_gpus:", torch.cuda.device_count())  # CUDA is initialized here
device_idx = 4  # as in the original example
os.environ["CUDA_VISIBLE_DEVICES"] = str(device_idx)
import drjit as dr
import mitsuba as mi
del os.environ["CUDA_VISIBLE_DEVICES"]

results in num_gpus > 0 and drjit using the device with device_idx == 0.
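The "read once at initialization" behavior can be illustrated with a toy stand-in for the CUDA runtime (FakeCudaRuntime is purely illustrative, not a real API; it only mimics the caching behavior observed above):

```python
import os

class FakeCudaRuntime:
    """Toy model of how the CUDA runtime treats CUDA_VISIBLE_DEVICES:
    the variable is snapshotted on the first CUDA call, and later
    changes to os.environ have no effect in the running process."""

    def __init__(self, physical_gpus=8):
        self._physical = physical_gpus
        self._visible = None  # not initialized yet

    def device_count(self):
        if self._visible is None:  # first call: snapshot the env var
            raw = os.environ.get("CUDA_VISIBLE_DEVICES")
            self._visible = (list(range(self._physical)) if raw is None
                             else [int(v) for v in raw.split(",")])
        return len(self._visible)

runtime = FakeCudaRuntime()
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
n_before = runtime.device_count()  # env var read here -> 4 devices
os.environ["CUDA_VISIBLE_DEVICES"] = "4"
n_after = runtime.device_count()   # cached snapshot -> still 4
print(n_before, n_after)
```

This mirrors what the snippets above observe: once torch (or any library) has triggered CUDA initialization, changing CUDA_VISIBLE_DEVICES in the same process does nothing.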

Second, I tried setting the device with drjit before importing mitsuba3, but the same error occurred.

Third, when running with CUDA_LAUNCH_BLOCKING=1, I found that

Critical Dr.Jit compiler failure: jit_optix_check(): API error 7050 (OPTIX_ERROR_LAUNCH_FAILURE): "Launch failure" in ../ext/drjit/ext/drjit-core/src/optix_core.cpp:489.

or CUDA_ERROR_INVALID_VALUE is printed, instead of the CUDA_ERROR_ILLEGAL_ADDRESS or CUDA_ERROR_INVALID_VALUE reported before.

Therefore, I think this issue may be related to OptiX, which is not used by the other libraries I tried.

I also found that some memory is allocated on all visible GPUs when importing drjit. (I'm not sure whether this information is helpful though.)

I'm sorry, but I'm going to be busy for a few weeks and may not be able to respond immediately.

yutoe05 commented 1 year ago

I apologize for the delayed response.

I still don't know how to use torch and drjit while switching between multiple GPU devices.

I have found, however, that by splitting a single host into multiple nodes with torch's DistributedDataParallel, combined with the CUDA_VISIBLE_DEVICES environment variable, I can achieve my initial goal.
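One way this combination can work is to give each DistributedDataParallel worker its own CUDA_VISIBLE_DEVICES before the worker process starts, so every process (and thus both torch and Dr.Jit inside it) sees exactly one GPU as device 0. A sketch of a per-rank environment builder (worker_env and the address/port values are my assumptions, not taken from the thread):

```python
import os

def worker_env(rank: int, world_size: int) -> dict:
    """Build the environment for one DDP worker process. Each worker
    sees only GPU `rank`, which it then addresses as device 0."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(rank)  # one GPU per process
    env["RANK"] = str(rank)                  # DDP rendezvous variables
    env["WORLD_SIZE"] = str(world_size)
    env["MASTER_ADDR"] = "127.0.0.1"
    env["MASTER_PORT"] = "29500"
    return env

# Each environment would be passed to subprocess.Popen([...], env=...)
# when spawning the per-GPU training script.
print(worker_env(1, 4)["CUDA_VISIBLE_DEVICES"])  # -> "1"
```

Inside each worker, both torch and mitsuba3 then use device 0, sidestepping drjit.set_device with a non-zero index entirely.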

Perhaps due to bandwidth constraints, I could not reach the desired processing speed, but I will close this issue for now.

Thank you very much, @njroussel, for taking the time to help me.