pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

torch.randperm() on CUDA returns wrong values when n is large (n > 2^12) #3816

Closed nothingwithyou closed 3 years ago

nothingwithyou commented 3 years ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. I followed the TorchVision Object Detection Finetuning Tutorial, using the same code as in the tutorial.
  2. The first time I trained on the CPU, training was fine, but during evaluation the network still used the GPU; that is the first issue.
  3. Then I trained on the GPU with CUDA (I only have one GPU), and training failed with RuntimeError: CUDA error: device-side assert triggered. I am using PyTorch 1.8.1 and torchvision 0.9.1.
  4. I then debugged the whole code and found that some pictures are fine, but not all. The bug shows up during loss calculation, in the subsampling of positives and negatives. It uses a function in _utils.py named BalancedPositiveNegativeSampler(), which calls torch.randperm(positive.numel(), device=positive.device)[:num_pos] to generate random indices (a simplified sketch of this pattern follows the list).
  5. But I see that function return wrong values, such as the very large value 4755801207605297152, even though positive.numel() is 265826, so I tried different values of n. I found that on my machine, whenever n > 2^12, it fails to return a correct index list. I think the limit on n is related to the GPU channels or something else.
  6. I think the code that generates the random indices should include a check: if the given n is larger than the limit, it should fall back to the CPU.
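
For context, this is roughly the sampling pattern from torchvision.models.detection._utils that the steps above refer to (a simplified sketch with made-up tensor sizes, not the exact library source):

import torch

# Simplified sketch of the pattern in BalancedPositiveNegativeSampler
# (illustrative only; the real code lives in torchvision/models/detection/_utils.py).
labels = torch.randint(0, 2, (265826,), device="cuda")  # made-up match labels
positive = torch.where(labels == 1)[0]                   # indices of positive samples
num_pos = 128
# The reported bug: on CUDA this permutation sometimes contains values far
# larger than positive.numel(), so indexing with it asserts on the device.
perm = torch.randperm(positive.numel(), device=positive.device)[:num_pos]
pos_idx = positive[perm]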

wrong history.txt

Expected behavior

Returns a random permutation of integers from 0 to n - 1
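
A quick sanity check of that contract (a hedged sketch, assuming a CUDA device is available; n is illustrative):

import torch

# Expected behavior: every value of randperm(n) lies in [0, n - 1] and each
# value appears exactly once.
n = 2**15
perm = torch.randperm(n, device="cuda")
assert perm.min().item() == 0 and perm.max().item() == n - 1
assert perm.unique().numel() == n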

Environment

The output of the environment collection script is attached as wrong message.txt.

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Additional context

wrong envs.txt

datumbox commented 3 years ago

Hi @nothingwithyou,

I would recommend opening a new ticket on PyTorch and providing a minimal set of commands to reproduce it. From what you describe, something like torch.randperm(265826, device='cuda').max() should be enough to showcase any potential issue.

Unfortunately when I run the above command, I don't get any values larger than n. See below:

>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')
>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')
>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')
>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')
>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')
>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')
>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')

I would also advise trying the latest PyTorch nightly to see whether the problem is resolved.

nothingwithyou commented 3 years ago

Yes, if I run that command on its own, it returns correct values (Screenshot 2021-05-13 125118). But the problem really does happen when I debug the tutorial code. I added a breakpoint in torchvision/models/detection/_utils.py, ran main(), and ran the same command there, and it failed. It then gave me the same error: {RuntimeError}CUDA error: device-side assert triggered, so I think torch.randperm(n, device='cuda') caused the problem. I tested with small values and those worked. I am a student and just don't understand why it behaves differently in the two cases. I think the most likely reason is the way the process allocates memory: when it generates the random numbers it overflows GPU memory (not all of the GPU memory, because I can see that about half of it is still free). I tested in PyCharm 2021.1 with the PyDev debugger (build 211.6693.115). The following screenshots show the results of my test: Screenshot 2021-05-13 125144, Screenshot 2021-05-13 125318, Screenshot 2021-05-13 131336.

datumbox commented 3 years ago

@nothingwithyou I think it would be helpful if you construct a small, self-contained example that reproduces the problem, so that the PyTorch core developers can investigate. If you manage to reproduce the issue in a few lines of code, it will assist the investigation; without one it is impossible to debug just by looking at screenshots. Another thing you could do is provide the input that leads to this problem, since that will help us test it.

fmassa commented 3 years ago

@nothingwithyou I think the most common cause of CUDA error: device-side assert triggered is that the number of classes in your dataset doesn't match the number of classes in your model, which leaves the code in an invalid state.

Due to the asynchronous nature of CUDA, the error will only show up later, so it is probably misleading you to think the issue is in randperm.

I would recommend running the code with CUDA_LAUNCH_BLOCKING=1 python my_script.py to have the code be run in a synchronous way, so that you'll be able to more precisely identify where the issue is.
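
As a hypothetical illustration of that asynchrony (not code from the tutorial), an out-of-bounds CUDA index can launch without error and only fail at a later synchronization point unless CUDA_LAUNCH_BLOCKING=1 is set:

import torch

# Illustrative sketch: the bad index launches asynchronously, so the
# device-side assert is reported at the next synchronization point (the
# .cpu() call) rather than at the indexing line.
t = torch.arange(10, device="cuda")
bad = t[torch.tensor([100], device="cuda")]  # kernel launch returns immediately
bad.cpu()  # RuntimeError: CUDA error: device-side assert triggered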

Given that this seems to be due to user error (most probably a mismatch in the number of classes), I'm closing this issue, but let us know if you still face these problems.

nothingwithyou commented 3 years ago

If I don't use CUDA_LAUNCH_BLOCKING=1, it returns a different error. I located this line of wrong code by debugging line by line. And when I do use CUDA_LAUNCH_BLOCKING=1, it gives me the same error report.

And my dataset and code are all from your page: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html. A mismatch between the number of classes in the dataset and the number of classes in the model is just one possible cause of this problem, but that is not the case here. The log from running with CUDA_LAUNCH_BLOCKING=1 is in 1.log.

1.log is attached. The following is the code (1.py is the main(); I made some small changes to reduce the code), and the dataset is https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip (the same as in your tutorial). The torchvision.7z archive contains the folder from the conda env/lib/lib/python3.7/site-packages. I think it will be useful for you to reproduce the problem. subit.zip

nothingwithyou commented 3 years ago

It's just that randperm returns a wrong value, so negative[perm2] causes CUDA error: device-side assert triggered. But I cannot figure out what makes it go wrong on CUDA.

nothingwithyou commented 3 years ago

@fmassa It is not a mismatch in the number of classes. It's just that randperm returns a wrong value, so negative[perm2] causes CUDA error: device-side assert triggered. But I cannot figure out what makes it go wrong on CUDA.

fmassa commented 3 years ago

@nothingwithyou I've tried running the example colab notebook from https://colab.research.google.com/github/pytorch/vision/blob/temp-tutorial/tutorials/torchvision_finetuning_instance_segmentation.ipynb without modifications and everything worked without issues.

I've tried downloading the subit.zip and 1.log files, but the zip just contains the whole torchvision package, so it's not possible to easily spot what modifications you might have made to torchvision.

From 1.log, the CUDA assert that was triggered was in IndexKernel, which is used in advanced indexing like tensor[[0, 4, 2]], showing that there is an index out of bounds there: https://github.com/pytorch/pytorch/blob/ccd7141919f2d94f2fb0cc077f9b23e24e84dfff/aten/src/ATen/native/cuda/IndexKernel.cu#L140

But the rest of your error message seems to indicate that this error was hit in torch.where(matched_idxs_per_image >= 1)[0], which doesn't call into IndexKernel, which makes me suspect that your log in 1.log was not produced by running with CUDA_LAUNCH_BLOCKING=1 python my_script.py.
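
For reference, a hypothetical illustration of that distinction (toy tensors, not the detection code): advanced indexing with a tensor of indices dispatches to IndexKernel on CUDA, while torch.where only reports positions and does no indexing.

import torch

# Advanced indexing uses the CUDA index kernel; torch.where does not.
t = torch.arange(10, device="cuda")
advanced = t[torch.tensor([0, 4, 2], device="cuda")]  # goes through IndexKernel
positions = torch.where(t >= 5)[0]                     # no indexing kernel involved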

nothingwithyou commented 3 years ago

@fmassa I am so sorry that I submitted the wrong, old file. These days I have been busy revising my graduation thesis and handling other errands. Here is the correct log: 1.log. The whole torchvision folder is exactly what I installed via pip; I haven't changed anything. The Colab notebook uses torchvision 0.3, but I used the latest release. I think the bug may not reproduce on all devices and environments, but at least it does in mine. Is something wrong with my torchvision or my CUDA?

fmassa commented 3 years ago

@nothingwithyou I'm not clear what issue you might be facing here. I still suspect the error you are seeing is a red herring and the issue is in the data / labels, but without a small reproduction it is hard for us to pinpoint where the (potential) issue might be.

Haichao-Zhang commented 3 years ago

Hi, we encountered a similar issue recently (torch==1.8.1+cu111): https://github.com/HorizonRobotics/alf/pull/896

Description: the CUDA version of torch.randperm(n) seems to have a bug / unexpected behavior when n is a large number. When running alongside other code, it sometimes generates all zeros, or negative or very large values that cause out-of-bounds kernel errors, similar to the issue described by @nothingwithyou.

My current solution is to use np.random.permutation instead of torch.randperm for now, but it would be better to be able to switch back to torch.randperm in the future.
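
A minimal sketch of that workaround (the size and device are illustrative):

import numpy as np
import torch

# Workaround sketch: build the permutation on the CPU with NumPy and move it
# to the target device, instead of calling torch.randperm directly on CUDA.
n, device = 265826, "cuda"
perm = torch.from_numpy(np.random.permutation(n)).to(device)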

fmassa commented 3 years ago

@Haichao-Zhang can you provide a minimal working example that allows us to pinpoint the issue? Something along the lines of

torch.manual_seed(42)  # for reproducibility

r = torch.randperm(1000000)
# show issue with r somehow
assert r.min() < 0  # bug!! 

Without such examples, it might be very hard to debug. If this doesn't happen on isolated examples, this can become tricky to fix.

Also, would it be possible to try using a nightly version of PyTorch? There have been some improvements that happened on randperm since the 1.8.1 release and maybe this issue has been fixed already.

Haichao-Zhang commented 3 years ago

Hi @fmassa, I have provided a detailed reproducible example here: https://github.com/pytorch/pytorch/issues/59756. Hope it helps.

qianyizhang commented 2 years ago

Hi @fmassa, I am experiencing the same issue with torch==1.8.1+cu111 on Ubuntu 16.04 with an A100 card:

import torch
# this is fine
idx = torch.randperm(2**14, device="cuda:0", dtype=torch.long)[:2]
print(idx) # tensor([13049,  6236], device='cuda:0')

# this is also fine
idx = torch.randperm(2**15, device="cpu", dtype=torch.long)[:2]
print(idx) # tensor([23385, 21083])

# this is buggy
idx = torch.randperm(2**15, device="cuda:0", dtype=torch.long)[:2]
print(idx) # tensor([ 336033229560773140, 4114788168274451291], device='cuda:0')

qianyizhang commented 2 years ago

This is really strange; I can't even reproduce the error reliably every time. However, it does crash my training in detectron2 from time to time. As a workaround, I replace randperm with randint when N is large (sketched below).
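
A minimal sketch of that workaround (sizes are illustrative; note that randint samples with replacement, so duplicate indices are possible):

import torch

# Workaround sketch: when n is large, draw k indices with randint instead of
# taking the first k elements of randperm. Duplicates are possible, which is
# often acceptable when picking a small subset from a large pool.
n, k = 2**15, 2
idx = torch.randint(n, (k,), device="cuda:0")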

dtch1997 commented 1 year ago

I also encountered the issue where torch.randperm on device 'cuda:0' returned very large values outside the expected bounds, whereas torch.randperm on the CPU worked correctly.

As a workaround, I initialized the randperm on the CPU and then moved the result to the GPU using torch.randperm(n).to(device), as in the sketch below.
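
A minimal sketch of that workaround (device and n are illustrative):

import torch

# Workaround sketch: generate the permutation on the CPU (the default device)
# and transfer the result to the GPU afterwards.
device = torch.device("cuda:0")
n = 265826
perm = torch.randperm(n).to(device)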