kmkurn / pytorch-crf

(Linear-chain) Conditional random field in PyTorch.
https://pytorch-crf.readthedocs.io
MIT License

Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. #119

Open zhangyuqi-1 opened 1 month ago

zhangyuqi-1 commented 1 month ago

    self.all_crf_list: [CRF(num_tags=3), CRF(num_tags=13)]

    all_logits_list: [
        tensor([[[-3.9874e+36,  1.4790e+13, -3.9874e+36],
                 ...,
                 [-3.9874e+36,  2.7981e+12, -3.9874e+36]]],
               device='cuda:0', grad_fn=<ViewBackward0>),
        tensor([[[ 1.9106e+35,  2.6347e+13,  2.6358e+13,  ...,  0.0000e+00,  0.0000e+00,  0.0000e+00],
                 ...,
                 [ 4.4330e+36,  7.0763e+12,  7.0844e+12,  ...,  0.0000e+00,  0.0000e+00,  0.0000e+00]]],
               device='cuda:0', grad_fn=<ViewBackward0>)
    ]

    self.labels_split = [
        tensor([[-100,    1,    2,  ..., -100, -100, -100],
                ...,
                [-100,    1,    2,  ..., -100, -100, -100]], device='cuda:0'),
        tensor([[-100,    6,   12,  ..., -100, -100, -100],
                ...,
                [-100,    6,   12,  ..., -100, -100, -100]], device='cuda:0')
    ]

`[-crf(lo, la, reduction='mean') for crf, lo, la in zip(self.all_crf_list, all_logits_list, labels_split)]` does not run; it fails with `Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.`

zhangyuqi-1 commented 1 month ago

File "/data1/zhangyq/change-records-analysis/my_model.py", line 63, in forward all_loss_list = [-crf(lo, la, reduction='mean') for crf, lo, la in ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data1/zhangyq/change-records-analysis/my_model.py", line 63, in all_loss_list = [-crf(lo, la, reduction='mean') for crf, lo, la in ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zhangyq/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zhangyq/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data1/zhangyq/miniforge3/envs/py311/lib/python3.11/site-packages/torchcrf/init.py", line 94, in forward mask = torch.ones_like(tags, dtype=torch.uint8) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

zhangyuqi-1 commented 1 month ago

After switching to a CPU environment, the error becomes `IndexError: index -100 is out of bounds for dimension 0 with size 3`, so the -100 labels appear to be the cause.

zhangyuqi-1 commented 1 month ago

I changed my code to mask every position corresponding to -100, but the error persisted. I finally replaced -100 with 0 and it ran successfully, but 0 is a real label, so I am not sure how the masked positions are treated, i.e. whether they are actually excluded from the computation. I used -100 in the first place because PyTorch ignores that value and skips it in the loss and gradient computation. I wonder whether this could be improved at the source-code level.
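For reference, a minimal sketch of the workaround described above, with made-up tensor contents: derive the mask from the -100 positions and replace -100 with any valid tag index before calling the CRF.

```python
import torch
from torchcrf import CRF

torch.manual_seed(0)
num_tags = 3
crf = CRF(num_tags)  # batch_first=False, so shapes are (seq_length, batch_size, ...)

seq_length, batch_size = 4, 2
emissions = torch.randn(seq_length, batch_size, num_tags)

# Labels in the transformers style: -100 marks positions to ignore.
labels = torch.tensor([[1,    0],
                       [2,    0],
                       [2, -100],
                       [-100, -100]], dtype=torch.long)

# Build CRF-compatible inputs: mask out the -100 positions and replace the
# value itself with an in-range index (0 here), because tags are used as
# indices inside the CRF even when masked.
mask = (labels != -100).to(torch.uint8)
tags = labels.masked_fill(labels == -100, 0)

# Note: the mask's first timestep has to be all ones, so leading special
# tokens (e.g. [CLS]) are usually dropped rather than masked.
loss = -crf(emissions, tags, mask=mask, reduction='mean')
print(loss)
```

Whether the 0 written into the masked positions affects the result is exactly question 1 below.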

kmkurn commented 1 month ago

Hi, can you please post in English? I don't know Chinese. Also, please post a minimal code to reproduce the error.

zhangyuqi-1 commented 1 month ago

> Hi, can you please post in English? I don't know Chinese. Also, please post a minimal code to reproduce the error.

Sorry, sorry, this is my mistake.

kmkurn commented 1 month ago

Do you still need help on this issue? If so, please post in English. Otherwise, I'll close the issue.


zhangyuqi-1 commented 1 month ago

Situation description: when the tags contain the label -100, an error occurs, even if the positions holding -100 are masked. The purpose of the -100 label is to make PyTorch ignore the corresponding token; see the following example code from the transformers library: Link to the code. When I changed the label from -100 to 0, it worked and training proceeded smoothly.

My questions are as follows:

1. By changing the label from -100 to 0 and masking all the positions that originally held -100, will training be unaffected in the end, given that the 0 label has a specific meaning?
2. Is there any plan to improve this aspect, so that the CRF module can handle the -100 label directly?

Thank you for your reply, very much appreciated.

zhangyuqi-1 commented 1 month ago

```python
import torch
from torchcrf import CRF

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

num_tags = 5
model = CRF(num_tags).to(device)
seq_length = 3
batch_size = 2
emissions = torch.randn(seq_length, batch_size, num_tags).to(device)
tags = torch.tensor([[0, 1], [2, 4], [3, -100]], dtype=torch.long).to(device)
model(emissions, tags)
mask = torch.tensor([[1, 1], [1, 1], [1, 0]], dtype=torch.uint8).to(device)
model(emissions, tags, mask=mask)
```

This raises: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

kmkurn commented 1 month ago

Thank you for your clear question. Yes, this is an expected behaviour with the -100.

  1. Please see https://github.com/kmkurn/pytorch-crf/issues/106#issuecomment-1335959129 for an answer to your question.
  2. No plan for this currently, but I'm happy to accept a PR on it :-)
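To make point 1 concrete, here is a small self-contained sketch built on the repro above (seed and values are arbitrary): if masked positions are excluded from the score, then changing the tag at a masked position, as long as it remains a valid index, should leave the log-likelihood unchanged.

```python
import torch
from torchcrf import CRF

torch.manual_seed(0)
num_tags = 5
crf = CRF(num_tags)

seq_length, batch_size = 3, 2
emissions = torch.randn(seq_length, batch_size, num_tags)
mask = torch.tensor([[1, 1], [1, 1], [1, 0]], dtype=torch.uint8)

# Two tag tensors that differ only at the masked position (timestep 2, batch item 1).
tags_a = torch.tensor([[0, 1], [2, 4], [3, 0]], dtype=torch.long)
tags_b = torch.tensor([[0, 1], [2, 4], [3, 2]], dtype=torch.long)

ll_a = crf(emissions, tags_a, mask=mask)
ll_b = crf(emissions, tags_b, mask=mask)
# Should hold if the masked position is excluded from the score.
assert torch.allclose(ll_a, ll_b)
```

This is the same trick as replacing -100 with 0 earlier in the thread: the value written into a masked position is arbitrary, it only has to be a legal tag index.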