RuntimeError: CUDA error: device-side assert triggered

shivam1702 commented 4 years ago

I was trying to run the model on my own dataset of KG triples to compare its performance, but I ran into a problem.

Running the training command for the policy gradient model, ./experiment.sh configs/<model>.sh --train 0, I encountered the following error: RuntimeError: CUDA error: device-side assert triggered

Full stack trace:

 33%|████████████████████████████████████████████████                                                                                                | 226/677 [01:38<02:55,  2.58it/s]
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:256: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [283,0,0], thread: [0,0,0] Assertion `sum > accZero` failed.
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/workspace/KGReasoning/code/MultiHopKG/src/experiments.py", line 765, in <module>
    run_experiment(args)
  File "/workspace/KGReasoning/code/MultiHopKG/src/experiments.py", line 746, in run_experiment
    train(lf)
  File "/workspace/KGReasoning/code/MultiHopKG/src/experiments.py", line 235, in train
    lf.run_train(train_data, dev_data)
  File "/workspace/KGReasoning/code/MultiHopKG/src/learn_framework.py", line 108, in run_train
    loss = self.loss(mini_batch)
  File "/workspace/KGReasoning/code/MultiHopKG/src/rl/graph_search/pg.py", line 58, in loss
    output = self.rollout(e1, r, e2, num_steps=self.num_rollout_steps)
  File "/workspace/KGReasoning/code/MultiHopKG/src/rl/graph_search/pg.py", line 135, in rollout
    sample_outcome = self.sample_action(db_outcomes, inv_offset)
  File "/workspace/KGReasoning/code/MultiHopKG/src/rl/graph_search/pg.py", line 205, in sample_action
    sample_outcome = sample(action_space, action_dist)
  File "/workspace/KGReasoning/code/MultiHopKG/src/rl/graph_search/pg.py", line 190, in sample
    sample_action_dist = apply_action_dropout_mask(action_dist, action_mask)
  File "/workspace/KGReasoning/code/MultiHopKG/src/rl/graph_search/pg.py", line 177, in apply_action_dropout_mask
    action_keep_mask = var_cuda(rand > self.action_dropout_rate).float()
  File "/workspace/KGReasoning/code/MultiHopKG/src/utils/ops.py", line 121, in var_cuda
    return Variable(x, requires_grad=requires_grad).cuda()
RuntimeError: CUDA error: device-side assert triggered

Could you help me debug this? What are the possible sources of the error, and how can I fix them?

shivam1702 commented 4 years ago

Running again with CUDA_LAUNCH_BLOCKING=1, I get this stack trace:

/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:256: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [139,0,0], thread: [0,0,0] Assertion `sum > accZero` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorScatterGather.cu line=67 error=710 : device-side assert triggered
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/workspace/KGReasoning/code/MultiHopKG/src/experiments.py", line 765, in <module>
    run_experiment(args)
  File "/workspace/KGReasoning/code/MultiHopKG/src/experiments.py", line 746, in run_experiment
    train(lf)
  File "/workspace/KGReasoning/code/MultiHopKG/src/experiments.py", line 235, in train
    lf.run_train(train_data, dev_data)
  File "/workspace/KGReasoning/code/MultiHopKG/src/learn_framework.py", line 108, in run_train
    loss = self.loss(mini_batch)
  File "/workspace/KGReasoning/code/MultiHopKG/src/rl/graph_search/pg.py", line 58, in loss
    output = self.rollout(e1, r, e2, num_steps=self.num_rollout_steps)
  File "/workspace/KGReasoning/code/MultiHopKG/src/rl/graph_search/pg.py", line 135, in rollout
    sample_outcome = self.sample_action(db_outcomes, inv_offset)
  File "/workspace/KGReasoning/code/MultiHopKG/src/rl/graph_search/pg.py", line 205, in sample_action
    sample_outcome = sample(action_space, action_dist)
  File "/workspace/KGReasoning/code/MultiHopKG/src/rl/graph_search/pg.py", line 192, in sample
    next_r = ops.batch_lookup(r_space, idx)
  File "/workspace/KGReasoning/code/MultiHopKG/src/utils/ops.py", line 33, in batch_lookup
    samples = torch.gather(M, 1, idx).view(-1)
RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorScatterGather.cu:67
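
The two traces end at different Python lines (var_cuda in the first, torch.gather in the second) because CUDA kernels are launched asynchronously, so a device-side assert surfaces at a later call rather than at the kernel that failed. In both runs the assertion message itself names the multinomial sampling kernel. As a minimal sketch, assuming the training script is started as a subprocess from Python rather than from the shell, the variable can be set like this; `blocking_env` is a hypothetical helper name, not part of MultiHopKG:

```python
import os


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # The variable has to be in place before the training process
    # initializes CUDA, so it is set on the child's environment.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Small demonstration with a made-up base environment.
demo_base = {"PATH": "/usr/bin"}
demo_env = blocking_env(demo_base)
print(demo_env["CUDA_LAUNCH_BLOCKING"])
```

The returned dictionary could then be passed as the `env` argument of `subprocess.run` when launching the training command.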

davidlvxin commented 4 years ago

I have the same issue. Did you find a solution?

davidlvxin commented 4 years ago

I have found the problem.

The code uses a small trick here. For a triple (h, r, t) in the dataset, the trick masks, in the last step, every action that leads to an entity e_1, e_2, e_3, ... for which (h, r, e_1), (h, r, e_2), (h, r, e_3), ... are also in the dataset. When every entity in the action space meets this condition, i.e., every action leads to some other correct answer, all actions are masked and the model has no action to select.

The trick will mostly fail on a dense knowledge graph with some SPECIAL 1-N triples, i.e., when a large proportion of the entities act as tail entities for (h, r, ?) in the knowledge graph. Some work may be needed to adapt the code to more knowledge graphs. @todpole3

The error should actually be triggered here, and the exception is

invalid multinomial distribution (sum of probabilities <= 0)

Here is the code that uses the trick:

def get_false_negative_mask(self, e_space, e_s, q, e_t, kg):
    answer_mask = self.get_answer_mask(e_space, e_s, q, kg)
    # This is a trick applied during training where we convert a multi-answer prediction problem into several
    # single-answer prediction problems. By masking out the other answers in the training set, we are forcing
    # the agent to walk towards a particular answer.
    # This trick does not affect inference on the test set: at inference time the ground truth answer will not 
    # appear in the answer mask. This can be checked by uncommenting the following assertion statement. 
    # Note that the assertion statement can trigger in the last batch if you're using a batch_size > 1 since
    # we append dummy examples to the last batch to make it the required batch size.
    # The assertion statement will also trigger in the dev set inference of NELL-995 since we randomly 
    # sampled the dev set from the training data.
    # assert(float((answer_mask * (e_space == e_t.unsqueeze(1)).long()).sum()) == 0)
    false_negative_mask = (answer_mask * (e_space != e_t.unsqueeze(1)).long()).float()
    return false_negative_mask

todpole3 commented 4 years ago

@davidlvxin Thanks for helping with the troubleshooting. Unfortunately, device-side asserts from PyTorch are very uninformative.

However, since we have e_space != e_t.unsqueeze(1), it should guarantee that e_t is not masked and the model can select it. Hence what you have identified might not be the right cause.

Would you mind printing out your answer_mask, e_space and e_t vectors to see if anything looks wrong?

Okay, I realized that this argument is wrong if e_t is not in the action space of the current state, so yes, it is possible to encounter a case where all actions are masked.

Also, did you encounter the error during training cycle or inference cycle?

davidlvxin commented 4 years ago

Yeah, e_t is not masked, but what if the action space does not contain e_t? Are all actions masked then?

I printed some vectors and found that all actions become masked only after get_false_negative_mask is applied.

By the way, I encountered the error during training.

todpole3 commented 4 years ago

@davidlvxin Thanks and sorry about the confusion. I realized the issue myself shortly after making the comment. I believe the reason we did not run into this in our paper is that we had augmented the graph such that each node has a self-edge. Our training data has no examples of self relations, hence the self-edge is always the fallback solution.

So density of the graph should not be a problem, but if you have triples of the form (e1, r, e1) it is possible to arrive at a state with no actions using our code. Is this the case?

todpole3 commented 4 years ago

Also, I'm perplexed that this line throws the exception for you.

Are you setting action_dropout_rate to 0? In apply_action_dropout_mask, we use EPSILON to prevent outputting a zero vector. And if you have action_dropout_rate set to 0, action_dist is the output of a softmax and should not be zero either.

davidlvxin commented 4 years ago

I have printed the tensor sample_action_dist in this code. I found that some rows in this tensor are all zeros, which leads to the "invalid multinomial distribution (sum of probabilities <= 0)" error.
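
A check of this kind can be run just before sampling to locate the offending rows. The sketch below uses nested Python lists in place of the real sample_action_dist tensor so that it runs without a GPU; `zero_sum_rows` is a hypothetical helper and the row values are made up for illustration:

```python
def zero_sum_rows(dist):
    """Return the indices of rows whose probabilities sum to <= 0."""
    return [i for i, row in enumerate(dist) if sum(row) <= 0]


# Made-up stand-in for sample_action_dist (one row per example in the batch).
sample_action_dist = [
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0],  # every action removed: multinomial sampling cannot proceed
    [0.0, 1.0, 0.0],
]

bad_rows = zero_sum_rows(sample_action_dist)
print(bad_rows)
```

Printing the masks for just the rows reported by such a check keeps the debug output small on large batches.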

The reason why they are all zeros is that action_keep_mask are all zeros and action_keep_mask are all ones in this code.

I don't think (e1, r, e1) will cause this problem. Maybe I should give a clearer example.

Suppose we have (A, r, B), (A, r, C), (A, r, D), (A, r, F), (A, r_1, E), (E, r_2, F), (F, r, C), (F, r, D), (F, r, F) in a small KG. The training triple is (A, r, B). We start from entity A, and the maximum number of hops is 3.

The model's search path is A->r_1->E->r_2->F, and this is the last step. F has three actions, i.e., (r, C), (r, D), (r, F). But C, D and F are not e_t (e_t is B), and the triples (A, r, C), (A, r, D), (A, r, F) all exist in the KG. Hence, entities C, D and F are all masked, and we have no action to select.
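
The toy example can be checked with a few lines of plain Python. This is only a sketch of the masking rule as described in this thread, not the repository's tensor implementation; `actions_from` and `last_step_actions` are hypothetical names:

```python
# Toy KG from the example, as (head, relation, tail) triples.
triples = {
    ("A", "r", "B"), ("A", "r", "C"), ("A", "r", "D"), ("A", "r", "F"),
    ("A", "r_1", "E"), ("E", "r_2", "F"),
    ("F", "r", "C"), ("F", "r", "D"), ("F", "r", "F"),
}


def actions_from(entity):
    """All (relation, tail) actions available at `entity`."""
    return sorted((r, t) for h, r, t in triples if h == entity)


def last_step_actions(current, e_s, q, e_t):
    """Actions left after masking other known answers of the query (e_s, q, ?)."""
    known_answers = {t for h, r, t in triples if h == e_s and r == q}
    return [
        (r, t) for r, t in actions_from(current)
        if t not in known_answers or t == e_t
    ]


remaining = last_step_actions("F", e_s="A", q="r", e_t="B")
print(remaining)  # → []
```

With e_t set to C instead of B, the same call leaves the single action (r, C), which shows that the empty action set only arises when the target is unreachable from the current entity.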

todpole3 commented 4 years ago

Got it. The reason we designed get_false_negative_mask is to prevent the model from getting punished for selecting C or D or F.

A possible fix here is to add EPSILON to sample_action_dist such that it is turned into a uniform vector if zero (the agent randomly chooses C, D or F), and to zero the loss if action_mask in the last step is zero (no matter which one is chosen by the agent, do not count that in the loss term).
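
The second half of this suggestion, dropping fully masked examples from the loss, might look roughly like the sketch below. It uses plain Python lists instead of tensors; `loss_weights` and `masked_mean` are hypothetical names, and averaging over only the contributing examples is an assumption of this sketch rather than something specified in the thread:

```python
def loss_weights(last_step_action_mask):
    """1.0 for rows with at least one unmasked action, 0.0 for fully masked rows."""
    return [
        1.0 if any(m > 0 for m in row) else 0.0
        for row in last_step_action_mask
    ]


def masked_mean(per_example_loss, weights):
    """Mean loss over the examples whose weight is non-zero."""
    total = sum(weights)
    if total == 0:
        return 0.0  # nothing in the batch contributes
    return sum(l * w for l, w in zip(per_example_loss, weights)) / total


# Made-up batch: the second example has no available action in the last step.
action_mask = [[1, 0, 1], [0, 0, 0]]
weights = loss_weights(action_mask)
batch_loss = masked_mean([2.0, 5.0], weights)
print(weights, batch_loss)
```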

davidlvxin commented 4 years ago

Yeah, I think it is OK :). The above trick only fails with a very small probability. And in most cases, it works fine.

todpole3 commented 4 years ago

Cool. I'll keep this issue open and push a fix at some point.

Thanks again for identifying it.

Lee-zix commented 4 years ago

Following up on the question above, I want to confirm the impact of the false-negative mask during inference. Does the false-negative mask mean the model only produces correct results under the filtered metric? If I want results under the raw metric, should the false-negative mask be disabled?

todpole3 commented 4 years ago

"Is the false-negative mask actually make the model only get the right results under the filter metric???"

In our implementation we made sure to only include "false-negative" examples from the training KG (given triples). Surprisingly many datasets have query overlap between train/dev/test.

Lee-zix commented 4 years ago

Thanks very much for your reply! The false-negative mask filters the other answers in the train_objects/train_subjects but the filter operation filters the other answers in the all_objects/all_subjects. That is the only difference between the two operations in the inference cycle.
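
The difference between the two filters can be illustrated with a small ranking sketch. The scores and entity sets below are made up, `rank_of` is a hypothetical helper, and the names train_objects and all_objects are used only in the sense described in this comment:

```python
def rank_of(target, scores, filtered_out=()):
    """1-based rank of `target` by descending score, ignoring filtered entities."""
    target_score = scores[target]
    better = [
        e for e, s in scores.items()
        if e != target and e not in filtered_out and s > target_score
    ]
    return 1 + len(better)


# Made-up scores for the query (A, r, ?), where the test answer is B.
scores = {"B": 0.2, "C": 0.9, "D": 0.5, "X": 0.1}
train_objects = {"C"}       # other answers seen in the training KG
all_objects = {"C", "D"}    # other answers across train/dev/test

raw_rank = rank_of("B", scores)
train_filtered_rank = rank_of("B", scores, train_objects)
fully_filtered_rank = rank_of("B", scores, all_objects)
print(raw_rank, train_filtered_rank, fully_filtered_rank)
```

In this toy case the rank of B improves from 3 (raw) to 2 (training answers removed) to 1 (all known answers removed).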

chrislouis0106 commented 2 years ago

"Got it. The reason we design get_false_negative_mask is to prevent the model from getting punished by selecting C or D or F.

A possible fix here is to add EPSILON to sample_action_dist such that it is turned into a uniform vector if zero (the agent randomly choose C, D or F). And zero the loss if action_mask in the last step is zero (no matter which one is chosen by the agent, do not count that in the loss term)."

sample_action_dist = \
        action_dist * action_keep_mask + ops.EPSILON * (1 - action_keep_mask) * action_mask + ops.EPSILON
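
To see what this one-line change does to a fully masked row, the formula can be evaluated element-wise in plain Python. This is a sketch only: the EPSILON value is assumed for illustration (the repository defines its own ops.EPSILON), and `apply_formula` and `normalize` are hypothetical helpers:

```python
EPSILON = 1e-10  # assumed value for illustration only


def apply_formula(action_dist, action_keep_mask, action_mask, eps=EPSILON):
    """Element-wise version of the proposed sample_action_dist expression."""
    return [
        p * k + eps * (1 - k) * m + eps
        for p, k, m in zip(action_dist, action_keep_mask, action_mask)
    ]


def normalize(row):
    """Scale a row so that its entries sum to 1."""
    total = sum(row)
    return [x / total for x in row]


# Made-up degenerate row: no probability mass and every action masked.
row = apply_formula([0.0, 0.0, 0.0], [0, 0, 0], [0, 0, 0])
probs = normalize(row)
print(sum(row) > 0, probs)
```

Because the trailing EPSILON is added to every entry, the degenerate row becomes uniform and its sum is positive. The same term also gives masked slots a tiny non-zero weight in ordinary rows, which may or may not be acceptable depending on how padding actions are handled.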

Humble2967738843 commented 3 months ago

Greetings from 2024. Has this problem been solved? I have the same problem when using triples from my own dataset.

Humble2967738843 commented 3 months ago

Epoch 0
... (more hidden) ...C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\MultinomialKernel.cu:214: block: [57,0,0], thread: [0,0,0] Assertion `sum > accZero` failed.
... (more hidden) ...
Traceback (most recent call last):
  File "D:\Anaconda3\envs\PSAgent\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\Anaconda3\envs\PSAgent\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\doctors\github_projects\PSAgented\data\github_projects\PSAgent\src\experiments.py", line 874, in <module>
    run_experiment(args)
  File "D:\doctors\github_projects\PSAgented\data\github_projects\PSAgent\src\experiments.py", line 855, in run_experiment
    train(lf)
  File "D:\doctors\github_projects\PSAgented\data\github_projects\PSAgent\src\experiments.py", line 255, in train
    lf.run_train(train_data, dev_data)
  File "D:\doctors\github_projects\PSAgented\data\github_projects\PSAgent\src\learn_framework.py", line 117, in run_train
    loss = self.loss(mini_batch)
  File "D:\doctors\github_projects\PSAgented\data\github_projects\PSAgent\src\rl\graph_search\pg.py", line 130, in loss
    output = self.rollout(e1, r, e2, num_steps=self.num_rollout_steps)
  File "D:\doctors\github_projects\PSAgented\data\github_projects\PSAgent\src\rl\graph_search\pg.py", line 238, in rollout
    sample_outcome = self.sample_action(db_outcomes, inv_offset)
  File "D:\doctors\github_projects\PSAgented\data\github_projects\PSAgent\src\rl\graph_search\pg.py", line 309, in sample_action
    sample_outcome = sample(action_space, action_dist)
  File "D:\doctors\github_projects\PSAgented\data\github_projects\PSAgent\src\rl\graph_search\pg.py", line 294, in sample
    sample_action_dist = apply_action_dropout_mask(action_dist, action_mask)
  File "D:\doctors\github_projects\PSAgented\data\github_projects\PSAgent\src\rl\graph_search\pg.py", line 281, in apply_action_dropout_mask
    action_keep_mask = var_cuda(rand > self.action_dropout_rate).float()
  File "D:\doctors\github_projects\PSAgented\data\github_projects\PSAgent\src\utils\ops.py", line 121, in var_cuda
    return Variable(x, requires_grad=requires_grad).cuda()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

salesforce / MultiHopKG

RuntimeError: CUDA error: device-side assert triggered #17