namisan / mt-dnn

Multi-Task Deep Neural Networks for Natural Language Understanding
MIT License

Ranking task batch size mismatch in ranking loss #145

Closed saransh-mehta closed 4 years ago

saransh-mehta commented 4 years ago

I was trying to use custom data for passage ranking by creating a task with data_format 'PremiseAndMultiHypothesis'. While running train.py with RankCeCriterion, I'm getting the following error:

02/21/2020 09:35:46 Total number of params: 109483778
02/21/2020 09:35:46 At epoch 0

Traceback (most recent call last):
  File "train.py", line 404, in <module>
    main()
  File "train.py", line 336, in main
    model.update(batch_meta, batch_data)
  File "<path>/mt-dnn/mt_dnn/model.py", line 175, in update
    loss = self.task_loss_criterion[task_id](logits, y, weight, ignore_index=-1)
  File "<path>/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "<path>/mt-dnn/mt_dnn/loss.py", line 86, in forward
    loss = F.cross_entropy(input, target, ignore_index=ignore_index)
  File "<path>/anaconda3/lib/python3.7/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
  File "<path>/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1995, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "<path>/anaconda3/lib/python3.7/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
  File "<path>/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1822, in nll_loss
    .format(input.size(0), target.size(0)))
ValueError: Expected input batch_size (32) to match target batch_size (16).
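
The two sizes in this message are consistent with a simple shape calculation. The following is a minimal sketch, assuming the task head emits logits of shape (batch, n_class) and the criterion flattens them to (-1, pairwise_size) with pairwise_size left at its default of 1. The batch size of 16 is inferred from the error message, and `flattened_rows` is a hypothetical helper, not part of mt-dnn.

```python
def flattened_rows(batch_size, n_class, pairwise_size=1):
    """Rows left after reshaping (batch_size, n_class) logits to (-1, pairwise_size)."""
    total = batch_size * n_class
    if total % pairwise_size != 0:
        raise ValueError("logits cannot be reshaped to (-1, pairwise_size)")
    return total // pairwise_size


batch_size = 16  # assumption: inferred from "target batch_size (16)"

# With n_class: 2, every candidate contributes two values, so the row count doubles.
rows_with_two_classes = flattened_rows(batch_size, n_class=2)
# With n_class: 1, there is one score per candidate and the rows match the targets.
rows_with_one_class = flattened_rows(batch_size, n_class=1)

print(rows_with_two_classes, rows_with_one_class)  # 32 16
```

Under these assumptions, the 32 on the input side would come from 16 candidates times 2 classes, while the target keeps 16 entries.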

I then tried commenting out the reshaping of input and target in RankCeCriterion in loss.py:

class RankCeCriterion(Criterion):
    def __init__(self, alpha=1.0, name='Cross Entropy Criterion'):
        super().__init__()
        self.alpha = alpha
        self.name = name

    def forward(self, input, target, weight=None, ignore_index=-1, pairwise_size=1):

        #input = input.view(-1, pairwise_size)
        #target = target.contiguous().view(-1, pairwise_size)[:, 0]
        #target = target.contiguous()[:,0].view(-1, pairwise_size)

        if weight:
            loss = torch.mean(F.cross_entropy(input, target, reduce=False, ignore_index=ignore_index) * weight)
        else:
            loss = F.cross_entropy(input, target, ignore_index=ignore_index)
        loss = loss * self.alpha
        print(loss)
        return loss

But this triggers a CUDA device-side assert error with the following trace:

02/21/2020 09:46:29 Total number of params: 109483778
02/21/2020 09:46:29 At epoch 0

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [10,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCReduceAll.cuh line=327 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train.py", line 404, in <module>
    main()
  File "train.py", line 336, in main
    model.update(batch_meta, batch_data)
  File "<path>/mt-dnn/mt_dnn/model.py", line 175, in update
    loss = self.task_loss_criterion[task_id](logits, y, weight, ignore_index=-1)
  File "<path>/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "<path>/mt-dnn/mt_dnn/loss.py", line 86, in forward
    print(loss)
  File "<path>/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 82, in __repr__
    return torch._tensor_str._str(self)
  File "<path>/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 300, in _str
    tensor_str = _tensor_str(self, indent)
  File "<path>/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 201, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "<path>/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 87, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCReduceAll.cuh:327
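
The assertion `t >= 0 && t < n_classes` fires when a target value is not a valid class index for the logits passed to the NLL loss. The following is a hedged sketch of a check that could be run on a plain-Python copy of the labels before calling the loss. `find_invalid_targets` is a hypothetical helper, and the sample labels are made up for illustration.

```python
def find_invalid_targets(targets, n_classes, ignore_index=-1):
    """Return (position, value) pairs for labels outside [0, n_classes)."""
    return [
        (i, t)
        for i, t in enumerate(targets)
        if t != ignore_index and not (0 <= t < n_classes)
    ]


# Illustrative labels only: with 2 logit columns, a label of 4 is out of range,
# while -1 is skipped because it equals ignore_index.
sample_targets = [0, 1, 4, -1, 1]
bad = find_invalid_targets(sample_targets, n_classes=2)
print(bad)  # [(2, 4)]
```

Running the failing step on CPU, or with the environment variable CUDA_LAUNCH_BLOCKING=1, usually yields a traceback that points at the call that actually failed rather than at a later print.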

I'm using the following command to run train.py:

python train.py --init_checkpoint=mt_dnn_models/bert_model_base_uncased.pt \
    --data_dir=data/bert_uncased \
    --task_def=task_def.yml \
    --train_datasets=ranking \
    --test_datasets=ranking

task_def.yml

ranking:
    data_format: PremiseAndMultiHypothesis
    encoder_type: BERT
    dropout_p: 0.05
    enable_san: false
    n_class: 2
    metric_meta:
    - ACC
    - MCC
    loss: RankCeCriterion
    task_type: Ranking
    split_names:
    - train
    - dev
    - test

A row of my TSV data, before running prepro_std.py, looks like the following:

28028   0,1,2,3,4,5,6,7,8,9 0,0,0,0,1,0,0,0,0,0 <Premise>    <Hypothesis1>  <Hypothesis2>    <Hypothesis3>....<Hypothesis10>

That is "id"\t"ruids"\t"label"\t"premise"\t"hypothesis1"\t"hypothesis2"...
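
The row layout described above can be sketched as a small parser. This is an illustration of the format as stated in this post, not the loader that mt-dnn uses; `parse_ranking_row` and the sample row are hypothetical.

```python
def parse_ranking_row(line):
    """Split one raw TSV row into id, ruids, labels, premise and hypotheses."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected id, ruids, label, premise and at least one hypothesis")
    uid, ruids, labels, premise = fields[:4]
    hypotheses = fields[4:]
    ruid_list = ruids.split(",")
    label_list = [int(x) for x in labels.split(",")]
    # Each hypothesis needs exactly one ruid and one label.
    if not (len(ruid_list) == len(label_list) == len(hypotheses)):
        raise ValueError("ruids, labels and hypotheses must have the same length")
    return {
        "uid": uid,
        "ruid": ruid_list,
        "label": label_list,
        "premise": premise,
        "hypotheses": hypotheses,
    }


# Made-up row with three hypotheses; the second one is the relevant passage.
row = "28028\t0,1,2\t0,1,0\tpremise text\thyp a\thyp b\thyp c"
parsed = parse_ranking_row(row)
print(parsed["label"].index(1))  # 1
```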

I also have a couple more questions:

  1. What are the ruids for ranking? (This is the 2nd field of the data. To make it work, I generated a range sequence as dummy ids.)
  2. How should n_class be chosen for ranking? (It seemed to me that ranking is modeled as pairwise binary classification, so I chose n_class as 2.)

Classification tasks with the PremiseOnly and PremiseAndOneHypothesis data formats worked fine for me. This issue occurs only with Ranking.

Please help me find a solution. I can provide any additional information required on the issue. Thanks in advance!!

namisan commented 4 years ago

Here, it supports pair-wise ranking. Suppose we have two samples, A (positive) and B (negative); the ruid is the id of A/B, which may be useful for model evaluation. For ranking, the model maps each of A/B to a single relevance score, so you need to set n_class to 1. The objective is to maximize the likelihood of A. Hope this helps. Xiaodong
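
The pair-wise objective described above can be sketched in plain Python. This is an illustration under stated assumptions, not the mt-dnn implementation: each candidate gets one score (n_class = 1), scores are grouped into blocks of pairwise_size, and a softmax cross-entropy within each block maximizes the likelihood of the positive candidate. `rank_ce_loss` is a hypothetical name, and the one-hot label encoding is chosen for clarity; how mt-dnn encodes the target internally is not shown in this thread.

```python
import math


def rank_ce_loss(scores, labels, pairwise_size):
    """Mean softmax cross-entropy over groups of `pairwise_size` candidate scores.

    scores: flat list with one relevance score per candidate (n_class = 1).
    labels: flat list of 0/1 marks with exactly one 1 per group.
    """
    if len(scores) != len(labels) or len(scores) % pairwise_size != 0:
        raise ValueError("scores and labels must align with pairwise_size")
    losses = []
    for start in range(0, len(scores), pairwise_size):
        group = scores[start:start + pairwise_size]
        marks = labels[start:start + pairwise_size]
        if marks.count(1) != 1:
            raise ValueError("each group needs exactly one positive")
        positive = marks.index(1)
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(group)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in group))
        # Negative log-likelihood of the positive candidate within its group.
        losses.append(log_norm - group[positive])
    return sum(losses) / len(losses)


# A (positive) scored above B (negative) gives a smaller loss than the reverse.
good = rank_ce_loss([2.0, 0.0], [1, 0], pairwise_size=2)
bad = rank_ce_loss([0.0, 2.0], [1, 0], pairwise_size=2)
print(good < bad)  # True
```

With one score per candidate, the number of classes seen by the cross-entropy is the group size rather than n_class, which is why a per-candidate n_class of 2 does not fit this objective.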