jiazhihao / TASO

The Tensor Algebra SuperOptimizer for Deep Learning

Apache License 2.0

Cuda error 77, please suggest how to debug #68

Open sergei-mironov opened 4 years ago

sergei-mironov commented 4 years ago

Hi. I applied TASO (commit dce8c4d967f) to several models from onnx-models. One frequent error I got is:

Cuda failure: 77
/workspace/taso/src/cudnn/element_kernel.cu:242
Aborting...

Affected models are: inception-v2-9, mnist-8.log, resnet101-v2-7, resnet18-v2-7, roberta-base-11, shufflenet-9, vgg19-7, yolov4. Since the trivial mnist model is on the list, I suspect that the problem is caused by some environment issue, such as a package version mismatch or the like.

The error message is not very verbose, and the CUDA line it names doesn't look suspicious. I would be glad to provide more debugging information, but unfortunately I'm not an expert in low-level CUDA. Could you please suggest what I can do to collect more information?
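
One low-effort way to collect more information, offered here as a general CUDA debugging sketch rather than a TASO-specific procedure, is to rerun the failing script with the CUDA_LAUNCH_BLOCKING=1 environment variable set. Kernel launches then run synchronously, so an error is reported at the launch that caused it rather than at some later API call. The helper names, the script name, and the log file name below are hypothetical.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so a failure
    # surfaces at the offending launch instead of at a later CUDA call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(argv, log_path):
    """Run argv with blocking launches, saving stdout and stderr to log_path.

    Returns the exit code of the child process.
    """
    with open(log_path, "w") as log:
        proc = subprocess.run(
            argv, env=blocking_env(), stdout=log, stderr=subprocess.STDOUT
        )
    return proc.returncode


# Hypothetical usage (script and model names are placeholders):
#   run_with_blocking_launches(
#       [sys.executable, "optimize_model.py", "mnist-8.onnx"], "taso_debug.log")
```

If the reported file and line change under this setting, the new location is more likely to be the kernel that actually faulted, and the saved log can be attached to the issue.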

Alex-Sol commented 3 years ago

You could modify ele->use_kernel() to return true, so that the elementwise kernel in cudnn is used. This works around the bug.

bool Element::use_kernel(void) const {
    switch (type) {
        case OP_EW_ADD:
            return true;
        case OP_EW_MUL:
        case OP_EW_MAX:
        case OP_EW_MIN:
            break;
        default:
            return false;
    }
......

jiahuiyang commented 3 years ago

Hi @Alex-Sol, I ran into the same problem as @grwlf. After changing the Element::use_kernel function, I hit the following problem with resnet50.

/home/TASO/src/cudnn/cuda_helper.cu:83: void helperSetBroadcastableTensorDescriptor(const taso::Tensor&, const taso::Tensor&, cudnnTensorDescriptor_t): Assertion `input.default_layout()' failed.
Aborted (core dumped)

Could you help me solve this problem?

jiahuiyang commented 3 years ago

I found that the problem is related to the gemm operator. If I comment out some code in __init__.py as follows, the layout problem goes away. But I still need to know how to solve it properly. @Alex-Sol

def _gemm(op, graph, tensors, initializer):
    inputs = _get_inputs(op, graph, tensors, initializer)
    attrs = _parse_attribute(op.attribute)
    if "transA" in attrs and attrs["transA"] == 1:
        inputs[0] = graph.transpose(inputs[0], (1,0), shuffle=True)
    if "transB" in attrs and attrs["transB"] == 1:
        inputs[1] = graph.transpose(inputs[1], (1,0), shuffle=True)
    outputs = graph.matmul(inputs[0], inputs[1])
    if len(inputs) > 2:
        # Bias add disabled to avoid the layout assertion.
        # outputs = graph.add(outputs, inputs[2])
        pass
    return outputs

Alex-Sol commented 3 years ago

@jiahuiyang This may be a bug caused by a mismatch between the dims of the bias and those of the matmul output. I fixed it like this in python/taso/__init__.py:

    if len(inputs) > 2:
        dim = inputs[2].dim(0)
        reshape_bias = graph.reshape(inputs[2], (1,dim))
        outputs = graph.add(outputs, reshape_bias)
    return outputs
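
The shape reasoning behind this fix can be sketched in plain Python. The sketch assumes the usual ONNX Gemm semantics with a 1-D bias of length N added to an (M, N) matmul result, and it ignores the alpha and beta scaling factors, as the converter code in this thread does. It also assumes, without confirmation from the TASO source, that the elementwise add needs both operands to have the same rank. All helper names are hypothetical and none of this is TASO API.

```python
def transpose(m):
    """Transpose a 2-D list-of-lists matrix."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply an (M, K) matrix by a (K, N) matrix."""
    columns = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def reshape_bias(bias):
    """Turn a 1-D bias of length N into a (1, N) matrix,
    mirroring graph.reshape(bias, (1, dim)) in the fix above."""
    return [list(bias)]


def add_broadcast_rows(y, bias_2d):
    """Add a (1, N) bias to every row of an (M, N) matrix."""
    (bias_row,) = bias_2d  # exactly one row is expected
    if any(len(row) != len(bias_row) for row in y):
        raise ValueError("bias length must match the number of output columns")
    return [[v + b for v, b in zip(row, bias_row)] for row in y]


def gemm(a, b, c=None, trans_a=False, trans_b=False):
    """Reference Gemm: optional transposes, matmul, then an optional bias add."""
    if trans_a:
        a = transpose(a)
    if trans_b:
        b = transpose(b)
    y = matmul(a, b)
    if c is not None:
        # Both operands of the add are rank 2 after the reshape.
        y = add_broadcast_rows(y, reshape_bias(c))
    return y


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result = gemm(a, b, c=[10, 20])  # [[29, 42], [53, 70]]
```

Here the (2, 2) product gains the bias [10, 20] on each row only after the bias becomes a (1, 2) matrix, which is the same rank alignment the reshape in the fix performs.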

jiahuiyang commented 3 years ago

Great, thanks.