sergei-mironov opened this issue 4 years ago
You could modify Element::use_kernel() so that it returns true, which makes cuDNN's elementwise kernel get used and avoids this bug:
bool Element::use_kernel(void) const {
  switch (type) {
    case OP_EW_ADD:
      return true;
    case OP_EW_MUL:
    case OP_EW_MAX:
    case OP_EW_MIN:
      break;
    default:
      return false;
  }
  ......
Hi @Alex-Sol, I met the same problem as @grwlf. After changing the Element::use_kernel function, I ran into the following problem with resnet50:
/home/TASO/src/cudnn/cuda_helper.cu:83: void helperSetBroadcastableTensorDescriptor(const taso::Tensor&, const taso::Tensor&, cudnnTensorDescriptor_t): Assertion `input.default_layout()' failed. Aborted (core dumped)
Could you help me to solve this problem?
I found that the problem is related to the gemm operator. If I comment out some code in python/taso/__init__.py as shown below, the layout problem goes away (although the Gemm bias is then simply dropped), but I still need to know how to solve it properly. @Alex-Sol
def _gemm(op, graph, tensors, initializer):
    inputs = _get_inputs(op, graph, tensors, initializer)
    attrs = _parse_attribute(op.attribute)
    if "transA" in attrs and attrs["transA"] == 1:
        inputs[0] = graph.transpose(inputs[0], (1,0), shuffle=True)
    if "transB" in attrs and attrs["transB"] == 1:
        inputs[1] = graph.transpose(inputs[1], (1,0), shuffle=True)
    outputs = graph.matmul(inputs[0], inputs[1])
    # outputs = graph.add(outputs, inputs[2])
    return outputs
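To see what that commented-out line is dealing with: an ONNX Gemm node typically exports its bias C as a 1-D tensor of shape (N,), while the matmul output is 2-D of shape (M, N). A tiny NumPy sketch of the shapes involved (the concrete numbers are arbitrary, and pinning the failure on this rank difference is my reading of the behaviour, not something verified in the TASO sources):

import numpy as np

# Typical ONNX Gemm shapes: A is (M, K), B is (K, N), bias C is rank-1 (N,).
A = np.zeros((4, 8))
B = np.zeros((8, 3))
C = np.zeros((3,))

Y = A @ B                  # matmul output: shape (M, N) == (4, 3)
print(Y.shape, C.shape)    # (4, 3) vs. (3,): the ranks differ

# NumPy broadcasts the rank-1 bias transparently, so the math itself is
# fine; the graph.add in TASO, however, is the call that ends up in
# helperSetBroadcastableTensorDescriptor (the assertion quoted above),
# apparently because the two operands have different ranks.
print((Y + C).shape)       # (4, 3)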
@jiahuiyang This may be a bug caused by a mismatch between the dimensions of the bias and the matmul output. I have fixed it like this in python/taso/__init__.py:
if len(inputs) > 2:
    dim = inputs[2].dim(0)
    reshape_bias = graph.reshape(inputs[2], (1,dim))
    outputs = graph.add(outputs, reshape_bias)
return outputs
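Putting the two snippets together, the patched _gemm handler in python/taso/__init__.py would look roughly like this (a sketch assembled from the pieces quoted above, not a verified upstream patch):

def _gemm(op, graph, tensors, initializer):
    inputs = _get_inputs(op, graph, tensors, initializer)
    attrs = _parse_attribute(op.attribute)
    # Honor the optional transA/transB attributes of the ONNX Gemm node.
    if "transA" in attrs and attrs["transA"] == 1:
        inputs[0] = graph.transpose(inputs[0], (1,0), shuffle=True)
    if "transB" in attrs and attrs["transB"] == 1:
        inputs[1] = graph.transpose(inputs[1], (1,0), shuffle=True)
    outputs = graph.matmul(inputs[0], inputs[1])
    # Gemm's optional bias C arrives as a 1-D tensor of length N; reshape
    # it to (1, N) so its rank matches the matmul output before adding.
    if len(inputs) > 2:
        dim = inputs[2].dim(0)
        reshape_bias = graph.reshape(inputs[2], (1, dim))
        outputs = graph.add(outputs, reshape_bias)
    return outputs

The only change relative to the original handler is the reshape of inputs[2] before the add.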
Great, thanks!
Hi. I applied TASO (commit dce8c4d967f) to several of the models in onnx-models. One frequent error I got is:
Affected models are: inception-v2-9, mnist-8.log, resnet101-v2-7, resnet18-v2-7, roberta-base-11, shufflenet-9, vgg19-7, yolov4. Since the trivial mnist model is in the list, I suspect the problem is caused by some environment issue, such as a package version mismatch or something similar.
The error message is not very verbose, and the CUDA line it names doesn't look suspicious. I would be glad to provide more debugging information, but unfortunately I'm not an expert in low-level CUDA. Could you please suggest what I can do to collect more information?
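For reference, the per-model invocation follows the standard flow from the TASO README (the paths are placeholders, and this is the generic recipe rather than my exact script):

import taso
import onnx

# Load an ONNX model, optimize the graph with TASO, and write the
# optimized model back out as ONNX.
old_model = taso.load_onnx("/path/to/resnet18-v2-7.onnx")
taso_graph = taso.optimize(old_model)
new_model = taso.export_onnx(taso_graph)
onnx.save(new_model, "/path/to/resnet18-v2-7.taso.onnx")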