rocmarchive / realcaffe2

The repo is obsolete. Use at your own risk.
https://github.com/pytorch/pytorch
Apache License 2.0

BatchMatMul is failing #75

Closed: rohithkrn closed this issue 6 years ago

rohithkrn commented 6 years ago

Errors occur for the float16 type.

GPU error

ERROR: test_batch_matmul (__main__.TestBatchMatMul)

Traceback (most recent call last):
  File "../caffe2/python/operator_test/matmul_op_test.py", line 145, in test_batch_matmul
    @given(
  File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 1049, in wrapped_test
    state.run()
  File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 820, in run
    falsifying_example.__expected_traceback,
  File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 581, in execute
    result = self.test_runner(data, run)
  File "/usr/local/lib/python2.7/dist-packages/hypothesis/executors.py", line 58, in default_new_style_executor
    return function(data)
  File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 573, in run
    return test(*args, **kwargs)
  File "../caffe2/python/operator_test/matmul_op_test.py", line 145, in test_batch_matmul
    @given(
  File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 520, in test
    result = self.test(*args, **kwargs)
  File "../caffe2/python/operator_test/matmul_op_test.py", line 195, in test_batch_matmul
    relax_fp16_check(self.assertReferenceChecks, gc, op, [X, Y, trans_a, trans_b, dtype], matmul_ref)
  File "../caffe2/python/operator_test/matmul_op_test.py", line 192, in relax_fp16_check
    check_func(*args, threshold=threshold, **kwargs)
  File "/data/rocm_caffe2/build/caffe2/python/hypothesis_test_util.py", line 574, in assertReferenceChecks
    workspace.RunNetOnce(net)
  File "/data/rocm_caffe2/build/caffe2/python/workspace.py", line 216, in RunNetOnce
    StringifyProto(net),
  File "/data/rocm_caffe2/build/caffe2/python/workspace.py", line 199, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at math_hip.cc:400] . Unsupported math type
Error from operator:
input: "X" input: "Y" output: "out" name: "" type: "BatchMatMul" arg { name: "trans_a" i: 0 } arg { name: "trans_b" i: 0 } device_option { device_type: 4 }

CPU error

ERROR: test_batch_matmul (__main__.TestBatchMatMul)

Traceback (most recent call last):
  File "../caffe2/python/operator_test/matmul_op_test.py", line 145, in test_batch_matmul
    @given(
  File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 1049, in wrapped_test
    state.run()
  File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 820, in run
    falsifying_example.__expected_traceback,
  File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 581, in execute
    result = self.test_runner(data, run)
  File "/usr/local/lib/python2.7/dist-packages/hypothesis/executors.py", line 58, in default_new_style_executor
    return function(data)
  File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 573, in run
    return test(*args, **kwargs)
  File "../caffe2/python/operator_test/matmul_op_test.py", line 145, in test_batch_matmul
    @given(
  File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 520, in test
    result = self.test(*args, **kwargs)
  File "../caffe2/python/operator_test/matmul_op_test.py", line 195, in test_batch_matmul
    relax_fp16_check(self.assertReferenceChecks, gc, op, [X, Y, trans_a, trans_b, dtype], matmul_ref)
  File "../caffe2/python/operator_test/matmul_op_test.py", line 192, in relax_fp16_check
    check_func(*args, threshold=threshold, **kwargs)
  File "/data/rocm_caffe2/build/caffe2/python/hypothesis_test_util.py", line 574, in assertReferenceChecks
    workspace.RunNetOnce(net)
  File "/data/rocm_caffe2/build/caffe2/python/workspace.py", line 216, in RunNetOnce
    StringifyProto(net),
  File "/data/rocm_caffe2/build/caffe2/python/workspace.py", line 199, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at operator.h:640] . Unsupported type of tensor: caffe2::__f16
Error from operator:
input: "X" input: "Y" output: "out" name: "" type: "BatchMatMul" arg { name: "trans_a" i: 0 } arg { name: "trans_b" i: 0 } device_option { }
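Both float16 failures are raised while the net runs (workspace.RunNetOnce), before any comparison happens: the HIP math code rejects the math type and the CPU operator rejects the caffe2::__f16 tensor type. Once the operator accepts float16, the test compares its output against a Python reference (matmul_ref) through relax_fp16_check and assertReferenceChecks. The stdlib-only sketch below illustrates what such a reference check does. The helper names (to_fp16, batch_matmul_ref, matches_reference), the element-wise tolerance formula and both threshold values are assumptions for illustration, not the code in matmul_op_test.py; float16 rounding is simulated with the struct module's "e" format.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value
    (struct's "e" format), then widen it back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def transpose(M):
    return [list(row) for row in zip(*M)]


def shape(T):
    # Shape of a [batch][rows][cols] nested list.
    return (len(T), len(T[0]), len(T[0][0]))


def batch_matmul_ref(X, Y, trans_a=False, trans_b=False):
    """Hypothetical reference for BatchMatMul on [batch][rows][cols] lists:
    optionally transpose each matrix, then multiply batch by batch."""
    out = []
    for A, B in zip(X, Y):
        if trans_a:
            A = transpose(A)
        if trans_b:
            B = transpose(B)
        # zip(*B) iterates over the columns of B.
        out.append([[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
                    for row in A])
    return out


def matches_reference(out, X, Y, trans_a=False, trans_b=False,
                      threshold=1e-4, fp16=False):
    """Check |out - ref| <= threshold * (1 + |ref|) element-wise.
    Both threshold values are illustrative assumptions."""
    if fp16:
        threshold = 5e-2  # half precision keeps only ~3 significant digits
    ref = batch_matmul_ref(X, Y, trans_a, trans_b)
    if shape(out) != shape(ref):
        return False
    return all(abs(o - r) <= threshold * (1 + abs(r))
               for out_mat, ref_mat in zip(out, ref)
               for out_row, ref_row in zip(out_mat, ref_mat)
               for o, r in zip(out_row, ref_row))


# Usage: simulate a float16 operator output from float16 inputs.
Xh = [[[to_fp16(0.1 * (i + j + 1)) for j in range(3)] for i in range(2)]]
Yh = [[[to_fp16(0.3 * (i - j)) for j in range(2)] for i in range(3)]]
outh = [[[to_fp16(v) for v in row] for row in M]
        for M in batch_matmul_ref(Xh, Yh)]
print(matches_reference(outh, Xh, Yh, fp16=True))
```

Computing the reference from the already-rounded float16 inputs isolates the error the operator itself introduces (accumulation and output rounding), which is what a relaxed threshold has to absorb.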

For float32, the CPU test passes, but the GPU test hangs during the gradient check.
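As background on the step where the GPU run hangs: a gradient check compares the gradient produced by an operator's backward pass with a finite-difference estimate. The plain-Python sketch below shows that idea for a single batch element of C = A·B. It is not Caffe2's gradient checker; the loss L = sum(C * G), the step size eps, the tolerance and all function names are assumptions, and it does not reproduce the hang itself.

```python
def matmul(A, B):
    # Plain-Python matrix product; zip(*B) yields the columns of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(M):
    return [list(r) for r in zip(*M)]


def loss(A, B, G):
    # Scalar loss L = sum(C * G) with C = A @ B, so dL/dC = G.
    C = matmul(A, B)
    return sum(c * g for c_row, g_row in zip(C, G) for c, g in zip(c_row, g_row))


def analytic_grads(A, B, G):
    # Backward pass of C = A @ B: dL/dA = G @ B^T, dL/dB = A^T @ G.
    return matmul(G, transpose(B)), matmul(transpose(A), G)


def numeric_grad(f, M, eps=1e-3):
    # Central differences, perturbing one entry of a copy of M at a time.
    grad = [[0.0] * len(M[0]) for _ in M]
    for i in range(len(M)):
        for j in range(len(M[0])):
            plus = [row[:] for row in M]
            minus = [row[:] for row in M]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (f(plus) - f(minus)) / (2 * eps)
    return grad


def gradient_check(A, B, G, tol=1e-6):
    # Compare analytic and numeric gradients for both inputs.
    dA, dB = analytic_grads(A, B, G)
    nA = numeric_grad(lambda M: loss(M, B, G), A)
    nB = numeric_grad(lambda M: loss(A, M, G), B)
    return all(abs(x - y) <= tol * (1 + abs(y))
               for an, nu in ((dA, nA), (dB, nB))
               for a_row, n_row in zip(an, nu)
               for x, y in zip(a_row, n_row))


A = [[1.0, 2.0, 0.5], [3.0, 4.0, -1.0]]
B = [[5.0, 6.0], [7.0, 8.0], [-2.0, 0.25]]
G = [[0.5, -1.0], [2.0, 0.25]]
print(gradient_check(A, B, G))
```

Because L is linear in each input separately, central differences match the analytic gradient up to floating-point rounding, so a tight tolerance suffices in this float64 sketch.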

rohithkrn commented 6 years ago

@ashishfarmer @petrex, do you have any insights? It looks like it's calling rocBLAS.

ashishfarmer commented 6 years ago

I am looking at this issue. One of the problems seems to be a synchronization issue.

petrex commented 6 years ago

https://github.com/pytorch/pytorch