Open xuyifangreeneyes opened 4 years ago
System information
- OS Platform and Distribution: Linux Ubuntu 18.04
- TensorFlow version: 1.13.1 (with GPU support)
- Python version: 3.7.7
- CUDA/cuDNN version: 10.0 / 7
- GPU: Tesla T4
Encountered problem
I tried both
pip install blocksparse
and building from source. After installation, I can run import blocksparse in Python and pass most tests. However, when I run test/blocksparse_conv_test.py
, the following error occurred.(tf13) ubuntu@xxx:~/blocksparse$ python test/blocksparse_conv_test.py /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint8 = np.dtype([("qint8", np.int8, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint16 = np.dtype([("qint16", np.int16, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint32 = np.dtype([("qint32", np.int32, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. 
np_resource = np.dtype([("resource", np.ubyte, 1)]) WARNING:tensorflow:From /home/ubuntu/anaconda3/lib/python3.7/contextlib.py:82: TensorFlowTestCase.test_session (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version. Instructions for updating: Use `self.session()` or `self.cached_session()` instead. 2020-07-19 15:22:55.214905: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA 2020-07-19 15:22:55.236910: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz 2020-07-19 15:22:55.237482: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55b7bd771c50 executing computations on platform Host. Devices: 2020-07-19 15:22:55.237509: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined> 2020-07-19 15:22:55.362344: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-07-19 15:22:55.363172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59 pciBusID: 0000:00:1e.0 totalMemory: 14.75GiB freeMemory: 14.65GiB 2020-07-19 15:22:55.363193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0 2020-07-19 15:22:55.393925: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-07-19 15:22:55.393972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 2020-07-19 15:22:55.393981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N 2020-07-19 15:22:55.394077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14241 
MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5) 2020-07-19 15:22:55.395613: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55b7bbfe59e0 executing computations on platform CUDA. Devices: 2020-07-19 15:22:55.395639: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla T4, Compute Capability 7.5 test1 2020-07-19 15:22:55.429514: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at blocksparse_conv_op.cc:320 : Internal: device kernel image is invalid ERROR:tensorflow:device kernel image is invalid [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]] [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]] Caused by op 'test1/F4B4/BlocksparseConv', defined at: File "test/blocksparse_conv_test.py", line 213, in <module> tf.test.main() File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/test.py", line 64, in main return _googletest.main(argv) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 100, in main benchmark.benchmarks_main(true_main=main_wrapper) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/benchmark.py", line 371, in benchmarks_main true_main() File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 99, in main_wrapper return app.run(main=g_main, argv=args) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv)) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 70, in g_main return unittest_main(argv=argv) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 101, in __init__ self.runTests() File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 271, in runTests self.result = testRunner.run(self.test) 
File "/home/ubuntu/anaconda3/lib/python3.7/unittest/runner.py", line 176, in run test(result) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__ return self.run(*args, **kwds) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run test(result) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__ return self.run(*args, **kwds) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run test(result) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 676, in __call__ return self.run(*args, **kwds) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 628, in run testMethod() File "test/blocksparse_conv_test.py", line 126, in testBlocksparseConv op = bs_conv_op(devF, devI) File "/home/ubuntu/blocksparse/blocksparse/conv.py", line 511, in __call__ dimF=F.get_shape().as_list(), fshare=self.fshared, bshare=self.bshared, debug=self.debug File "<string>", line 471, in blocksparse_conv File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper op_def=op_def) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func return func(*args, **kwargs) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op op_def=op_def) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__ self._traceback = tf_stack.extract_stack() InternalError (see above for traceback): device kernel image is invalid [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]] [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]] Es ====================================================================== ERROR: testBlocksparseConv (__main__.BlocksparseConvTest) 
---------------------------------------------------------------------- Traceback (most recent call last): File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call return fn(*args) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn options, feed_dict, fetch_list, target_list, run_metadata) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.InternalError: device kernel image is invalid [[{{node test1/F4B4/BlocksparseConv}}]] [[{{node test1/F4B4/BlocksparseConv}}]] During handling of the above exception, another exception occurred: Traceback (most recent call last): File "test/blocksparse_conv_test.py", line 127, in testBlocksparseConv devO = sess.run( op ) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/test_util.py", line 1368, in run return super(ErrorLoggingSession, self).run(*args, **kwargs) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 929, in run run_metadata_ptr) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run feed_dict_tensor, options, run_metadata) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run run_metadata) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InternalError: device kernel image is invalid [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]] [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]] Caused by op 'test1/F4B4/BlocksparseConv', defined at: File "test/blocksparse_conv_test.py", line 213, in <module> 
tf.test.main() File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/test.py", line 64, in main return _googletest.main(argv) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 100, in main benchmark.benchmarks_main(true_main=main_wrapper) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/benchmark.py", line 371, in benchmarks_main true_main() File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 99, in main_wrapper return app.run(main=g_main, argv=args) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv)) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 70, in g_main return unittest_main(argv=argv) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 101, in __init__ self.runTests() File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 271, in runTests self.result = testRunner.run(self.test) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/runner.py", line 176, in run test(result) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__ return self.run(*args, **kwds) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run test(result) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__ return self.run(*args, **kwds) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run test(result) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 676, in __call__ return self.run(*args, **kwds) File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 628, in run testMethod() File "test/blocksparse_conv_test.py", line 126, in testBlocksparseConv op = bs_conv_op(devF, devI) File 
"/home/ubuntu/blocksparse/blocksparse/conv.py", line 511, in __call__ dimF=F.get_shape().as_list(), fshare=self.fshared, bshare=self.bshared, debug=self.debug File "<string>", line 471, in blocksparse_conv File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper op_def=op_def) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func return func(*args, **kwargs) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op op_def=op_def) File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__ self._traceback = tf_stack.extract_stack() InternalError (see above for traceback): device kernel image is invalid [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]] [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]] ---------------------------------------------------------------------- Ran 2 tests in 0.231s FAILED (errors=1, skipped=1)
Besides, an invalid memory access sometimes happens when running
examples/simple.py
. Here is the output without error.(tf13) ubuntu@xxx:~/blocksparse$ python examples/simple.py /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint8 = np.dtype([("qint8", np.int8, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint16 = np.dtype([("qint16", np.int16, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint32 = np.dtype([("qint32", np.int32, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. 
np_resource = np.dtype([("resource", np.ubyte, 1)]) WARNING:tensorflow:From /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. 2020-07-19 15:23:58.994318: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA 2020-07-19 15:23:59.016917: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz 2020-07-19 15:23:59.017474: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56341a1f6330 executing computations on platform Host. Devices: 2020-07-19 15:23:59.017505: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined> 2020-07-19 15:23:59.122639: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-07-19 15:23:59.123458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59 pciBusID: 0000:00:1e.0 totalMemory: 14.75GiB freeMemory: 14.65GiB 2020-07-19 15:23:59.123478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0 2020-07-19 15:23:59.152687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-07-19 15:23:59.152724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 2020-07-19 15:23:59.152735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N 2020-07-19 15:23:59.152835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 
with 14241 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5) 2020-07-19 15:23:59.154217: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5634190aa650 executing computations on platform CUDA. Devices: 2020-07-19 15:23:59.154239: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla T4, Compute Capability 7.5 [array([[-0.00464108, -0.00446517, -0.00446705, ..., -0.00433037, -0.00435545, -0.00431154], [ 0.00696341, 0.00687434, 0.00675924, ..., 0.00679887, 0.00693929, 0.00719775], [ 0.01524079, 0.01537668, 0.01533529, ..., 0.01533816, 0.01512151, 0.01528387], ..., [-0.00238256, -0.00245797, -0.0022754 , ..., -0.00224203, -0.00239737, -0.00237827], [-0.00508011, -0.00536294, -0.00516913, ..., -0.00537378, -0.00533525, -0.00540836], [ 0.01230985, 0.01257054, 0.01233936, ..., 0.01226609, 0.012429 , 0.01214379]], dtype=float32)]
And here is the output when the error appears.
(tf13) ubuntu@xxx:~/blocksparse$ python examples/simple.py /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint8 = np.dtype([("qint8", np.int8, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint16 = np.dtype([("qint16", np.int16, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint32 = np.dtype([("qint32", np.int32, 1)]) /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. 
np_resource = np.dtype([("resource", np.ubyte, 1)]) WARNING:tensorflow:From /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. 2020-07-19 15:24:31.054902: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA 2020-07-19 15:24:31.076918: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz 2020-07-19 15:24:31.077469: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56258d1e4480 executing computations on platform Host. Devices: 2020-07-19 15:24:31.077494: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined> 2020-07-19 15:24:31.176438: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-07-19 15:24:31.177252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59 pciBusID: 0000:00:1e.0 totalMemory: 14.75GiB freeMemory: 14.65GiB 2020-07-19 15:24:31.177274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0 2020-07-19 15:24:31.208119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-07-19 15:24:31.208164: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 2020-07-19 15:24:31.208176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N 2020-07-19 15:24:31.208278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 
with 14241 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5) 2020-07-19 15:24:31.209716: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56258c098570 executing computations on platform CUDA. Devices: 2020-07-19 15:24:31.209739: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla T4, Compute Capability 7.5 2020-07-19 15:24:31.685492: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered 2020-07-19 15:24:31.685539: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1 Aborted (core dumped)
I suspect these problems are due to my TensorFlow and CUDA versions. Could anyone help me? Thanks a lot!
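One way to narrow down the version hypothesis is to compare the GPU's compute capability (7.5 for the Tesla T4 in the logs above) with the architectures the custom kernels were compiled for, since "device kernel image is invalid" is often reported when a CUDA binary contains no code for the running device. The following is only a minimal sketch: the helper names and the `built_archs` list are hypothetical and are not taken from the blocksparse build files, and the check ignores embedded PTX, which the driver could JIT-compile for a newer GPU.

```python
import re


def parse_compute_capability(device_desc):
    """Extract (major, minor) from a TensorFlow device description string,
    e.g. "... name: Tesla T4, ..., compute capability: 7.5"."""
    match = re.search(r"compute capability:\s*(\d+)\.(\d+)", device_desc)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_compatible_sass(capability, built_archs):
    """Return True if any architecture in built_archs is binary compatible
    with the device: same major version and a minor version that is not
    newer than the device's. built_archs is a hypothetical list of
    (major, minor) pairs; fill it in from the flags actually used to build."""
    major, minor = capability
    return any(m == major and n <= minor for m, n in built_archs)


def local_gpu_descriptions():
    """List GPU description strings using TensorFlow 1.x's device_lib.
    The import is local so the helpers above work without TensorFlow."""
    from tensorflow.python.client import device_lib
    return [d.physical_device_desc
            for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]


# Description string copied from the log output above.
desc = ("device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, "
        "compute capability: 7.5")
capability = parse_compute_capability(desc)

# Hypothetical architecture list, for illustration only.
example_archs = [(3, 5), (5, 0), (6, 0)]
print(capability, has_compatible_sass(capability, example_archs))
```

If the check returns False for the architectures the library was really built with, rebuilding from source with the device's architecture added would be a reasonable next step to try. This would not by itself explain the intermittent illegal memory access in examples/simple.py.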
@xuyifangreeneyes I successfully ran the Docker container by following this issue; however, when running simple.py after installing blocksparse, I had the same problem. I only changed hidden_size in simple.py to 4096*2, and it crashed. I wonder if you found a solution.