undertherain / benchmarker

modular framework for [not only] deep learning performance benchmarking
http://blackbird.pw/performance
Mozilla Public License 2.0

resnet50 fails with tensorflow-2.4.0 #171

Open shwetasalaria opened 3 years ago

shwetasalaria commented 3 years ago

Environment: tensorflow-2.4.0, cuda-11.0, cudnn-8.0

Command: python3 -m benchmarker --framework=tensorflow --problem=resnet50 --problem_size=512 --batch_size=32 --mode=training --nb_epoch=10 --gpu=0

2021-04-30 21:12:17.979104: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-04-30 21:12:21.513158: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-30 21:12:21.514769: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-30 21:12:23.221432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:02:00.0 name: GeForce RTX 2070 computeCapability: 7.5 coreClock: 1.62GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2021-04-30 21:12:23.221488: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-04-30 21:12:23.225541: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-04-30 21:12:23.225609: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-04-30 21:12:23.227289: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-04-30 21:12:23.227600: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-04-30 21:12:23.231434: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-04-30 21:12:23.232264: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-04-30 21:12:23.233020: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-04-30 21:12:23.234373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-30 21:12:23.235786: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-30 21:12:23.239937: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-30 21:12:23.240661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:02:00.0 name: GeForce RTX 2070 computeCapability: 7.5 coreClock: 1.62GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2021-04-30 21:12:23.240698: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-04-30 21:12:23.240725: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-04-30 21:12:23.240746: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-04-30 21:12:23.240767: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-04-30 21:12:23.240788: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-04-30 21:12:23.240808: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-04-30 21:12:23.240828: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-04-30 21:12:23.240849: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-04-30 21:12:23.242094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-30 21:12:23.242143: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-04-30 21:12:23.894649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-30 21:12:23.894725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-04-30 21:12:23.894739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-04-30 21:12:23.896958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7274 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:02:00.0, compute capability: 7.5)
2021-04-30 21:12:27.374050: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-04-30 21:12:27.374783: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2299995000 Hz
2021-04-30 21:12:31.499036: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-04-30 21:12:31.988404: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-04-30 21:12:31.998561: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-04-30 21:12:34.285746: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_grad_filter_ops.cc:1095 : Not found: No algorithm worked!
2021-04-30 21:12:34.330271: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_grad_filter_ops.cc:1095 : Not found: No algorithm worked!
2021-04-30 21:12:34.546668: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_grad_filter_ops.cc:1095 : Not found: No algorithm worked!
2021-04-30 21:12:34.584786: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_grad_filter_ops.cc:1095 : Not found: No algorithm worked!
2021-04-30 21:12:34.784376: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_grad_filter_ops.cc:1095 : Not found: No algorithm worked!
2021-04-30 21:12:34.831404: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_grad_filter_ops.cc:1095 : Not found: No algorithm worked!
2021-04-30 21:12:35.128160: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_grad_filter_ops.cc:1095 : Not found: No algorithm worked!
Traceback (most recent call last):
  File "/home/shweta/install/python-3.7.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/shweta/install/python-3.7.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/shweta/dl/alex/website/bench/benchmarker/benchmarker/benchmarker.py", line 92, in <module>
    run(sys.argv[1:])
  File "/home/shweta/dl/alex/website/bench/benchmarker/benchmarker/benchmarker.py", line 86, in run
    benchmark.measure_power_and_run()
  File "/home/shweta/dl/alex/website/bench/benchmarker/benchmarker/frameworks/i_benchmark.py", line 18, in measure_power_and_run
    results = self.run()
  File "/home/shweta/dl/alex/website/bench/benchmarker/benchmarker/frameworks/do_tensorflow.py", line 94, in run
    verbose=1,
  File "/home/shweta/install/python-3.7.7/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "/home/shweta/install/python-3.7.7/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/home/shweta/install/python-3.7.7/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/shweta/install/python-3.7.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/home/shweta/install/python-3.7.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/shweta/install/python-3.7.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/home/shweta/install/python-3.7.7/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError: No algorithm worked!
  [[node gradient_tape/resnet50/conv5_block1_0_conv/Conv2D/Conv2DBackpropFilter (defined at /dl/alex/website/bench/benchmarker/benchmarker/frameworks/do_tensorflow.py:94) ]] [Op:__inference_train_function_8521]

Function call stack: train_function
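
For triage, the repeated warnings in the log can be tallied by the source location that raised them. The snippet below is only a rough sketch and not part of benchmarker: the function name `tally_op_failures` and the `sample_log` string are made up for illustration, and the regular expression assumes the `OP_REQUIRES failed at <file>:<line>` wording seen in this log. It only counts warnings; it does not diagnose why cuDNN found no usable algorithm.

```python
import re
from collections import Counter

# Assumed pattern: matches the "OP_REQUIRES failed at <file>:<line>" text
# that appears in the TensorFlow warnings quoted in this issue.
OP_REQUIRES_RE = re.compile(r"OP_REQUIRES failed at (\S+:\d+)")


def tally_op_failures(log_text):
    """Count OP_REQUIRES failures per source location (file:line).

    Works whether the log has one entry per line or has been pasted
    as a single run of text, because it scans the whole string.
    """
    return Counter(OP_REQUIRES_RE.findall(log_text))


# Hypothetical input: two warning entries copied from the log above.
sample_log = (
    "2021-04-30 21:12:34.285746: W tensorflow/core/framework/op_kernel.cc:1763] "
    "OP_REQUIRES failed at conv_grad_filter_ops.cc:1095 : Not found: No algorithm worked!\n"
    "2021-04-30 21:12:34.330271: W tensorflow/core/framework/op_kernel.cc:1763] "
    "OP_REQUIRES failed at conv_grad_filter_ops.cc:1095 : Not found: No algorithm worked!\n"
)

failures = tally_op_failures(sample_log)
for site, count in failures.most_common():
    print(f"{site}: {count} failure(s)")
```

Run against the full log in this report, the same function would show that all seven warnings come from a single location, conv_grad_filter_ops.cc:1095, which matches the Conv2DBackpropFilter node named in the final NotFoundError.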