Open xanderdunn opened 3 years ago
do you have cudnn installed as well?
In the first output above I do, yes; I have cudnn-7 installed. The first example is the machine I use for all of my training runs, and it is fully functional with TensorFlow on GPU. To make sure, I ran `sudo rm -r /usr/local/cuda` and re-installed CUDA 10.2 and cudnn-7. Same result.
In the second example where I have no GPU and no CUDA, cudnn is not installed. I am expecting all of my tests to run on CPU TF_EAGER on the GitHub Actions continuous integration machine.
OK, there are two different issues here:
This was actually a very opaque error occurring in a specific unit test, because my test was calling `model.callAsFunction(_ input: Tensor<Float>)`, which I hadn't actually implemented. To support both continuous and categorical inputs, my model instead implements a custom protocol `SparseAndDenseLayer` with `callAsFunction(continuousInputs: Tensor<Float>, categoricalInputs: [Tensor<Int32>]) -> Tensor<Float>`, as done here in swift-models. I'm not sure why a call to `model(inputs)` even compiled on a struct that didn't implement `callAsFunction(_ input:)` or conform to `Layer`.

The opaqueness of the error and the lack of a stack trace made this difficult to find. I fixed my test to call the model correctly and I no longer see the `Exited with signal code 11`.
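For context, here is a minimal sketch of what a protocol like the one described above could look like. The protocol name and the requirement signature are taken from the text; everything else (the `Module` conformance, comments) is illustrative and may differ from the real project:

```swift
import TensorFlow

// Hypothetical sketch of the SparseAndDenseLayer protocol described above.
// Unlike Layer's callAsFunction(_:), this takes dense (continuous) features
// plus a list of sparse (categorical) index tensors.
public protocol SparseAndDenseLayer: Module {
    func callAsFunction(
        continuousInputs: Tensor<Float>,
        categoricalInputs: [Tensor<Int32>]
    ) -> Tensor<Float>
}
```

Because a conforming struct never declares `callAsFunction(_ input: Tensor<Float>)`, a plain `model(inputs)` call should not type-check against this protocol alone, which is what makes the original compile-then-crash behavior surprising.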
V100 GPU machine, Ubuntu 18.04, CUDA 10.2, cudnn-7:
Test Case 'LayerTests.testMeanSquaredErrorOnRandomValues' passed (7.785 seconds)
Test Suite 'LayerTests' passed at 2020-12-28 07:44:19.787
Executed 1 test, with 0 failures (0 unexpected) in 7.785 (7.785) seconds
Test Suite 'Selected tests' passed at 2020-12-28 07:44:19.787
Executed 1 test, with 0 failures (0 unexpected) in 7.785 (7.785) seconds
2020-12-28 07:44:11.937355: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-12-28 07:44:12.543778: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-28 07:44:12.561256: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2199995000 Hz
2020-12-28 07:44:12.561975: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5649fd253640 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-28 07:44:12.562007: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-12-28 07:44:12.563635: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-12-28 07:44:12.573900: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-28 07:44:12.574754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2020-12-28 07:44:12.574810: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-12-28 07:44:12.577960: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-28 07:44:12.580506: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-12-28 07:44:12.580992: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-12-28 07:44:12.583847: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-12-28 07:44:12.585490: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-12-28 07:44:12.591345: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-12-28 07:44:12.591420: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-28 07:44:12.591997: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-28 07:44:12.592514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-12-28 07:44:13.402013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-28 07:44:13.402053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2020-12-28 07:44:13.402060: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2020-12-28 07:44:13.402202: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-28 07:44:13.402816: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-28 07:44:13.403369: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-28 07:44:13.403885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13460 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:04.0, compute capability: 7.0)
2020-12-28 07:44:13.405731: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564a14cebe70 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-12-28 07:44:13.405755: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2020-12-28 07:44:13.856129: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-28 07:44:14.111203: I tensorflow/compiler/xla/xla_client/xrt_local_service.cc:54] Peer localservice 1 {localhost:34351}
2020-12-28 07:44:14.111446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-28 07:44:14.111467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]
2020-12-28 07:44:14.115319: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:34351}
2020-12-28 07:44:14.115770: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:34351
2020-12-28 07:44:14.116095: I tensorflow/compiler/xla/xla_client/computation_client.cc:202] NAME: CPU:0
2020-12-28 07:44:14.116134: I tensorflow/compiler/xla/xla_client/computation_client.cc:202] NAME: GPU:0
2020-12-28 07:44:19.874952: F ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:647] Non-OK-status: GpuLaunchKernel(BlockReduceKernel<IN_T, OUT_T, num_threads, Op>, num_blocks, num_threads, 0, cu_stream, in, out, in_size, op, init) status: Internal: driver shutting down
Exited with signal code 6
This is still happening on my GPU machine at the completion of all tests, but only when I run all my unit tests in parallel with `swift test --parallel`. The device is set to `.default` for all tests, so I expect they're running on GPU TF_EAGER. A handful of the tests are full models that are trained to convergence on simple datasets. Is it possible that running multiple models simultaneously on the GPU is causing this error?
I replaced all of the `.default` devices in my unit tests with `let testDevice: Device = Device(kind: Device.Kind.CPU, ordinal: 0, backend: Device.Backend.TF_EAGER)`, but the `Exited with signal code 6` error still occurs at the end of all the unit tests:
2020-12-28 08:16:09.823251: I tensorflow/compiler/xla/xla_client/computation_client.cc:202] NAME: CPU:0
2020-12-28 08:16:09.823301: I tensorflow/compiler/xla/xla_client/computation_client.cc:202] NAME: GPU:0
2020-12-28 08:16:20.953706: E tensorflow/stream_executor/stream.cc:338] Error recording event in stream: Error recording CUDA event: UNKNOWN ERROR (4); not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2020-12-28 08:16:20.953756: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: UNKNOWN ERROR (4)
2020-12-28 08:16:20.953766: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:220] Unexpected Event status: 1
Exited with signal code 6
It does not occur on my CPU-only GitHub Actions continuous integration machine.
I'm definitely not the person to ask about this, but IIRC `swift test --parallel` won't work on a GPU.
Thanks @brettkoonce! Is this expected to be the case even when all tensors and models are specified on `Device.Kind.CPU`? Does the mere presence of a GPU break `swift test --parallel`? Maybe it's a conflict during initialization, caused by TensorFlow finding a GPU?
I've been successfully running my unit tests without `--parallel`, so I believe this can be closed if the above is expected behavior.
Swift for TensorFlow 0.12. On an Ubuntu 18.04 machine with CUDA 10.2 installed and a V100 GPU:
On a GitHub Action Ubuntu 18.04 machine with no GPU and no CUDA installed, I get this non-zero exit:
I see this on most runs of my 50 unit tests. This is a problem because my CI builds are being marked as failed when in fact all of the tests are passing.
Has anyone encountered this in continuous integration testing of Swift for TensorFlow projects? I didn't encounter this on Swift for TensorFlow 0.11.