ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Tune] Ray Tune cannot use GPU after first iteration #11559

Closed fshriver closed 4 years ago

fshriver commented 4 years ago

What is the problem?

When running Ray Tune to optimize a hyperparameter, training with TensorFlow succeeds for one iteration (one set of epochs) but fails on subsequent iterations. The following are the logs that Ray and TensorFlow generate, from start to finish, when running the example script (also below) that reproduces the issue:

2020-10-22 13:18:56,820 WARNING worker.py:678 -- OMP_NUM_THREADS=16 is set, this may impact object transfer performance.
2020-10-22 13:18:56,820 WARNING worker.py:682 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
2020-10-22 13:18:57.125682: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-10-22 13:18:57.856547: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.7
2020-10-22 13:18:57.858062: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.7
== Status ==
Memory usage on this node: 60.5/604.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 1/176 CPUs, 6.0/6 GPUs, 0.0/538.87 GiB heap, 0.0/12.84 GiB objects
Result logdir: /scratch/groups/fshriver/fake_test/ray-results/test
Number of trials: 10 (1 RUNNING, 9 PENDING)
+----------------+----------+-------+
| Trial name     | status   | loc   |
|----------------+----------+-------|
| model_aad5dd70 | RUNNING  |       |
| model_aad6082c | PENDING  |       |
| model_aad63054 | PENDING  |       |
| model_aad657c8 | PENDING  |       |
| model_aad67f1e | PENDING  |       |
| model_aad6a6c4 | PENDING  |       |
| model_aad6ce1a | PENDING  |       |
| model_aad6f548 | PENDING  |       |
| model_aad71e42 | PENDING  |       |
| model_aad7524a | PENDING  |       |
+----------------+----------+-------+

(pid=88109) 2020-10-22 13:18:59.346244: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
(pid=88109) 2020-10-22 13:19:00.053185: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.7
(pid=88109) 2020-10-22 13:19:00.054602: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.7
(pid=88109) 2020-10-22 13:19:00.881253: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
(pid=88109) 2020-10-22 13:19:01.427808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties: 
(pid=88109) pciBusID: 0035:04:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
(pid=88109) coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
(pid=88109) 2020-10-22 13:19:01.430577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 1 with properties: 
(pid=88109) pciBusID: 0035:05:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
(pid=88109) coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
(pid=88109) 2020-10-22 13:19:01.433288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 2 with properties: 
(pid=88109) pciBusID: 0035:03:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
(pid=88109) coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
(pid=88109) 2020-10-22 13:19:01.435958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 3 with properties: 
(pid=88109) pciBusID: 0004:06:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
(pid=88109) coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
(pid=88109) 2020-10-22 13:19:01.438677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 4 with properties: 
(pid=88109) pciBusID: 0004:05:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
(pid=88109) coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
(pid=88109) 2020-10-22 13:19:01.441319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 5 with properties: 
(pid=88109) pciBusID: 0004:04:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
(pid=88109) coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
(pid=88109) 2020-10-22 13:19:01.441346: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
(pid=88109) 2020-10-22 13:19:01.441398: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
(pid=88109) 2020-10-22 13:19:01.442728: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
(pid=88109) 2020-10-22 13:19:01.443151: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
(pid=88109) 2020-10-22 13:19:01.444621: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
(pid=88109) 2020-10-22 13:19:01.445725: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
(pid=88109) 2020-10-22 13:19:01.445760: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
(pid=88109) 2020-10-22 13:19:01.478011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
(pid=88109) 2020-10-22 13:19:01.496056: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
(pid=88109) 2020-10-22 13:19:01.503728: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x15b9fac30 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
(pid=88109) 2020-10-22 13:19:01.503758: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
(pid=88109) 2020-10-22 13:19:02.564658: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties: 
(pid=88109) pciBusID: 0035:04:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
(pid=88109) coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
(pid=88109) 2020-10-22 13:19:02.567461: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 1 with properties: 
(pid=88109) pciBusID: 0035:05:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
(pid=88109) coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
(pid=88109) 2020-10-22 13:19:02.570192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 2 with properties: 
(pid=88109) pciBusID: 0035:03:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
(pid=88109) coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
(pid=88109) 2020-10-22 13:19:02.572863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 3 with properties: 
(pid=88109) pciBusID: 0004:06:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
(pid=88109) coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
(pid=88109) 2020-10-22 13:19:02.575574: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 4 with properties: 
(pid=88109) pciBusID: 0004:05:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
(pid=88109) coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
(pid=88109) 2020-10-22 13:19:02.578225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 5 with properties: 
(pid=88109) pciBusID: 0004:04:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
(pid=88109) coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
(pid=88109) 2020-10-22 13:19:02.578264: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
(pid=88109) 2020-10-22 13:19:02.578282: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
(pid=88109) 2020-10-22 13:19:02.578305: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
(pid=88109) 2020-10-22 13:19:02.578321: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
(pid=88109) 2020-10-22 13:19:02.578337: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
(pid=88109) 2020-10-22 13:19:02.578354: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
(pid=88109) 2020-10-22 13:19:02.578368: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
(pid=88109) 2020-10-22 13:19:02.610545: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
(pid=88109) 2020-10-22 13:19:02.610587: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
(pid=88109) 2020-10-22 13:19:06.925420: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1099] Device interconnect StreamExecutor with strength 1 edge matrix:
(pid=88109) 2020-10-22 13:19:06.925457: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105]      0 1 2 3 4 5 
(pid=88109) 2020-10-22 13:19:06.925466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1118] 0:   N Y Y Y Y Y 
(pid=88109) 2020-10-22 13:19:06.925473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1118] 1:   Y N Y Y Y Y 
(pid=88109) 2020-10-22 13:19:06.925480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1118] 2:   Y Y N Y Y Y 
(pid=88109) 2020-10-22 13:19:06.925486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1118] 3:   Y Y Y N Y Y 
(pid=88109) 2020-10-22 13:19:06.925492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1118] 4:   Y Y Y Y N Y 
(pid=88109) 2020-10-22 13:19:06.925498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1118] 5:   Y Y Y Y Y N 
(pid=88109) 2020-10-22 13:19:06.948332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1244] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14756 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0035:04:00.0, compute capability: 7.0)
(pid=88109) 2020-10-22 13:19:06.954053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1244] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14756 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0035:05:00.0, compute capability: 7.0)
(pid=88109) 2020-10-22 13:19:06.959682: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1244] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14756 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0035:03:00.0, compute capability: 7.0)
(pid=88109) 2020-10-22 13:19:06.965200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1244] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14756 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0004:06:00.0, compute capability: 7.0)
(pid=88109) 2020-10-22 13:19:06.970724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1244] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 14756 MB memory) -> physical GPU (device: 4, name: Tesla V100-SXM2-16GB, pci bus id: 0004:05:00.0, compute capability: 7.0)
(pid=88109) 2020-10-22 13:19:06.976287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1244] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 14756 MB memory) -> physical GPU (device: 5, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0)
(pid=88109) 2020-10-22 13:19:06.979585: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a488b8d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
(pid=88109) 2020-10-22 13:19:06.979596: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
(pid=88109) 2020-10-22 13:19:06.979603: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): Tesla V100-SXM2-16GB, Compute Capability 7.0
(pid=88109) 2020-10-22 13:19:06.979610: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): Tesla V100-SXM2-16GB, Compute Capability 7.0
(pid=88109) 2020-10-22 13:19:06.979616: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (3): Tesla V100-SXM2-16GB, Compute Capability 7.0
(pid=88109) 2020-10-22 13:19:06.979622: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (4): Tesla V100-SXM2-16GB, Compute Capability 7.0
(pid=88109) 2020-10-22 13:19:06.979628: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (5): Tesla V100-SXM2-16GB, Compute Capability 7.0
(pid=88109) 2020-10-22 13:19:07.980132: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-22 13:19:08.345017: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-10-22 13:19:08.433764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties: 
pciBusID: 0004:04:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-10-22 13:19:08.436445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 1 with properties: 
pciBusID: 0004:05:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-10-22 13:19:08.439094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 2 with properties: 
pciBusID: 0004:06:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-10-22 13:19:08.441855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 3 with properties: 
pciBusID: 0035:03:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-10-22 13:19:08.443951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 4 with properties: 
pciBusID: 0035:04:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-10-22 13:19:08.446664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 5 with properties: 
pciBusID: 0035:05:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-10-22 13:19:08.446696: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-10-22 13:19:08.446760: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-22 13:19:08.447994: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-22 13:19:08.448375: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-22 13:19:08.449879: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-22 13:19:08.450954: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-22 13:19:08.450996: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-22 13:19:08.481797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-10-22 13:19:08.498369: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-10-22 13:19:08.505830: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x165aa5680 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-22 13:19:08.505854: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-22 13:19:08.506335: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
*** Aborted at 1603387148 (unix time) try "date -d @1603387148" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x38c80001a3a2) received by PID 107426 (TID 0x2000000456c0) from PID 107426; stack trace: ***
    @     0x200049d63484 google::(anonymous namespace)::FailureSignalHandler()
    @     0x2000000504d8 ([vdso]+0x4d7)
    @     0x200000102094 __GI_abort
    @     0x20053de4e44c tensorflow::internal::LogMessageFatal::~LogMessageFatal()
    @     0x20053dcf04cc stream_executor::port::internal_statusor::Helper::Crash()
    @     0x20005d0cd5f4 tensorflow::BaseGPUDeviceFactory::EnablePeerAccess()
    @     0x20005d0d52bc tensorflow::BaseGPUDeviceFactory::CreateDevices()
    @     0x20005d12ff44 tensorflow::DeviceFactory::AddDevices()
    @     0x200533470958 TFE_NewContext
    @     0x20053277e164 _wrap_TFE_NewContext
    @        0x127c35480 _PyCFunction_FastCallDict
    @        0x127c81c64 _PyCFunction_FastCallKeywords
    @        0x127d1f82c call_function
    @        0x127d5a5a4 _PyEval_EvalFrameDefault
    @        0x127c12bc4 PyEval_EvalFrameEx
    @        0x127d152bc fast_function
    @        0x127d1f9ac call_function
    @        0x127d5a5a4 _PyEval_EvalFrameDefault
    @        0x127c12bc4 PyEval_EvalFrameEx
    @        0x127d152bc fast_function
    @        0x127d1f9ac call_function
    @        0x127d5a5a4 _PyEval_EvalFrameDefault
    @        0x127c12bc4 PyEval_EvalFrameEx
    @        0x127d152bc fast_function
    @        0x127d1f9ac call_function
    @        0x127d5a5a4 _PyEval_EvalFrameDefault
    @        0x127c12bc4 PyEval_EvalFrameEx
    @        0x127d152bc fast_function
    @        0x127d1f9ac call_function
    @        0x127d5a5a4 _PyEval_EvalFrameDefault
    @        0x127c12bc4 PyEval_EvalFrameEx
    @        0x127d13b28 _PyEval_EvalCodeWithName
Aborted

As you can see, the main culprit appears to be:

Attempting to fetch value instead of handling error Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal

If I run multiple trials at the same time, this error message is repeated by every running process before the task is aborted. I have confirmed in a separate test that the issue is not with TensorFlow itself: I can build a network similar to the one in the example script and train it for as many epochs as I want without problems. With verbose TensorFlow output turned on in the example script shown below, I can also see that TensorFlow does finish one set of epochs via Ray, so the issue is apparently introduced when Ray finishes its initial evaluation and tries to run the trial again.
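One way to narrow this down is to print, from the driver and from inside the trainable, which GPUs each process is allowed to see. The snippet below is a minimal diagnostic sketch, not a fix. It assumes the relevant setting is the standard `CUDA_VISIBLE_DEVICES` environment variable, which Ray sets for workers that request GPUs. The helper names `visible_cuda_devices` and `describe_process` are hypothetical and are not part of Ray or TensorFlow.

```python
import os


def visible_cuda_devices(environ=None):
    """Return the device IDs listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (CUDA then exposes every GPU)
    and an empty list when it is set but empty (no GPU is exposed).
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Entries are comma separated; tolerate stray whitespace and empty items.
    return [item.strip() for item in raw.split(",") if item.strip()]


def describe_process(environ=None):
    """Build a one-line summary to print from the driver and from each trial."""
    devices = visible_cuda_devices(environ)
    if devices is None:
        shown = "unset (all GPUs visible)"
    elif not devices:
        shown = "empty (no GPUs visible)"
    else:
        shown = ",".join(devices)
    return f"pid={os.getpid()} CUDA_VISIBLE_DEVICES={shown}"


if __name__ == "__main__":
    print(describe_process())
```

Calling `print(describe_process())` at the top of the driver script and again at the start of the trainable makes it possible to compare the two processes. If a process that later initializes TensorFlow reports a different or empty device list, that would be consistent with the `invalid device ordinal` error, though this is only a hypothesis to check.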

Ray version: 0.8.1, TF version: 2.1.0, OS: RHEL-7.6, System: IBM POWER9 (ppc64le).

Note that I can't update to the latest version of Ray or to a different version of TF, since Ray currently isn't built for the POWER9 architecture and IBM doesn't want to fully support their ML libraries. This is a limitation of the system I'm working with.

Other environment information (output from conda list):

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_py-xgboost-mutex         1.0             gpu_645.ge505a9a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
_pytorch_select           2.0             gpu_21932.g39c5d28    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
_tflow_select             2.1.0           gpu_915.g4f6e601    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
absl-py                   0.8.1                    py36_0  
apex                      0.1.0_1.7.0     py36_655.g8cb96a0    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
arrow-cpp                 0.15.1          py36_652.g2ced9b2    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
astor                     0.8.0                    py36_0  
atomicwrites              1.4.0                      py_0  
attrs                     19.3.0                     py_0  
bazel                     0.29.1             671.g8ceea09    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
binutils_impl_linux-ppc64le 2.31.1               he53550c_1  
binutils_linux-ppc64le    2.31.1               he53550c_8  
blas                      1.0                    openblas  
bokeh                     2.0.2                    py36_0  
boost                     1.67.0                   py36_4  
boost-cpp                 1.67.0               h14c3975_4  
brotli                    1.0.7                he6710b0_0  
bzip2                     1.0.8                h7b6447c_0  
c-ares                    1.15.0            h7b6447c_1001  
ca-certificates           2020.7.22                     0  
cachetools                4.1.1                    pypi_0    pypi
caffe                     1.0_1.7.0         5243.gc912bce    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
caffe-base                1.0_1.7.0       gpu_py36_5243.gc912bce    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
cairo                     1.14.12              h8948797_3  
certifi                   2020.6.20                py36_0  
cffi                      1.12.3           py36h2e261b9_0  
chardet                   3.0.4                 py36_1003  
click                     7.0                      py36_0  
cloudpickle               1.2.2                      py_0  
cmake                     3.14.0               h52cb24c_0  
colorama                  0.4.3                      py_0  
coverage                  4.5.4            py36h7b6447c_0  
cryptography              2.9.2            py36h1ba5d50_0  
cudatoolkit               10.2.89            680.g0f7a43a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit-compat        10.2.89             70.g52c6df3    file:///sw/sources/ibm-wml-ce/conda-channel
cudatoolkit-dev           10.2.89            680.g0f7a43a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
cudf                      0.11.0          cuda10.2_py36_676.g765efe2    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
cudnn                     7.6.5_10.2         650.g338a052    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
cuml                      0.11.0          cuda10.2_py36_663.g2f1335f    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
cupy                      6.6.0           py36_624.gd34c158    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
cxxfilt                   0.2.0           py_622.gbc2955e    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
cycler                    0.10.0                   py36_0  
cython                    0.29.17          py36he6710b0_0  
cytoolz                   0.10.1           py36h7b6447c_0  
dali                      0.18            py36_e10a365_1773.gda5b3a6    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
dask                      2.9.2                      py_0  
dask-core                 2.9.2                      py_0  
dask-cuda                 0.11.0          py36_630.g2290d8e    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
dask-cudf                 0.11.0          py36_631.g6fc57e1    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
dask-xgboost              0.1.9           py36_647.g2eb49b6    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
ddl                       1.5.1.3         py36_757_SUMMIT.g959095f    file:///sw/sources/ibm-wml-ce/conda-channel
ddl-tensorflow            1.5.1           py36_1073.g105e407    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
decorator                 4.4.2                      py_0  
distributed               2.9.3                      py_0  
dlpack                    0.2                616.g28dffd9    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
double-conversion         3.1.5              623.g15aab6a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
expat                     2.2.6                he6710b0_0  
fastavro                  0.22.7          py36_621.ga06cd9d    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
fastrlock                 0.4              py36he6710b0_0  
ffmpeg                    4.2.2                h20bf706_0  
filelock                  3.0.12                     py_0  
fontconfig                2.13.0               h9420a91_0  
freeglut                  3.0.0                hf484d3e_5  
freetype                  2.9.1                h8a8886c_0  
fsspec                    0.6.2                      py_0  
funcsigs                  1.0.2                    py36_0  
future                    0.17.1                   py36_0  
gast                      0.2.2                    py36_0  
gcc_impl_linux-ppc64le    7.3.0                he01c8ba_1  
gcc_linux-ppc64le         7.3.0                h48e019a_8  
gettext                   0.19.8.1             h97d3a84_3  
gflags                    2.2.2             1679.g1bcb8ab    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
giflib                    5.1.4                h14c3975_1  
glib                      2.63.1               h5a9c865_0  
glog                      0.3.5             1668.g110e904    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
gmp                       6.1.2                h7f7056e_2  
gnutls                    3.6.5             h71b1129_1002  
google-pasta              0.1.8           py36_622.gd00f35a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
graphite2                 1.3.13               h23475e2_0  
graphsurgeon              0.4.1           py36_690.g29ffa96    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
grpc-cpp                  1.26.0             624.gf93cc79    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
grpcio                    1.16.1           py36hf8bcb03_1  
gxx_impl_linux-ppc64le    7.3.0                h822a55f_1  
gxx_linux-ppc64le         7.3.0                h48e019a_8  
h5pickle                  0.4.2                    pypi_0    pypi
h5py                      2.8.0            py36h8d01980_0  
harfbuzz                  1.8.8                hffaf4a1_0  
hdf5                      1.10.2               hba1933b_1  
heapdict                  1.0.1                      py_0  
horovod                   0.19.0          py36_1101.g9b31c6e    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
hyperopt                  0.2.5                    pypi_0    pypi
hypothesis                3.59.1           py36h39e3cac_0  
icu                       58.2                 he6710b0_3  
idna                      2.8                      py36_0  
imageio                   2.8.0                      py_0  
importlib_metadata        1.5.0                    py36_0  
jasper                    2.0.14               h07fcdf6_1  
jinja2                    2.11.2                     py_0  
joblib                    0.13.2                   py36_0  
jpeg                      9b                   hcb7ba68_2  
jpeg-turbo                2.0.4              644.gdc96f1a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
jsonschema                3.2.0                    py36_1  
keras-applications        1.0.8                      py_0  
keras-base                2.3.1           py36_682.gabf4d2a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
keras-gpu                 2.3.1              682.gabf4d2a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
keras-preprocessing       1.1.0                      py_1  
kiwisolver                1.2.0            py36hfd86e86_0  
krb5                      1.17.1               h597af5e_0  
lame                      3.100                h7b6447c_0  
ld_impl_linux-ppc64le     2.33.1               h0f24833_7  
leveldb                   1.20                 hf484d3e_1  
libboost                  1.67.0               h46d08c1_4  
libcudf                   0.11.0          cuda10.2_659.g7f5e265    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
libcuml                   0.11.0          cuda10.2_632.ga47fed3    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
libcurl                   7.69.1               h20c2e04_0  
libedit                   3.1.20181209         hc058e9b_0  
libevent                  2.1.8              619.g85af581    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
libffi                    3.2.1                hf62a594_5  
libflac                   1.3.1              619.g8a0731d    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
libgcc-ng                 8.2.0                h822a55f_1  
libgfortran-ng            7.3.0                h822a55f_1  
libglu                    9.0.0                hf484d3e_1  
libnvstrings              0.11.0          cuda10.2_628.g7e96cde    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
libogg                    1.3.2                h7b6447c_0  
libopenblas               0.3.6                h5a2b251_1  
libopencv                 3.4.8           py36_784.g5a642ca    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
libopus                   1.3.1                h7b6447c_0  
libpng                    1.6.36               hbc83047_0  
libprotobuf               3.8.0              634.g08dc819    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
librmm                    0.11.0          cuda10.2_624.gca9adfe    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
libsndfile                1.0.28             617.g5711ca6    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
libssh2                   1.9.0                h1ba5d50_1  
libstdcxx-ng              8.2.0                h822a55f_1  
libtiff                   4.1.0                h2733197_0  
libuuid                   1.0.3                h1bed415_2  
libvorbis                 1.3.6                h7b6447c_0  
libvpx                    1.7.0                hf484d3e_0  
libwebp                   1.0.1                h8e7db2f_0  
libxcb                    1.13                 h1bed415_0  
libxgboost-base           0.90            gpu_645.ge505a9a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
libxml2                   2.9.9                hea5a465_1  
llvmlite                  0.31.0           py36hd408876_0  
lmdb                      0.9.22               hf484d3e_1  
locket                    0.2.0                    py36_1  
lz4-c                     1.8.1.2              h14c3975_0  
magma                     2.5.2             1642.g9d81041    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
make                      4.2.1                h14c3975_1  
markdown                  3.1.1                    py36_0  
markupsafe                1.1.1            py36h7b6447c_0  
matplotlib                3.0.3            py36h5429711_0  
mock                      3.0.5                    py36_0  
more-itertools            8.2.0                      py_0  
msgpack-python            1.0.0            py36hfd86e86_1  
nccl                      2.5.6              645.g51c2e94    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
ncurses                   6.2                  he6710b0_1  
nettle                    3.4.1                hbb512f6_0  
networkx                  2.3                        py_0  
ninja                     1.9.0            py36hfd86e86_0  
nomkl                     3.0                           0  
numactl                   2.0.12             628.gb5e1afd    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
numba                     0.47.0           py36h962f231_0  
numpy                     1.17.4           py36hd5be1e1_0  
numpy-base                1.17.4           py36h2f8d375_0  
nvstrings                 0.11.0          cuda10.2_py36_637.g9163edb    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
olefile                   0.46                     py36_0  
onnx                      1.6.0           py36_671.g75d3229    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
openblas                  0.3.6                         1  
openblas-devel            0.3.6                         1  
opencv                    3.4.8           py36_784.g5a642ca    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
openh264                  2.1.0                hd408876_0  
openssl                   1.1.1h               h7b6447c_0  
opt_einsum                3.1.0                      py_0  
packaging                 20.3                       py_0  
pai4sk                    1.6.0           py36_1156.g99299fc    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
pandas                    0.24.2           py36he6710b0_0  
parquet-cpp               1.5.1              629.g650bfd0    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
partd                     1.1.0                      py_0  
pciutils                  3.6.2              627.g804ec60    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
pcre                      8.43                 he6710b0_0  
pillow                    7.0.0            py36haac5956_0  
pip                       20.0.2                   py36_3  
pixman                    0.34.0               h1f8d8dc_3  
pluggy                    0.13.1                   py36_0  
powerai                   1.7.0              679.g5b5a006    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
powerai-license           1.7.0           169_SUMMIT.ga16f7c6    file:///sw/sources/ibm-wml-ce/conda-channel
powerai-rapids            1.7.0              616.g6689446    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
powerai-release           1.7.0              627.g1c389a2    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
powerai-tools             1.7.0              623.g843ad38    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
protobuf                  3.8.0           py36_642.gdc7b773    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
psutil                    5.6.7            py36h7b6447c_0  
py                        1.8.1                      py_0  
py-boost                  1.67.0           py36h04863e7_4  
py-opencv                 3.4.8           py36_784.g5a642ca    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
py-xgboost-base           0.90            gpu_py36_645.ge505a9a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
py-xgboost-gpu            0.90               645.ge505a9a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
pyarrow                   0.15.1          py36_657.gfd1820f    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
pycparser                 2.20                       py_0  
pynvml                    8.0.3           py36_618.g443d2aa    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
pyopenssl                 19.1.0                   py36_0  
pyparsing                 2.4.7                      py_0  
pyrsistent                0.17.3           py36h7b6447c_0  
pysocks                   1.7.1                    py36_0  
pytest                    4.4.2                    py36_0  
python                    3.6.10               ha29dc6b_1  
python-dateutil           2.8.1                      py_0  
python-lmdb               0.98             py36he6710b0_0  
pytorch                   1.3.1            21932.g39c5d28    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
pytorch-base              1.3.1           gpu_py36_21932.g39c5d28    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
pytz                      2020.1                     py_0  
pywavelets                1.1.1            py36h7b6447c_0  
pywget                    3.2                      py36_0  
pyyaml                    5.1.2            py36h7b6447c_0  
ray                       0.8.1            py36h1d8a796_1    powerai
re2                       2019.08.01         619.g030686e    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
readline                  8.0                  h7b6447c_0  
redis-py                  3.5.3                      py_0  
requests                  2.22.0                   py36_1  
rhash                     1.3.8                h1ba5d50_0  
rmm                       0.11.0          cuda10.2_py36_626.g7c0a2df    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
scikit-image              0.15.0           py36he6710b0_0  
scikit-learn              0.22.1           py36h22eb022_0  
scipy                     1.3.1            py36he2b7bc3_0  
setuptools                46.2.0                   py36_0  
shortuuid                 1.0.1                    pypi_0    pypi
simsearch                 1.6.0           py36_882.ga3f4a67    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
six                       1.13.0                   py36_0  
snapml-spark              1.6.0           py_1020.gc01d7a8    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
snappy                    1.1.7                h1532aa0_3  
sortedcontainers          2.1.0                    py36_0  
spectrum-mpi              10.03           64_SUMMIT.g3947c96    file:///sw/sources/ibm-wml-ce/conda-channel
sqlite                    3.31.1               hbc83047_1  
tabulate                  0.8.2                    py36_0  
tbb                       2020.0               hfd86e86_0  
tblib                     1.6.0                      py_0  
tensorboard               2.1.0           py36_3dc74fe_3941.g4f6e601    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tensorflow                2.1.0           gpu_py36_915.g4f6e601    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tensorflow-base           2.1.0           gpu_py36_e5bf8de_72635.gf8ef88c    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tensorflow-benchmarks     0.1             gpu_py_619.g50be9d1    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tensorflow-estimator      2.1.0           py36_7ec4e5d_1463.g4f6e601    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tensorflow-gpu            2.1.0              915.g4f6e601    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tensorflow-probability    0.9.0           py36_356cfdd_3228.g4f6e601    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tensorflow-serving        2.1.0           gpu_655.gf3d82d3    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tensorflow-serving-api    2.1.0           py36_d83512c_5308.gf3d82d3    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tensorrt                  7.0.0.11        py36_690.g29ffa96    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tensorrt-samples          7.0.0.11        py36_690.g29ffa96    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
termcolor                 1.1.0                    py36_1  
thrift-cpp                0.12.0             635.gab3648d    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tk                        8.6.8                hbc83047_0  
toolz                     0.10.0                     py_0  
torchtext                 0.4.0           py36_633.g16a90d7    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
torchvision               0.4.2              653.g7becf3e    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
torchvision-base          0.4.2           gpu_py36_653.g7becf3e    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tornado                   6.0.4            py36h7b6447c_1  
tqdm                      4.36.1                     py_0  
typing                    3.6.4                    py36_0  
typing_extensions         3.7.4.1                  py36_0  
uff                       0.6.5           py36_690.g29ffa96    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
uriparser                 0.9.3              615.g7465fef    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
urllib3                   1.25.8                   py36_0  
werkzeug                  0.16.0                     py_0  
wheel                     0.34.2                   py36_0  
wrapt                     1.11.2           py36h7b6447c_0  
x264                      1!157.20191217       h7b6447c_0  
xz                        5.2.5                h7b6447c_0  
yaml                      0.1.7                h1bed415_2  
zict                      2.0.0                      py_0  
zipp                      3.1.0                      py_0  
zlib                      1.2.11               h7b6447c_3  
zstd                      1.3.7                h0b5b093_0

Reproduction (REQUIRED)

import numpy as np
import ray
from ray import tune
from ray.tune import Trainable

x1 = np.random.uniform(size = (10,20))
y1 = np.random.uniform(size = (10, 2))

ray.init(address = 'auto', redis_password = '5241590000000000')
x1_id = ray.put(x1)
y1_id = ray.put(y1)

def create_model(config):
    import tensorflow as tf
    from tensorflow.keras import Input, Model
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import Adam
    input_x = Input(shape = (20,))  # shape should be a tuple; (20) is just the int 20
    x = Dense(units = 100, activation = 'linear')(input_x)
    x = Dense(units = 200, activation = 'linear')(x)
    y = Dense(units = 2, activation = 'linear')(x)
    model = Model(inputs = input_x, outputs = y)
    model.compile(loss = 'mse', optimizer = 'adam')
    return model

class model(Trainable):

    def _setup(self, config):
        self.config = config
        self.x1 = ray.get(x1_id)
        self.y1 = ray.get(y1_id)

        self.model = create_model(self.config)

    def _train(self):
        self.model.fit(self.x1, self.y1, epochs = 10, verbose = 0)

        predictions = self.model.predict(self.x1)
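        # Placeholder metric for the reproduction: the mean prediction, not a true RMSE.
        mean_prediction = np.mean(predictions)

        return {'mean_loss': mean_prediction}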

search_space = {'nonsense-hyperparam': tune.uniform(0.1, 0.9)}

analysis = tune.run(model, name = 'test', local_dir = './ray-results/', verbose = 1, num_samples = 10, resources_per_trial = {'gpu': 6.0}, stop = {'training_iteration': 5})

richardliaw commented 4 years ago

@fshriver hmm sorry to hear about your setup. Thanks for putting together such a detailed report. Can you enable verbose output from tensorflow and run this again? Also, what if you simply try using 1 GPU per trial instead of 6?
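
One way to get the verbose TensorFlow output asked for here is sketched below. It assumes TensorFlow 2.x and relies on the `TF_CPP_MIN_LOG_LEVEL` environment variable and `tf.debugging.set_log_device_placement`; the helper names `enable_tf_logging` and `log_device_placement` are hypothetical and not part of the original script. Because Ray Tune runs each trial in its own worker process, these would most likely need to be called from inside the trial (for example at the top of `_setup`), not only in the driver script.

```python
import os


def enable_tf_logging(env=None):
    """Set TF_CPP_MIN_LOG_LEVEL so TensorFlow's C++ log messages are not filtered.

    Call this before `import tensorflow`, which is the commonly recommended order.
    """
    env = os.environ if env is None else env
    env["TF_CPP_MIN_LOG_LEVEL"] = "0"  # 0 = show everything; higher values hide more
    return env


def log_device_placement():
    """Ask TensorFlow to log which device each op runs on (assumes TensorFlow 2.x)."""
    import tensorflow as tf  # imported lazily, after the environment is set

    tf.debugging.set_log_device_placement(True)
    # Returns the GPUs TensorFlow can see from this process.
    return tf.config.list_physical_devices("GPU")


# Hypothetical usage inside a trial, e.g. at the start of _setup():
#     enable_tf_logging()
#     print(log_device_placement())
```

Comparing the list of visible GPUs between the first and later iterations may help show whether the devices disappear or merely become unusable.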

fshriver commented 4 years ago

Sure, I've attached the output of the program with verbose = 1 as a text file here. As you can see, it's actually able to see and use the GPUs successfully for the first iteration... it looks like after that is where the issue starts. Perhaps some sort of lock on the resources that the underlying system/Ray don't like? I'm really not familiar enough with the Ray internals to say if that's the case, however.

Also, I'm using 6 GPUs purely because if I set the usage to 1 GPU I get the same log messages you see above, just repeated across 6 different processes. The 6 GPU requirement is just to fill up the node so it captures only one error message.

fshriver commented 4 years ago

So I've worked with my cluster's support group, and one of them suggested that NVIDIA's CUDA Multi-Process Service (MPS) is to blame; specifically, it isn't enabled on our compute nodes by default, which I didn't know. Enabling it appears to make the above issue go away. If someone comes across this error message in the future, the cause is likely related to MPS; check there!

If there are no objections, I'll be closing this issue tomorrow since it's a non-issue.
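
For anyone wanting to check the MPS point above on their own node, a minimal sketch follows. It assumes a Linux system where `ps` is available and that the MPS control daemon runs under its usual binary name, `nvidia-cuda-mps-control`; the helper names `mps_daemon_in` and `list_process_command_lines` are hypothetical. How MPS is actually enabled depends on the cluster, so treat this only as a quick diagnostic.

```python
import subprocess
from typing import Iterable, List

# Usual name of the MPS control daemon binary (assumed not to be renamed on your system).
MPS_DAEMON_NAME = "nvidia-cuda-mps-control"


def mps_daemon_in(command_lines: Iterable[str]) -> bool:
    """Return True if any of the given command lines mentions the MPS control daemon."""
    return any(MPS_DAEMON_NAME in line for line in command_lines)


def list_process_command_lines() -> List[str]:
    """Return the full command line of every running process, as reported by `ps`."""
    # `args=` is used instead of `comm=` because `comm` truncates long process names.
    result = subprocess.run(
        ["ps", "-e", "-o", "args="],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.splitlines()


# Hypothetical usage on a compute node, before calling ray.init():
#     if not mps_daemon_in(list_process_command_lines()):
#         print("MPS control daemon not found on this node")
```

The check is split into a pure function and a `ps` wrapper so the matching logic can be exercised without a GPU node.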