cc @rkooo567 @stephanie-wang any ideas here? I can't think of why this might be.
@soundway, are there logs that say why the worker is dying? Perhaps run dmesg
to see if it's some resource exhaustion error or segfault. The full logs in /tmp/ray
may also have something.
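For anyone following along, here is a minimal sketch of what that dmesg check could look like from Python (it assumes dmesg output is readable without root; purely illustrative, not part of Ray):

import subprocess

def recent_kernel_errors():
    # dump the kernel ring buffer and keep lines that point at OOM kills or segfaults
    out = subprocess.run(["dmesg", "--ctime"], stdout=subprocess.PIPE,
                         universal_newlines=True).stdout
    return [line for line in out.splitlines()
            if any(m in line for m in ("Out of memory", "oom-killer", "segfault"))]

for line in recent_kernel_errors():
    print(line)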
I am not aware of any known issues related to this... Seems pretty bad though. I would also like to see the full logs (if it crashed within 4-5 minutes, full logs for that cluster would be ideal).
Btw, this could be the same issue? https://github.com/ray-project/ray/issues/9293
Past logs are lost along with the instances so I will have to launch new ones. It will take a long time (and potentially many costly runs) for me to reproduce the error with QBert, but I will try. Unfortunately, I don’t think I can provide the full logs when running with our custom environments. That said, I went through several different types of *.err on the different machines, and I can show some findings here for our custom setup.
We use 1x p3.8xlarge instance as trainer, and 4x r5.24xlarge (768GB RAM) for rollouts. Each trainer spawns some number of workers that roughly occupy one r5.24xlarge instance. We run with the default PPO implementation provided by RLLib 1.0 without any modification to the policy/trainer class.
In this failed job, it took roughly 1 hour after launching. Below is the memory usage graph:
Note that for cpuworker0, memory usage goes up significantly at around 03:13, which is also the time training stalls. Not all jobs that failed in the past exhibit this behavior in memory (but it could be a failure on our Grafana side to capture it). This is raylet.err on cpuworker0:
F1008 10:12:38.038853 490 521 malloc.cc:56] failed to find entry in mmap_records for fd 0
*** Check failure stack trace: ***
@ 0x5592b35fc9cd google::LogMessage::Fail()
@ 0x5592b35fdb2c google::LogMessage::SendToLog()
@ 0x5592b35fc6a9 google::LogMessage::Flush()
@ 0x5592b35fc8c1 google::LogMessage::~LogMessage()
@ 0x5592b35b4ad9 ray::RayLog::~RayLog()
@ 0x5592b3364676 plasma::GetMmapSize()
@ 0x5592b3345964 plasma::PlasmaStore::ReturnFromGet()
@ 0x5592b3345331 boost::asio::detail::wait_handler<>::do_complete()
@ 0x5592b38fa46f boost::asio::detail::scheduler::do_run_one()
@ 0x5592b38fb971 boost::asio::detail::scheduler::run()
@ 0x5592b38fc9a2 boost::asio::io_context::run()
@ 0x5592b334b1ee plasma::PlasmaStoreRunner::Start()
@ 0x7f8eac0449e0 (unknown)
@ 0x7f8eab9d16db start_thread
@ 0x7f8eab2daa0f clone
Note that 10:12:38 is the UTC time so it’s the same as the 3:12 AM time shown in the graph. It seems like the memory usage on cpuworker0 begins to rise after this error message. In its raylet.out, we see over 11K messages that are similar to this throughout the whole hour of training:
W1008 10:12:11.417868 490 533 object_manager.cc:777] ReceiveObjectChunk index 3 of object 4b55b794aacfcfb629ffe3bd0300000003000000 failed: Chunk already referenced by another thread.
W1008 10:12:11.418118 490 537 object_manager.cc:777] ReceiveObjectChunk index 7 of object 4b55b794aacfcfb629ffe3bd0300000003000000 failed: Chunk already referenced by another thread.
W1008 10:12:11.418407 490 531 object_manager.cc:777] ReceiveObjectChunk index 9 of object 4b55b794aacfcfb629ffe3bd0300000003000000 failed: object already exists in the plasma store
W1008 10:12:11.418475 490 534 object_manager.cc:777] ReceiveObjectChunk index 8 of object 4b55b794aacfcfb629ffe3bd0300000003000000 failed: object already exists in the plasma store
But I also see similar messages in other workers' logs (though with fewer of them, around 2K-8K).
Around this time, we see the following messages across all workers:
2020-10-08 10:13:08,003 WARNING worker.py:1072 The node with node id 3ddf99132c312ccc8af79c66f826b2dc622d5ee4 has been marked dead because the detector has missed too many heartbeats from it.
I verified that that node ID is indeed associated with cpuworker0. The python-core-driver logs across all machines are filled with messages like the one below for the whole hour of training:
W1008 09:16:25.774077 548 2218 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
On driver node (trainer0), dashboard.err is filled with these errors — though we never used the dashboard:
2020-10-08 09:15:24,315 WARNING node_stats.py:317 -- Unexpected channel data received, channel: b'ERROR_INFO:*', data: Failed to load data because of 'ascii' codec can't decode byte 0xec in position 11: ordinal not in range(128)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/dashboard/dashboard.py", line 697, in run
timeout=2)
File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 826, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1602148527.471726244","description":"Error received from peer ipv4:172.28.95.82:62498","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Deadline Exceeded","grpc_status":4}"
And its monitor.err is filled with these:
Monitor: could not find ip for client 03fc8b37f77539bd022caa10fbd19aa23fbf18d0
Monitor: could not find ip for client dd1aa84a12ed78a2f57c548f93210a1848c136d3
Monitor: could not find ip for client 03fc8b37f77539bd022caa10fbd19aa23fbf18d0
Monitor: could not find ip for client dd1aa84a12ed78a2f57c548f93210a1848c136d3
Monitor: could not find ip for client 03fc8b37f77539bd022caa10fbd19aa23fbf18d0
For 4 of the python-core-worker log files on the driver (I believe these are the 4 trials?), they are much larger than the others and contain these messages after cpuworker0's raylet error (I sampled a few messages in chronological order):
I1008 10:13:08.228957 771 1473 direct_actor_transport.cc:187] Failing pending tasks for actor 6120b21703000000
I1008 10:13:08.229079 771 1473 actor_manager.cc:127] received notification on actor, state: DEAD, actor_id: 5c3e836b03000000, ip address: 172.27.134.161, port: 10001, worker_id: 788ca7ba903276d5382c515217a58828ce64bd8b, raylet_id: 3ddf99132c312ccc8af79c66f826b2dc622d5ee4, num_restarts: 0
I1008 10:13:08.229198 771 1473 direct_actor_transport.cc:187] Failing pending tasks for actor 5c3e836b03000000
I1008 10:13:08.229315 771 1473 actor_manager.cc:127] received notification on actor, state: DEAD, actor_id: 89cd273a03000000, ip address: 172.27.134.161, port: 10074, worker_id: 740db7a6fa98d329b9930fa8b5c4bedbfa879fc5, raylet_id: 3ddf99132c312ccc8af79c66f826b2dc622d5ee4, num_restarts: 0
I1008 10:13:08.229434 771 1473 direct_actor_transport.cc:187] Failing pending tasks for actor 89cd273a03000000
I1008 10:13:08.229483 771 1473 service_based_accessor.cc:791] Received notification for node id = 3ddf99132c312ccc8af79c66f826b2dc622d5ee4, IsAlive = 0
I1008 10:13:08.229501 771 1473 core_worker.cc:611] Node failure 3ddf99132c312ccc8af79c66f826b2dc622d5ee4
I1008 10:13:08.229929 771 1473 core_worker.cc:619] Object 25f56eccb8469e216120b2170300000001000000 lost due to node failure 3ddf99132c312ccc8af79c66f826b2dc622d5ee4
I1008 10:13:08.231173 771 1473 object_recovery_manager.cc:42] Starting recovery for object c2dfc437c284c9611b114c0f0300000001000000
I1008 10:13:08.231225 771 1473 core_worker.cc:619] Object 65425a548e61bc9f7c13d2ef0300000001000000 lost due to node failure 3ddf99132c312ccc8af79c66f826b2dc622d5ee4
I1008 10:13:08.231236 771 1473 object_recovery_manager.cc:42] Starting recovery for object 65425a548e61bc9f7c13d2ef0300000001000000
I1008 10:13:08.231266 771 1473 object_recovery_manager.cc:63] Lost object 25f56eccb8469e216120b2170300000001000000 has 1 locations
I1008 10:13:08.231350 771 1473 object_recovery_manager.cc:63] Lost object af78fa8b791e72e9fab5fd080300000001000000 has 0 locations
W1008 10:13:09.222667 771 1473 plasma_store_provider.cc:131] Trying to put an object that already existed in plasma: af78fa8b791e72e9fab5fd080300000001000000.
I1008 10:13:09.222728 771 1473 object_recovery_manager.cc:47] Recovery complete for object af78fa8b791e72e9fab5fd080300000001000000
I1008 10:13:09.222817 771 1473 object_recovery_manager.cc:63] Lost object a14243d4a233d5462a56d9370300000001000000 has 1 locations
W1008 10:13:15.225054 771 1473 plasma_store_provider.cc:131] Trying to put an object that already existed in plasma: c2dfc437c284c9611b114c0f0300000001000000.
I1008 10:13:15.225116 771 1473 object_recovery_manager.cc:47] Recovery complete for object c2dfc437c284c9611b114c0f0300000001000000
I1008 10:13:15.225193 771 1473 object_recovery_manager.cc:63] Lost object 65425a548e61bc9f7c13d2ef0300000001000000 has 0 locations
W1008 10:13:16.225339 771 1473 plasma_store_provider.cc:131] Trying to put an object that already existed in plasma: 65425a548e61bc9f7c13d2ef0300000001000000.
I1008 10:13:16.225394 771 1473 object_recovery_manager.cc:47] Recovery complete for object 65425a548e61bc9f7c13d2ef0300000001000000
I1008 10:13:16.225814 771 1473 object_recovery_manager.cc:47] Recovery complete for object 25f56eccb8469e216120b2170300000001000000
I1008 10:13:16.230865 771 1473 object_recovery_manager.cc:47] Recovery complete for object 86b4165d0ea97a2375623aae0300000001000000
I1008 10:13:16.230893 771 1473 object_recovery_manager.cc:47] Recovery complete for object bc30d377993870e56d86ca0d0300000001000000
W1008 10:13:26.228111 771 771 plasma_store_provider.cc:415] Attempted 50 times to reconstruct objects, but some objects are still unavailable. If this message continues to print, it may indicate that object's creating task is hanging, or something wrong happened in raylet backend. 9 object(s) pending: af78fa8b791e72e9fab5fd0803000000010000000e3f1cc1fb95e6965c3e836b030000000100000065425a548e61bc9f7c13d2ef0300000001000000c2dfc437c284c9611b114c0f03000000010000005b9b655a2e23f00d1d1d4f360300000001000000ca2eb887c575027089cd273a03000000010000001329ca24d0247e20e2196c2b030000000100000005dc1fa5de4fc0dc64947d35030000000100000070215c046e5ef261a14cb5b10300000001000000, etc.
W1008 10:14:16.252666 771 771 plasma_store_provider.cc:415] Attempted 100 times to reconstruct objects, but some objects are still unavailable. If this message continues to print, it may indicate that object's creating task is hanging, or something wrong happened in raylet backend. 9 object(s) pending: af78fa8b791e72e9fab5fd0803000000010000000e3f1cc1fb95e6965c3e836b030000000100000065425a548e61bc9f7c13d2ef0300000001000000c2dfc437c284c9611b114c0f03000000010000005b9b655a2e23f00d1d1d4f360300000001000000ca2eb887c575027089cd273a03000000010000001329ca24d0247e20e2196c2b030000000100000005dc1fa5de4fc0dc64947d35030000000100000070215c046e5ef261a14cb5b10300000001000000, etc.
…
W1008 10:33:28.116824 771 771 plasma_store_provider.cc:415] Attempted 1250 times to reconstruct objects, but some objects are still unavailable. If this message continues to print, it may indicate that object's creating task is hanging, or something wrong happened in raylet backend. 9 object(s) pending: af78fa8b791e72e9fab5fd0803000000010000000e3f1cc1fb95e6965c3e836b030000000100000065425a548e61bc9f7c13d2ef0300000001000000c2dfc437c284c9611b114c0f03000000010000005b9b655a2e23f00d1d1d4f360300000001000000ca2eb887c575027089cd273a03000000010000001329ca24d0247e20e2196c2b030000000100000005dc1fa5de4fc0dc64947d35030000000100000070215c046e5ef261a14cb5b10300000001000000, etc
Another thing that's worth noting is that since 0.8.7, we frequently see these messages in our console:
(pid=1064, ip=172.28.10.161) 2020-10-08 00:58:26,460 INFO (unknown file):0 -- gc.collect() freed 7962 refs in 0.5344230659998175 seconds
(pid=1302, ip=172.28.10.161) 2020-10-08 00:58:26,465 INFO (unknown file):0 -- gc.collect() freed 7890 refs in 0.5290356460000112 seconds
(pid=1424, ip=172.28.10.161) 2020-10-08 00:58:26,465 INFO (unknown file):0 -- gc.collect() freed 7962 refs in 0.5253929880000214 seconds
(pid=703, ip=172.28.10.161) 2020-10-08 00:58:26,485 INFO (unknown file):0 -- gc.collect() freed 8102 refs in 0.5600109100000736 seconds
(pid=1096, ip=172.28.10.161) 2020-10-08 00:58:26,466 INFO (unknown file):0 -- gc.collect() freed 352 refs in 0.5303756779999276 seconds
Given that some of the gc.collect() calls can take several seconds as training progresses, we aren't sure whether this is expected behavior after the upgrade; we just want to check if this is now the norm. According to the raylet.out log on the driver, this seems to happen every 10 seconds.
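For context, those log lines time a full garbage collection. A minimal sketch of the same style of measurement, runnable inside any worker (nothing Ray-specific assumed):

import gc
import time

start = time.perf_counter()
freed = gc.collect()  # full collection; reports the number of unreachable objects found
print(f"gc.collect() freed {freed} refs in {time.perf_counter() - start:.4f} seconds")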
Other than the logs shown here, most other .err files are either empty or uneventful, though I could’ve easily overlooked things. Please let me know if there’s any particular log file I should be looking at or provide.
Our assumption has been that these issues came after 0.8.5 since we’ve run many very large jobs in 0.8.5 and haven’t encountered any problem.
Can you also check if you have any meaningful logs in gcs_server.err.
And yeah, we've changed how actors are managed since version 0.8.6 (in a more centralized way). I guess that change could be related to this issue.
And do you have any info about which process uses the memory the most when it starts exploding?
gcs_server.err is empty.
I see messages in gcs_server.out after the plasma store error (and nothing else for 1 hour), and here are some sampled messages in chronological order:
I1008 10:13:07.989955 452 452 gcs_node_manager.cc:389] Removing node, node id = 3ddf99132c312ccc8af79c66f826b2dc622d5ee4
W1008 10:13:07.990010 452 452 gcs_placement_group_manager.cc:328] Node 3ddf99132c312ccc8af79c66f826b2dc622d5ee4 failed, rescheduling the placement groups on the dead node.
W1008 10:13:07.990022 452 452 gcs_actor_manager.cc:731] Node 3ddf99132c312ccc8af79c66f826b2dc622d5ee4 failed, reconstructing actors.
W1008 10:13:07.990080 452 452 gcs_actor_manager.cc:800] Actor is failed 2ab78bd103000000 on worker fada8da402dfd2f16d559a2f6c61278c4615cb6b at node 3ddf99132c312ccc8af79c66f826b2dc622d5ee4, need_reschedule = 1, remaining_restarts = 0
I1008 10:13:09.561029 452 452 gcs_worker_manager.cc:31] Reporting worker failure, worker id = 0b4efae1b623ef4c58fa10a39f07fbd06332fcff, node id = a471284266f5a2831a321480a5f5713c780770cf, address = 172.30.242.220
I1008 10:13:09.561241 452 452 gcs_worker_manager.cc:31] Reporting worker failure, worker id = 61d3289626bf566577c55e20db746f7ff89703bc, node id = a471284266f5a2831a321480a5f5713c780770cf, address = 172.30.242.220
I1008 10:13:09.561323 452 452 gcs_worker_manager.cc:31] Reporting worker failure, worker id = 659362b6c0464da8fc94a2e960bc04beb427a4c3, node id = 03fc8b37f77539bd022caa10fbd19aa23fbf18d0, address = 172.28.95.82
I1008 10:13:09.561383 452 452 gcs_worker_manager.cc:31] Reporting worker failure, worker id = 264a99e504c7d20652ca66ed3f0fe8f91a4af445, node id = a471284266f5a2831a321480a5f5713c780770cf, address = 172.30.242.220
I1008 10:13:09.571002 452 452 gcs_actor_manager.cc:680] Worker 128052d1c62b55829a01c5914df3a5840f49f7fe on node 03fc8b37f77539bd022caa10fbd19aa23fbf18d0 intentional exit.
W1008 10:13:09.571030 452 452 gcs_actor_manager.cc:800] Actor is failed 3ad7973003000000 on worker 128052d1c62b55829a01c5914df3a5840f49f7fe at node 03fc8b37f77539bd022caa10fbd19aa23fbf18d0, need_reschedule = 0, remaining_restarts = 0
I1008 10:13:09.571097 452 452 gcs_actor_manager.cc:680] Worker cd56eaaea46fa38d650ef69abde2b30a4c0c7471 on node a471284266f5a2831a321480a5f5713c780770cf intentional exit.
I1008 10:13:09.587844 452 452 gcs_actor_manager.cc:1065] Erasing actor 4f4801cc03000000 owned by c853f6e1f94d565152fc28c613ed85e9019342f0
I1008 10:13:09.587901 452 452 gcs_actor_manager.cc:558] Destroying actor, actor id = db2b89c703000000
I1008 10:13:09.587916 452 452 gcs_actor_manager.cc:1065] Erasing actor db2b89c703000000 owned by c853f6e1f94d565152fc28c613ed85e9019342f0
I1008 10:13:09.587986 452 452 gcs_actor_manager.cc:558] Destroying actor, actor id = 2da0d27603000000
I1008 10:13:09.588001 452 452 gcs_actor_manager.cc:1065] Erasing actor 2da0d27603000000 owned by c853f6e1f94d565152fc28c613ed85e9019342f0
I1008 10:13:09.588011 452 452 gcs_actor_manager.cc:558] Destroying actor, actor id = 5295519a03000000
W1008 10:13:09.597388 452 452 gcs_actor_manager.cc:542] Worker c853f6e1f94d565152fc28c613ed85e9019342f0 failed, destroying actor child.
W1008 10:13:09.597404 452 452 gcs_actor_manager.cc:542] Worker c853f6e1f94d565152fc28c613ed85e9019342f0 failed, destroying actor child.
W1008 10:13:09.597421 452 452 gcs_actor_manager.cc:542] Worker c853f6e1f94d565152fc28c613ed85e9019342f0 failed, destroying actor child.
I1008 10:13:19.293956 452 452 gcs_actor_manager.cc:104] Registering actor, job id = 03000000, actor id = 26b50a7003000000
I1008 10:13:19.295037 452 452 gcs_actor_manager.cc:109] Registered actor, job id = 03000000, actor id = 26b50a7003000000
I1008 10:13:19.295799 452 452 gcs_actor_manager.cc:127] Creating actor, job id = 03000000, actor id = 26b50a7003000000
I1008 10:13:19.296147 452 452 gcs_actor_scheduler.cc:196] Start leasing worker from node a471284266f5a2831a321480a5f5713c780770cf for actor 26b50a7003000000
I1008 10:13:19.320003 452 452 gcs_actor_manager.cc:104] Registering actor, job id = 03000000, actor id = 9f34da8403000000
I1008 10:13:19.320806 452 452 gcs_actor_manager.cc:109] Registered actor, job id = 03000000, actor id = 9f34da8403000000
I1008 10:13:19.321512 452 452 gcs_actor_manager.cc:127] Creating actor, job id = 03000000, actor id = 9f34da8403000000
I1008 10:13:19.321856 452 452 gcs_actor_scheduler.cc:196] Start leasing worker from node 09bc5ba21e2bb67134f86367b9cf8ad53fd503a4 for actor 9f34da8403000000
I1008 10:13:19.325047 452 452 gcs_actor_scheduler.cc:240] Finished leasing worker from 09bc5ba21e2bb67134f86367b9cf8ad53fd503a4 for actor 9f34da8403000000
I1008 10:13:19.325140 452 452 core_worker_client_pool.cc:28] Connected to 172.27.122.67:10022
I1008 10:13:19.326241 452 452 gcs_actor_scheduler.cc:322] Start creating actor 9f34da8403000000 on worker a27ce1ebec1c1efce3eeef588e2b1ed7e69d6204 at node 09bc5ba21e2bb67134f86367b9cf8ad53fd503a4
I don't know which process uses the most memory. But it seems like the plasma error happened before the drastic increase in memory, not as a result of it (memory usage peaked about 10 minutes after everything went haywire). This assumes the timestamps between the workers and Grafana are synced correctly.
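For next time, here is a rough sketch of how one might rank processes on a node by resident memory (assumes psutil is installed there; not part of Ray):

import psutil

# keep only processes whose memory info we are allowed to read
procs = [p for p in psutil.process_iter(["name", "memory_info"]) if p.info["memory_info"]]
procs.sort(key=lambda p: p.info["memory_info"].rss, reverse=True)
for p in procs[:10]:
    print(p.pid, p.info["name"], f"{p.info['memory_info'].rss / 2**20:.0f} MiB")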
I see. So I think what happened is: one of your nodes ran out of memory for some reason, the actors on that node died, and that made your script fail (because you called ray.get on a dead actor's ref).
There were drastic changes in how the actor lifecycle is managed from 0.8.6 onward, but based on your logs that doesn't seem to be related (the basic idea is that all actors are now managed by gcs_server, and it looks like that worked correctly).
@ericl I don't know the details of the object broadcasting issues, but is there any way that's related to this failure?
Also, @soundway, when you ran with 0.8.5, did you see the same memory explosion issue?
also cc @stephanie-wang @barakmich
With 0.8.5 all memory usages are stable for long running jobs with many actors/nodes.
What would cause the plasma error I showed earlier (i.e. failed to find entry in mmap_records for fd 0)? I didn't find anything too useful on Google. That seems like the root cause, as it has the earliest timestamp of all events. OOM seems to be the aftermath, and I have jobs that failed without the spikes in memory.
I've been trying to get logs for the vanilla setup of QBert+PPO with the reproduction script I provided at the beginning of this issue (1 x p3.2xlarge + 1 x c5.18xlarge). It takes about 20-60+ hours to reproduce. The error in raylet.err does look slightly different from when we were running our custom environment, but it's again coming from the plasma store. I removed most of the duplicated IOError: 14 messages. The memory usage also began to increase around the unix timestamp when the error was thrown (though it didn't reach its peak until a few minutes later).
E1013 08:51:42.808821 490 490 object_manager.cc:173] b5a2fc76f2fb8fc0afaed5c60200000003000000 attempted to pull an object that's already local.
E1013 09:08:40.582950 490 490 object_manager.cc:233] The object manager with ID fc03b82c14383ce31d349f62b1dd4c4d820fa70c is trying to pull object db00ef4a5318cc44afaed5c60200000003000000 but the object table suggests that this object manager already has the object. The object may have been evicted.
E1013 09:09:50.811242 490 490 object_manager.cc:233] The object manager with ID fc03b82c14383ce31d349f62b1dd4c4d820fa70c is trying to pull object 011a69285e40df7dafaed5c60200000003000000 but the object table suggests that this object manager already has the object. The object may have been evicted.
E1013 09:24:41.045271 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.045377 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.045403 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.045436 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.045487 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.405635 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.405663 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.405692 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.405714 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.405815 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.405865 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.405961 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.406013 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.406117 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.406142 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
*** Aborted at 1602581362 (unix time) try "date -d @1602581362" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x59) received by PID 490 (TID 0x7f7c297fa700) from PID 89; stack trace: ***
@ 0x7f7c323de8a0 (unknown)
@ 0x5590123261cd std::__detail::_Map_base<>::operator[]()
@ 0x55901232b84b plasma::PlasmaStore::ReturnFromGet()
@ 0x55901232b331 boost::asio::detail::wait_handler<>::do_complete()
@ 0x5590128e046f boost::asio::detail::scheduler::do_run_one()
@ 0x5590128e1971 boost::asio::detail::scheduler::run()
@ 0x5590128e29a2 boost::asio::io_context::run()
@ 0x5590123311ee plasma::PlasmaStoreRunner::Start()
@ 0x7f7c32a469e0 (unknown)
@ 0x7f7c323d36db start_thread
@ 0x7f7c31cdca0f clone
During this process I was tweaking some of the settings, and I might've found a potentially separate issue given the new error I'm getting. The only things I changed are max_t and grace_period in the scheduler, to make each trial much shorter (20 and 10 respectively). This results in Check failed: actor_registry_.count(id.ActorId()) > 0 after running for several hours, as opposed to days. Pasting the script again with the changes:
import copy
import gym
import numpy as np
import ray
import ray.rllib.agents.ppo as ppo

if __name__ == '__main__':
    ray.init(address="auto")
    config = copy.deepcopy(ppo.DEFAULT_CONFIG)
    config.update({
        "rollout_fragment_length": 32,
        "train_batch_size": 8192,
        "sgd_minibatch_size": 512,
        "num_sgd_iter": 1,
        "num_workers": 256,
        "num_gpus": 1,
        "num_cpus_per_worker": 0.25,
        "num_cpus_for_driver": 1,
        "model": {"fcnet_hiddens": [1024, 1024]},
        "framework": "torch",
        "lr": ray.tune.sample_from(lambda s: np.random.random()),
    })
    trainer_cls = ppo.PPOTrainer
    config["env"] = "QbertNoFrameskip-v4"
    ray.tune.run(trainer_cls,
                 config=config,
                 fail_fast=True,
                 reuse_actors=False,
                 queue_trials=True,
                 num_samples=10000,
                 scheduler=ray.tune.schedulers.ASHAScheduler(
                     time_attr='training_iteration',
                     metric='episode_reward_mean',
                     mode='max',
                     max_t=5,
                     grace_period=5,
                     reduction_factor=3,
                     brackets=3),
                 )
Here's a sampled copy of raylet.err. I removed most of the duplicates (all IOError: 14). It's unclear if this new error is related to the one we found when running our own custom environment, or when running QBert with long trial durations. The timing of this error is also right before we see the increase in memory usage.
E1013 23:35:50.200567 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:35:50.200614 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:35:50.200753 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 1: Cancelled
E1013 23:35:50.200881 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:35:50.201339 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1013 23:35:50.201450 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 1: Cancelled
E1013 23:35:50.201604 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:35:50.201673 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1013 23:35:50.204852 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:35:50.204876 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:35:50.204927 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:35:50.204946 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:38:41.597056 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1013 23:38:41.597164 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1013 23:38:41.597545 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1013 23:41:24.079257 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:41:24.079306 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:41:24.079330 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:41:24.079609 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:41:24.370667 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
E1014 00:19:38.115996 490 490 object_manager.cc:173] 6e6502fb8f6647db10a4406c0100000003000000 attempted to pull an object that's already local.
E1014 00:20:40.974277 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 00:53:53.745927 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 00:53:53.746177 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 1: Cancelled
E1014 00:53:53.746242 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 00:53:53.746289 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 00:53:53.746348 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1014 00:53:53.746407 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 00:57:30.838287 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 01:06:51.857200 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1014 01:06:51.857285 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1014 01:06:51.858800 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 01:06:51.859259 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 01:06:51.859380 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 01:06:51.859437 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 01:14:52.450027 490 490 object_manager.cc:233] The object manager with ID a9f645669d1cbcc66f116857fc64d00cfbf9f245 is trying to pull object 5d4202bb609c5c6e25e7fd4c0100000003000000 but the object table suggests that this object manager already has the object. The object may have been evicted.
F1014 01:17:02.833871 490 490 node_manager.cc:2833] Check failed: actor_registry_.count(id.ActorId()) > 0
*** Check failure stack trace: ***
@ 0x56294c1749cd google::LogMessage::Fail()
@ 0x56294c175b2c google::LogMessage::SendToLog()
@ 0x56294c1746a9 google::LogMessage::Flush()
@ 0x56294c1748c1 google::LogMessage::~LogMessage()
@ 0x56294c12cad9 ray::RayLog::~RayLog()
@ 0x56294be4d51d ray::raylet::NodeManager::HandleObjectLocal()
@ 0x56294be4d6c5 _ZNSt17_Function_handlerIFvRKN3ray14object_manager8protocol11ObjectInfoTEEZNS0_6raylet11NodeManagerC1ERN5boost4asio10io_contextERKNS0_8ClientIDERKNS7_17NodeManagerConfigERNS0_13ObjectManagerESt10shared_ptrINS0_3gcs9GcsClientEESL_INS0_24ObjectDirectoryInterfaceEEEUlS5_E1_E9_M_invokeERKSt9_Any_dataS5_
@ 0x56294be820df _ZN5boost4asio6detail18completion_handlerIZN3ray30ObjectStoreNotificationManager15ProcessStoreAddERKNS3_14object_manager8protocol11ObjectInfoTEEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
@ 0x56294c47246f boost::asio::detail::scheduler::do_run_one()
@ 0x56294c473971 boost::asio::detail::scheduler::run()
@ 0x56294c4749a2 boost::asio::io_context::run()
@ 0x56294bd97fde main
@ 0x7ffb5ef03b97 __libc_start_main
@ 0x56294bdaebb1 (unknown)
@waldroje @soundway I was investigating this and realized there is a high likelihood the issue is the same as https://github.com/ray-project/ray/issues/11309, which is triggered by creating and destroying many actors.
To validate this, could you try the following?
If this doesn't crash, then it is likely the same issue as 11309. If not, then it would be great if you could include the simplified script that triggers the crash without ASHAScheduler (ideally, simplified to just an rllib train command instead of a script).
Oh forgot to mention, you can also try the ulimit workaround in that issue: https://github.com/ray-project/ray/issues/11309
I’ll run some stuff tomorrow morning and revert
@soundway @waldroje after experimenting with 1.0 vs 0.8.5 more, I think the main difference is we use 2-3x more file descriptors due to the change in the way actors are managed with the GCS service--- it's not really a leak. I'll put more details in the linked bug.
I believe that increasing the file descriptor limit (the ulimit -n value) will resolve the problem; can you try increasing the limit to 10000 or more? The number of fds opened seems to stabilize at just a few thousand.
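For anyone who would rather check and bump the limit from the driver process itself, a minimal sketch using the standard resource module (it can only raise the soft limit up to the existing hard limit):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")
if soft < 10000 <= hard:
    # raise the soft limit toward the hard limit before calling ray.init()
    resource.setrlimit(resource.RLIMIT_NOFILE, (10000, hard))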
I upped my file descriptors to ~16k… but it's still crashing, though I'm getting a more specific error now… not something I've seen before:
Ray worker pid: 24628
WARNING:tensorflow:From /home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-10-21 00:25:36,762 ERROR worker.py:372 -- SystemExit was raised from the worker
Traceback (most recent call last):
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2328, in get_attr
    pywrap_tf_session.TF_OperationGetAttrValueProto(self._c_op, name, buf)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Operation 'default_policy/Sum_4' has no attr named '_XlaCompile'.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 331, in _MaybeCompile
    xla_compile = op.get_attr("_XlaCompile")
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2332, in get_attr
    raise ValueError(str(e))
ValueError: Operation 'default_policy/Sum_4' has no attr named '_XlaCompile'.
That's the gist of it… It's late here, and I'll post a more complete set of logs and make sure I'm not making any mistakes… but figured I'd pass that along...
@ericl I did some tests and I have doubts that this is the root cause of our problem, for several reasons (though there might be alternative explanations for my observations).
The ulimit for file descriptors per process on our system is already at 65K according to ulimit -a (and the hard and soft limits are both 65K). Unfortunately, I couldn't raise it any higher due to our infra setup, but this should be more than sufficient. I'm running a few Q-bert jobs with long training iterations and the FIFO scheduler, but it will take a while to know whether they will crash. Will keep you updated.
That said, I looked at the number of file descriptors for two other types of jobs that eventually tend to crash after a long time.
Unfortunately we can't use lsof on our system, so I'm using some approximations.
For setup 1, the largest observed fd number per process doesn't exceed 1700. So the 65K limit per process in theory should be far more than enough. I got it via:
ls /proc/*/task/*/fd /proc/*/fd | sort -nr | head -n 10
The overall number of opened file descriptors according to the command below never exceeds 200K, but I do observe very large fluctuations between trial resets.
cat /proc/sys/fs/file-nr
For setup 2, the max fds per process and the total number of fds (using the commands above) don't exceed 600 and 140K, respectively.
In either case, the max number of fds per process is far below the 65K threshold. The max number of fds across the system is also far below the 78M allowed. Maybe something eventually triggers a sudden increase in fd usage.
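For reference, here is a rough /proc-based substitute for lsof that ranks processes by open descriptor count (just a sketch; it needs permission to read other processes' fd directories):

import os

def fd_counts():
    counts = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            counts[int(pid)] = len(os.listdir(f"/proc/{pid}/fd"))
        except (FileNotFoundError, PermissionError):
            continue  # process exited or not readable by this user
    return counts

for pid, n in sorted(fd_counts().items(), key=lambda kv: -kv[1])[:10]:
    print(pid, n)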
In the past, we observed jobs that crash somewhat early during training (with relatively long min trial durations), and it didn't appear that it was during a trial reset.
Based on these preliminary data, it appears that our problem won't be solved just by increasing ulimit (and we couldn't easily increase it any further anyway). Please let me know if you think there's any issue with the way I approached this.
Thanks for the investigations @soundway. It does indeed sound like a separate issue.
I'm trying to think of how to best reproduce this. Based on what you said, it sounds like it's potentially something triggered by the characteristics of a custom environment, and potentially TensorFlow related. To help me isolate the cause and hopefully get a fast repro, could you try the following?
Probably (1) and (2) are the most critical, since if there's a standalone repro script that runs in a few minutes, I can always try the other steps myself pretty easily.
For (1) I have tried something like that (with varying obs size, model size, environment memory occupancy), but unfortunately it still takes roughly the same amount of time to crash, and I could not get it to crash within minutes. Though I haven't done this too rigorously and I can revisit this.
I haven't tried running long jobs in the cloud with just one instance yet, but I can try that.
I never use TF -- everything I've done here is with Torch (1.6 specifically, but we also see the problem with 1.4). I could try TF as well.
I started (4) with the current setup with Q-bert and those jobs haven't crashed yet, but I will give you updates on this.
I see, minutes would be ideal but even a couple hours isn't too bad. 40 hours is a bit hard to deal with though :/
I tried various configurations today, and I increased my soft/hard ulimit -n to ~63k on head & workers…
Probably the only thing I can say is that for the most part, the same Trial, using fairly limited resources (32 workers), ran fine on a small cluster of roughly 32-150 CPUs… but as the number of nodes increased the Trial would begin breaking, mostly right after starting and before completing a single iteration… Yet a couple of times I was able to get the Trial to run for a long time on a larger cluster…
At no time can I get large resource usage on a large cluster… everything breaks… yet I can't seem to come up with an example that replicates it, as my MockEnv runs fine… so clearly something about the custom env/model, relative to the basic MockEnv, creates problems…
@stephanie-wang Looking through the above posted logs more carefully, there are a couple things that look odd to me:
We seem to be attempting to reconstruct objects, but for RLlib all objects are either ray.put or actor returns. Shouldn't this code path never be triggered?
I'm wondering if the assert here may be incorrect due to race conditions, triggered by GCS actor management. Maybe we should remove it? https://github.com/ray-project/ray/blame/5d7f271e7d275d74e322c146ecc80cc9346238dd/src/ray/raylet/node_manager.cc#L2787
Note that (2) would go away with the new scheduler.
@ericl For 1., in our custom training/environment, we have some code in the training loop that dispatches remote calls and fetches the results via ray.get. Removing that code still leads to crashes (but keeping it does seem to make them happen faster in general). The version of the log I provided was, I believe, using code that included the customization.
The Qbert reproduction script doesn't contain any of such customization. I didn't find any reconstruction related messages in the failed Qbert logs other than the GCS attempting to reconstructing actors after a node has failed.
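For concreteness, the kind of pattern being described is roughly the following (hypothetical names; this is a sketch, not the actual custom code):

import ray

@ray.remote
def side_computation(batch_id):
    # stand-in for the real remote work done alongside training
    return batch_id * 2

def custom_training_step(iteration):
    refs = [side_computation.remote(i) for i in range(4)]
    return ray.get(refs)  # block on the results inside the training loop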
Ok, it could be the reconstruction is just a symptom of the node failure initially caused by the check failure.
@soundway could you try again with the assert removed? (you can use the nightly wheels)
Sure, I can try that. I suppose that would enable me to be more aggressive about resetting trials.
A bit of an update on some of the jobs I've run lately. For Q-Bert, I launched several jobs without trial resets, and after almost a week none has crashed so far. I don't think any of my past Q-bert jobs survived this long when training with the ASHA scheduler.
On the other hand, I also launched some jobs with our environment with long trial durations, and they still crashed after hours to days. On average they seem to take longer to crash, but it's a small sample size.
So I guess the conclusion here is that while resetting trials seems to accelerate the crashing, it's not a necessary condition.
long trial duration, and they still crashed after hours
To clarify, is this infinite duration or just long? The error messages you got earlier should only be occurring at the boundary of trial termination / startup, so I'd be surprised to see them if the trials are running forever.
(I'd be interested in the error logs for crashed infinite duration trials if available).
@ericl Just long. I set a really large stopping criterion for training_iteration (1M in this case), and according to the Tune output when it crashed, the job that got the error was at around ~7K steps. The only interesting error message I found is again related to mmap_records for fd 0 in raylet.err. Unless the training_iteration criterion does something different than what I expect, none of the trials should've been terminated.
Also, maybe you are referring to the error @waldroje posted? I think things may have gotten a little bit mixed up here in this thread with reports from the other issue.
Lemme just change my avatar to make it easier to parse this thread. Our avatars happen to look the same...
I see, so it seems we still don't have a good root cause for the mmap crash (which is not related to the actor check failure?)
Btw
For 1., in our custom training/environment, we have some code in the training loop that dispatches remote calls and fetches the results via ray.get. Removing that code will still lead to crashes (but keeping it does seem to make it fail faster in general). The version of the log I provided I believe was using code that included the customization.
Back on the subject of a sharable repro, how about making a fake env that still dispatches the remote calls (with similarly sized zeroed-out objects, etc.)? You mentioned before that a mock env was not able to reproduce it. I suspect the remote calls are critical to reproducing, if the error is mmap/object-store related.
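Along those lines, a minimal sketch of such a fake env (sizes and names are placeholders; old-style gym API as used by RLlib at the time):

import gym
import numpy as np
import ray

@ray.remote
def fake_work(payload):
    # stand-in remote call that pushes a zero-filled array through the object store
    return float(payload.sum())

class MockRemoteCallEnv(gym.Env):
    observation_space = gym.spaces.Box(-1.0, 1.0, shape=(84,), dtype=np.float32)
    action_space = gym.spaces.Discrete(4)

    def reset(self):
        return np.zeros(84, dtype=np.float32)

    def step(self, action):
        # dispatch a remote call with a similarly sized zeroed-out object each step
        ray.get(fake_work.remote(np.zeros((256, 256), dtype=np.float32)))
        return np.zeros(84, dtype=np.float32), 0.0, False, {}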
Apologies, I've been offline for a couple of days... I'm not 100% sure @soundway's problem and mine are the same, although my error traces look exactly the same... I tried a bunch of iterations today with different combinations of nodes, to make sure it wasn't a specific node causing problems. I was able to run trials under many scenarios with 4-6 nodes, but when I go to 18 nodes my trials die immediately. The point being, I was able to run both locally and remotely, yet it fails at some sort of resource requirement... Since it fails immediately, I'm attaching a copy of the log directory... let me know if anything there is helpful... I'm still trying to produce a mock env that reproduces this problem... logs.zip
@ericl Didn't get to spend too much time on this in the last week but thought I should give an update on my part.
Regarding the dispatches of remote calls -- that was one of my earlier suspicions as well, so I wrote some similar versions for the Q-bert/mocked environments quite early on. Unfortunately, its inclusion doesn't seem to have much impact (and I just recently tested them again and observed the same thing). In fact, if anything the data suggests that adding these extra calls seems to lengthen the time for things to crash. Our own environment does a lot of things with processes/threads/files/network, and have many dependencies. So from the looks of it so far, those are more likely to be related to the crash than these remote calls.
I've been unsuccessful at creating a script that reproduces the issue within minutes on a single instance with a vanilla/mock environment. On a p2.16xlarge instance, I was able to quite consistently recreate the issue within a day if I reset the trials every 5 iterations. However, I sifted through the error logs and couldn't find anything interesting in there (i.e. no errors like mmap_records for fd 0). So this could be a separate issue.
Here's the slightly modified code from earlier:
import copy
import gym
import numpy as np
import ray
import ray.rllib.agents.ppo as ppo

if __name__ == '__main__':
    ray.init(address="auto")
    config = copy.deepcopy(ppo.DEFAULT_CONFIG)
    config.update({
        "rollout_fragment_length": 32,
        "train_batch_size": 2048,
        "sgd_minibatch_size": 512,
        "num_sgd_iter": 1,
        "num_workers": 64,
        "num_gpus": 2,
        "num_cpus_per_worker": 0.20,
        "num_cpus_for_driver": 1,
        "model": {"fcnet_hiddens": [1024, 1024]},
        "framework": "torch",
        "lr": ray.tune.sample_from(lambda s: np.random.random()),
    })
    trainer_cls = ppo.PPOTrainer
    config["env"] = "QbertNoFrameskip-v4"
    ray.tune.run(trainer_cls,
                 config=config,
                 fail_fast=True,
                 reuse_actors=False,
                 queue_trials=True,
                 num_samples=10000,
                 scheduler=ray.tune.schedulers.ASHAScheduler(
                     time_attr='training_iteration',
                     metric='episode_reward_mean',
                     mode='max',
                     max_t=5,
                     grace_period=5,
                     reduction_factor=3,
                     brackets=3),
                 )
Hi,
We observed very similar behavior on our end after updating to 0.8.7 and 1.0.0. We mainly use PPO with a custom-built cluster consisting of four machines, and we also use a custom environment. The problem gets more frequent when using PPO in a multi-agent configuration. We also observed some correlation with train_batch_size: in the current multi-agent case, a value of 1024 * 256 works fine and stays stable for at least 48h+. For twice that batch size (2 * 1024 * 256) we observe a slow degradation of the learning process: the number of healthy workers decreases over time, as in the following graph:
Not sure if this is obvious or helpful, but with each decrease in the healthy actor count the 'step' counter is reset to zero. Below is the same graph of the number of healthy workers, but as a function of step rather than time:
When we set train_batch_size to 4 * 1024 * 256, the training process fails after the first optimization. An important note is that we successfully used this last train_batch_size in the past, without stability issues, for single-agent training. We are also below the physical limits of the cluster by a far margin; however, increasing the environment's demands definitely has a negative effect on stability. As already pointed out, the issue is hard to debug, but if you would like, we can conduct some guided experiments in our configuration.
Hi there, I'm experiencing similar actor exited unexpectedly errors but with RaySGD. I was new to ray and got started straight from version 1.0.0 with Python3.7. I mainly use RaySGD and Tune to run data-parallel training on CPUs in a cluster of beefy ubuntu nodes. I'm dealing with tabular data so I preload them into the memory store and let actors read from there. The entire training process is expected to be long (dozens of hours at least), but I would always get some actor exited errors after first 1~2 hours causing the entire workload to shut down. The RAM usage on each node was never above 50%. I've also set ulimit -n to 2^20 (1 million).
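For reference, the preload-then-read pattern described above looks roughly like this (hypothetical names and shapes, not the poster's actual code):

import numpy as np
import ray

ray.init(address="auto")

# put the tabular data into the object store once, up front
table_ref = ray.put(np.zeros((1_000_000, 32), dtype=np.float32))

@ray.remote
class Trainer:
    def __init__(self, data):
        # Ray resolves the ObjectRef argument, so `data` arrives as the shared
        # array (backed by plasma; zero-copy, read-only for numpy)
        self.data = data

    def train_epoch(self):
        return float(self.data.mean())

workers = [Trainer.remote(table_ref) for _ in range(4)]
print(ray.get([w.train_epoch.remote() for w in workers]))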
Hi there, I'm experiencing a similar (although it might not be exactly the same) issue, with the GCS reporting a stream of:
gcs_server.out:I1117 17:50:03.524953 59 59 gcs_worker_manager.cc:32] Reporting worker failure, worker id = 9e5bc2ff1515401c181db192ee0a0d3dfccd5a48, node id = ec63ba9921092c53a281629a004c547f7f8353bc, address = 10.120.253.12
and monitor reporting a stream of:
Monitor: could not find ip for client c77390ddb83f252b89c3d4d412d30b45c0e96f2a
Similar to @waldroje above, this doesn't happen with 3-4 nodes in a cluster, but starts when the number of nodes is increased. This didn't happen when we used v0.7.3, but it does happen with the latest master and v1.0.1.
Strangely, Tune reports 628 cores occupied over 8 trials, but only 6 trials are actually reporting stats, with a total of 434 occupied cores:
== Status ==
Memory usage on this node: 15.4/187.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 628/640 CPUs, 0/0 GPUs, 0.0/2615.67 GiB heap, 0.0/776.37 GiB objects
Result logdir: /home/kz430x/ray_results/stress_testing_fewer_nodes
Number of trials: 9/18 (1 PENDING, 8 RUNNING)
+-------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------+
| Trial name | status | loc | num_envs_per_worker | num_workers | rollout_fragment_length | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|-------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------|
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00000 | RUNNING | 10.120.253.69:80 | 3 | 31 | 50 | 99 | 2178.26 | 998152 | 0.515539 | 1.48895 | 0.410766 | 32.9055 |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00001 | RUNNING | 10.120.252.234:80 | 5 | 31 | 50 | 83 | 2185.02 | 838823 | 0.671696 | 0.671751 | 0.66081 | 50 |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00002 | RUNNING | 10.120.252.177:79 | 3 | 93 | 50 | 136 | 2162.26 | 1367871 | 0.442033 | 0.485261 | 0.441982 | 6.00777 |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00003 | RUNNING | 10.120.253.35:78 | 5 | 93 | 50 | 111 | 2175.79 | 1122076 | 0.671751 | 0.671751 | 0.671751 | 50 |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00004 | RUNNING | | 3 | 155 | 50 | | | | | | | |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00005 | RUNNING | 10.120.253.51:81 | 5 | 155 | 50 | 109 | 2183.33 | 1102251 | 0.508532 | 1.35241 | 0.355423 | 18 |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00006 | RUNNING | 10.120.252.203:80 | 3 | 31 | 100 | 73 | 2209.38 | 744252 | 0.54014 | 0.602843 | 0.517977 | 10.0118 |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00007 | RUNNING | | 5 | 31 | 100 | | | | | | | |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00008 | PENDING | | 3 | 93 | 100 | | | | | | | |
+-------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------+
@ericl I'm not sure the ulimit solution is relevant in my case, since this is the output of ulimit -a inside my docker container:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 18446744073709551597
file size (blocks, -f) unlimited
pending signals (-i) 768188
max locked memory (kbytes, -l) 16384
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 40960
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
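For completeness, the limit a Python worker process actually sees can be checked with the standard library (a small illustrative snippet, not part of the original report), since a containerized process sometimes inherits a different limit than the host shell shows:

import resource

# RLIMIT_NOFILE is the per-process open-file-descriptor limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")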
I'm attaching my logs:
@OrlMat what's the last stable Ray version you managed to make it work at scale...? Did you say v0.8.6 doesn't work for you either??
We started to observe issues after upgrading to v0.8.7 - I have already edited the post.
I'll spend more time trying to reproduce this. Seems like I might just have to bite the bullet of waiting a couple days for a crash.
As usual, if anyone has a fast reproducible crash that would be very helpful.
Btw @soundway, there is some possibility that part of your issue is solved by https://github.com/ray-project/ray/pull/11869, which is in the latest wheels. However, that issue should manifest immediately and not after a long period of time.
@OrlMat @roireshef @soundway Also, can you give us logs from the crashed nodes? It seems like everyone's issue basically stems from random node failures, and we don't know why they happen.
Quick update, I am currently running the following on a 10-node cluster:
rllib train --run=PPO --env=CartPole-v0 --config='{"num_workers": 32}' --stop='{"timesteps_total": 1000}' --num-samples=10000 --ray-address=auto -vv
After a few hours (2000 / 10000 trials complete), the raylet is at 100% CPU, which is causing heartbeat delays. This seems likely to result in random node failures if the raylet CPU load continues to increase.
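One rough way to confirm whether the raylet is the process pegging the CPU on a node (an illustrative sketch; assumes psutil is installed):

import time
import psutil

# Find raylet processes and sample their CPU usage over a short window.
raylets = [p for p in psutil.process_iter(["name"]) if "raylet" in (p.info["name"] or "")]
for p in raylets:
    p.cpu_percent(None)  # prime the per-process counter
time.sleep(5)
for p in raylets:
    print(p.pid, p.cpu_percent(None), "% CPU")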
@OrlMat @roireshef @soundway do you also observe 100% CPU usage of raylet in your workloads after a while?
@ericl In my setup, I observe issues as soon as I start the Tune cluster, like trials not starting even though they are marked as RUNNING. After a while (can be ~1 day) they are marked as FAILED. Please see the following output (from ~1 day after the beginning of the Tune run). Just to give some context, this is a CPU-only cluster with 32 cores and ~128 GB RAM per node, and it has enough nodes/cores to run all trials simultaneously.
+------------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------+
| Trial name | status | loc | num_envs_per_worker | num_workers | rollout_fragment_length | iter | total time (s) | ts | reward |
|------------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------|
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00000 | ERROR | | 3 | 31 | 50 | 570 | 4869.03 | 2897185 | 0.686032 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00001 | RUNNING | 10.120.253.10:69 | 5 | 31 | 50 | 480 | 6518.52 | 2470741 | 0.705774 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00002 | RUNNING | 10.120.253.5:81 | 3 | 93 | 50 | 714 | 4301.41 | 3628635 | 0.748065 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00003 | RUNNING | | 5 | 93 | 50 | | | | |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00004 | RUNNING | 10.120.252.231:75 | 3 | 155 | 50 | 1587 | 1951.24 | 8107295 | 0.671751 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00005 | RUNNING | 10.120.253.66:74 | 5 | 155 | 50 | 453 | 6440.03 | 2283493 | 0.540052 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00006 | RUNNING | 10.120.252.198:79 | 3 | 31 | 100 | 433 | 6805.06 | 2244494 | 0.679827 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00007 | RUNNING | 10.120.252.243:89 | 5 | 31 | 100 | 310 | 9105.32 | 1588171 | 0.507826 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00008 | RUNNING | 10.120.253.53:85 | 3 | 93 | 100 | 443 | 6774.8 | 2279045 | 0.50638 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00009 | RUNNING | 10.120.252.222:79 | 5 | 93 | 100 | 330 | 8890.16 | 1680894 | 0.513259 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00010 | RUNNING | 10.120.252.229:76 | 3 | 155 | 100 | 473 | 6002.71 | 2437666 | 0.542599 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00011 | RUNNING | 10.120.252.198:82 | 5 | 155 | 100 | 312 | 8455.52 | 1593173 | 0.417053 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00012 | RUNNING | 10.120.252.216:77 | 3 | 31 | 200 | 270 | 8964.67 | 1484697 | 0.488652 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00013 | RUNNING | 10.120.252.201:61 | 5 | 31 | 200 | 151 | 10504.3 | 762662 | 0.458328 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00014 | RUNNING | 10.120.252.202:78 | 3 | 93 | 200 | 261 | 8668.87 | 1437558 | 0.478272 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00015 | ERROR | | 5 | 93 | 200 | 127 | 9100.73 | 640695 | 0.432807 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00016 | RUNNING | 10.120.252.209:76 | 3 | 155 | 200 | 249 | 9255.21 | 1368668 | 0.437783 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00017 | RUNNING | 10.120.253.25:87 | 5 | 155 | 200 | 46 | 21294.9 | 233320 | 0.596508 |
+------------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------+
Number of errored trials: 2
+------------------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name | # failures | error file |
|------------------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00000 | 1 | /home/kz430x/ray_results/stress_testing_54/new_base_NegotiationMergeHandoffFullLateralEnv_0_num_envs_per_worker=3,num_workers=31,rollout_fragment_length=50_2020-11-17_20-47-526qieqnu0/error.txt |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00015 | 1 | /home/kz430x/ray_results/stress_testing_54/new_base_NegotiationMergeHandoffFullLateralEnv_15_num_envs_per_worker=5,num_workers=93,rollout_fragment_length=200_2020-11-17_20-47-53s2c_xath/error.txt |
+------------------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Errors:
$kz430x@dcmixphhpc009:~> cat /home/kz430x/ray_results/stress_testing_54/new_base_NegotiationMergeHandoffFullLateralEnv_0_num_envs_per_worker=3,num_workers=31,rollout_fragment_length=50_2020-11-17_20-47-5qieqnu0/error.txt
Failure # 1 (occurred at 2020-11-18_03-14-29)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 468, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1474, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::new_base.train() (pid=68, ip=10.120.253.26)
File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 400, in ray._raylet.execute_task.function_executor
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 497, in train
raise e
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 486, in train
result = Trainable.train(self)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 261, in train
result = self._train()
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 132, in _train
return self._train_exec_impl()
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 170, in _train_exec_impl
res = next(self.train_exec_impl)
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 731, in __next__
return next(self.built_iterator)
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 744, in apply_foreach
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 814, in apply_filter
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 814, in apply_filter
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 744, in apply_foreach
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 799, in add_wait_hooks
item = next(it)
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 744, in apply_foreach
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 525, in base_iterator
batch = ray.get(obj_id)
ray.exceptions.UnreconstructableError: Object 0d4ced372d8e5c5ffc7dfb5a010000c001000000 is lost (either LRU evicted or deleted by user) and cannot be reconstructed. Try increasing the object store memory available with ray.init(object_store_memory=<bytes>) or setting object store limits with ray.remote(object_store_memory=<bytes>). See also: https://docs.ray.io/en/latest/memory-management.html
@ericl - please also see the total time column. I've verified that the trials that ARE running started more or less (±20 sec) at the same time, yet the total time reported for some trials is completely off. This was actually sampled 6-7 hours after the beginning of the experiment, so the only trial whose reported total time is close to the actual elapsed time is the last one; all the other trials are way off. I hope this helps in any way.
@ericl @rkooo567 I'll make sure to provide you with a reproducible example soon, but note that this doesn't happen on a single machine, so you'll need to run Tune over a cluster of a few Ray nodes.
@ericl @rkooo567 To reproduce, please install the FLOW package using the instructions at https://flow.readthedocs.io/en/latest/flow_setup.html#local-installation-of-flow - NOTE: I skipped installing conda and installed it in my Docker container instead.
Next, clone the environment example from : https://github.com/roireshef/flow_rllib_public
Finally, set up your Ray cluster, add a ray.init() call to the tune script, and you should be ready to go.
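In other words, something along these lines (a minimal sketch; the environment and search space below are placeholders, not the actual flow_rllib_public experiment):

import ray
from ray import tune

# Attach to the already-running cluster instead of starting a local one.
# (_redis_password matches the password used with `ray start`; the parameter
# name may differ across Ray versions.)
ray.init(address="auto", _redis_password="my-password")

tune.run(
    "A3C",
    config={
        "env": "CartPole-v0",  # placeholder; the real runs use a FLOW environment
        "num_workers": tune.grid_search([31, 93, 155]),
        "rollout_fragment_length": tune.grid_search([50, 100, 200]),
    },
    stop={"timesteps_total": 10_000_000},
)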
Note that the issues occur when using relatively large clusters (20+ nodes, 32 cores each), so that multiple trials are running simultaneously. I use the following for the head node:
ray start --block --head \
--port=$redis_port \
--redis-password=$redis_password \
--node-ip-address=$head_node_ip \
--gcs-server-port=6005 \
--dashboard-port=6006 \
--node-manager-port=6007 \
--object-manager-port=6008 \
--redis-shard-ports=6400,6401,6402,6403,6404,6405,6406,6407,6408,6409 \
--min-worker-port=6100 --max-worker-port=6299 \
--include-dashboard=false
And for workers:
ray start --block \
--address=$redis_address \
--redis-password=$redis_password \
--node-ip-address=$worker_node_ip \
--node-manager-port=6007 \
--object-manager-port=6008 \
--min-worker-port=6100 --max-worker-port=6299
Thanks @roireshef, I'll try those next.
For now, I'm seeing issues with plain PPO on CartPole:
rllib train --run=PPO --env=CartPole-v0 --config='{"num_workers": 32}' --num-samples=100 --ray-address=auto -vv
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-30-0-248 is used (59.69 / 62.79 GB). The top 10 memory consumers are:
PID MEM COMMAND
73246 56.69GiB ray::PPO
Going to try upgrading tensorflow (realized this node was using 1.15 which is pretty old).
Update: this appears to be resolved with TF2.1.
It seems Ray nightly fixes the raylet 100% CPU bug now. I'm now running the following:
simple_tune.py
on a cluster of 20 m4.4xl workers (16 CPUs each), TensorFlow 2.1 / Ray nightly. Status: (1) ran OK up to 4k trials, eventually crashed with an unrelated Tune bug: https://github.com/ray-project/ray/issues/12227; (2) ?; (3) ?
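For reference, a minimal stand-in for a stress-test script of that shape (hypothetical; this is not the attached simple_tune.py):

import time
import ray
from ray import tune

ray.init(address="auto")

def trainable(config):
    # Trivial workload: report a few results and exit, so thousands of short
    # trials churn through the cluster.
    for step in range(3):
        time.sleep(1)
        tune.report(step=step, value=config["x"] * step)

tune.run(
    trainable,
    num_samples=10000,
    config={"x": tune.uniform(0, 1)},
    resources_per_trial={"cpu": 1},
)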
Here are the logs from our last run, which resulted in a crash of ~10 workers after the first train batch collection. I did not include logs from the workers because they contain a lot of lines logged by our environment. We used ray-1.1.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl, which btw seems to fail/freeze after the first optimization (the actors still collect data, but no further optimizations are triggered).
This is not a contribution.
Versions: Python 3.6.8, Ray 1.0, PyTorch 1.6, TensorFlow 1.15, OS: Ubuntu 18.04 (Docker)
Since upgrading to 0.8.7 and 1.0, we are experiencing multiple stability issues that result in jobs crashing with
The actor died unexpectedly before finishing this task
errors. Note that these issues are quite difficult to reproduce using the default environments provided by RLLib (it often takes over 40 hours with QBert), but with our custom environment they happen much earlier in the execution, sometimes as early as 4 minutes in, and they happen very consistently. We've never experienced anything like this with 0.8.5 or prior. Memory/resources shouldn't be the bottleneck. Even though our custom environments use more memory, we also use nodes with much larger memory capacity for their rollouts. We closely monitor them via Grafana to ensure that all usage falls well below what's available (i.e. overall memory usage is usually far below 50%). For every node, we assign 30% of the node's memory to the object store, which should be far more than enough given the experience/model size. Here's an example of the errors (produced by the script provided later):
Here's another variant of the error when running our own custom environment:
Here's the example script that produced the first error by training QBert with PPO. Note that it might take over 40 hours for the error to occur. The setup is a p3.2xlarge instance for the trainer, with the rollout workers on a c5.18xlarge instance; 30% of the memory on each instance is dedicated to the object store.
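A hypothetical stand-in for that script (the attached original is not reproduced here; the worker count and stopping condition are placeholders):

import ray
from ray import tune

# Object store sizing (30% of each node's memory) is applied at `ray start`
# time on every node, e.g. via --object-store-memory=<bytes>.
ray.init(address="auto")

tune.run(
    "PPO",
    config={
        "env": "QbertNoFrameskip-v4",
        "framework": "tf",
        "num_gpus": 1,        # trainer on the p3.2xlarge instance
        "num_workers": 64,    # rollout workers on the c5.18xlarge instance (placeholder count)
    },
    stop={"timesteps_total": 500_000_000},
)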
One of the things we tried while debugging the problem is storing references to all execution ops in memory, and somehow it helps. We discovered this mitigation almost accidentally while debugging our own execution plan. For instance, if we modify the PPO execution plan to also return all execution ops in a list that gets held in memory, then the time it takes for the job to crash increases significantly and we no longer get the same error. Instead, the error becomes
ray.exceptions.ObjectLostError: Object XXXXX is lost due to node failure
-- which seems to be caused by some node failing its heartbeat check. It's unclear whether our attempted mitigation is just a fluke, whether it points in the right direction for fixing the underlying problem, or whether these errors share the same underlying cause. Here's a modified script. Note that the new error is no longer guaranteed to be reproducible even when running for a long time, but with our environment it's quite consistent. In the worker logs, we would find the following message around the time we get the object lost error:
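A rough illustration of the keepalive idea described above (a hypothetical sketch, not the actual modified script and not the referenced worker-log message):

# Keep references to every intermediate op created while building an execution
# plan so that none of them can be garbage collected while training runs.
_KEEPALIVE = []

def keep(op):
    """Record a reference to an execution op and return it unchanged."""
    _KEEPALIVE.append(op)
    return op

# Inside a (hypothetical) execution plan, each stage would be wrapped, e.g.:
#   rollouts = keep(ParallelRollouts(workers, mode="bulk_sync"))
#   train_op = keep(rollouts.for_each(TrainOneStep(workers)))
#   return StandardMetricsReporting(train_op, workers, config)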
Further, sometimes (not always) the node that timed out shows a drastic, sharp increase (2-3x) in memory usage on our Grafana within several seconds near the end, which is far more than the amount of memory it should use. We attempted to mitigate this second error by increasing the num_heartbeats_timeout setting in --system_config, but it doesn't seem to make much difference.
None of these issues exist with the old optimizer scheme in 0.8.5 or earlier, and we can train with our custom environment for days without any issue. We also encounter a problem where, after a trial terminates, a new trial doesn't get started in certain cases (this can only be reproduced with our environments). It's unclear whether that's related to the issue above at all, and it's been hard to debug amid these other instability issues. We'll likely file a separate, more detailed bug report about it once this is addressed.
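For reference, one way such a heartbeat override can be passed on the head node (a minimal sketch; the exact parameter spelling has changed across Ray versions, so treat the names below as an assumption):

import ray

# Give slow nodes more time before they are marked dead. The value is
# illustrative; the equivalent can also be passed to `ray start` via the
# system-config JSON mentioned above.
ray.init(
    _system_config={
        "num_heartbeats_timeout": 300,
    }
)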