Tenoke opened this issue 4 years ago
cc @suquark? Looks like maybe there's some race condition?
Also @Tenoke can you provide a repro script?
I have run into a similar problem in a project I'm working on. I'm not able to provide a script to reproduce the problem (sadly the data is not ours, and the code is proprietary), but I have done some analysis, and can provide some background on the data we are processing.
Items of note:
From point 2 above: the object store falling back to /tmp instead of /dev/shm could be a common trigger in both cases. @Tenoke Is this fallback happening in your case?
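For reference, a quick way to check the shared-memory headroom before starting Ray (just a diagnostic sketch using the standard library; the object-store size below is an illustrative number, not a recommendation):

import shutil
import ray

# If /dev/shm is smaller than the object store Ray wants to create, Ray falls
# back to /tmp and prints the warning quoted in the logs below.
shm = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {shm.total / 1024**3:.1f} GiB, free: {shm.free / 1024**3:.1f} GiB")

# Inside Docker the usual fix is to raise the container's shared memory with
# `docker run --shm-size=...`; the object store size can also be set explicitly.
ray.init(object_store_memory=int(30 * 1024**3))  # illustrative; must fit in /dev/shm to avoid the /tmp fallback

If the warning still appears in the logs after that, the fallback is happening and everything written to the object store is going through /tmp.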
From point 6 above: I'm no expert on reading Ray logs, but I think the issue occurs when function arguments are being placed into shared memory prior to execution of the function.
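To illustrate what I mean (a simplified sketch with a made-up task, not our actual code): any sufficiently large argument passed to a .remote() call is serialized into the object store before the worker executes it, which matches the prepare_args / put_serialized_object frames in the trace below.

import numpy as np
import ray

ray.init()

@ray.remote
def build(frame):                      # stand-in for our build task
    return frame.shape

# The large argument is serialized into the plasma store before the task
# executes (prepare_args -> put_serialized_object in the trace below).
big = np.zeros((50_000, 1_000))        # illustrative size only
shape_ref = build.remote(big)

# Equivalent explicit form: the object lands in the store up front and the
# task receives only a reference.
big_ref = ray.put(big)
shape_ref = build.remote(big_ref)
print(ray.get(shape_ref))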
Subset of the log from an m5a.8xlarge instance (32 virtual CPUs, 120 GB RAM):
2020-08-18 15:16:44,972 main INFO Starting coordinated build
2020-08-18 15:16:44,978 INFO resource_spec.py:231 -- Starting Ray with 72.61 GiB memory available for workers and up to 35.12 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-08-18 15:16:45,468 WARNING services.py:1567 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
...
(pid=124) 2020-08-18 15:17:16,258 - PID:124 - master_builder - [ INFO] - [Ptn 19/31, file 8/1000, date 1/1]: Processing 1,620 new observations
(pid=128) 2020-08-18 15:17:16,253 - PID:128 - master_builder - [ INFO] - [Ptn 23/31, file 3/1000, date 1/1]: Discarded: 1,486 repetitions of old data
*** Aborted at 1597763881 (unix time) try "date -d @1597763881" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGBUS (@0x7f16ec84cfe0) received by PID 1 (TID 0x7f1f1d866740) from PID 18446744073382711264; stack trace: ***
@ 0x7f1f1dbcb730 (unknown)
@ 0x7f1f1d9c5925 (unknown)
@ 0x7f1ebee85c56 __pyx_f_3ray_7_raylet_13Pickle5Writer_write_to()
@ 0x7f1ebee8a02b __pyx_f_3ray_7_raylet_23Pickle5SerializedObject_write_to()
@ 0x7f1ebee9044b __pyx_f_3ray_7_raylet_27MessagePackSerializedObject_write_to()
@ 0x7f1ebeedd142 __pyx_pf_3ray_7_raylet_10CoreWorker_22put_serialized_object()
@ 0x7f1ebeedd7ea __pyx_pw_3ray_7_raylet_10CoreWorker_23put_serialized_object()
@ 0x7f1f1dd2d855 PyCFunction_Call
@ 0x7f1ebee45204 __Pyx_PyObject_Call()
@ 0x7f1ebee4f9b0 __Pyx_PyObject_CallOneArg()
@ 0x7f1ebeed5844 __pyx_f_3ray_7_raylet_prepare_args()
@ 0x7f1ebeeda523 __pyx_pw_3ray_7_raylet_10CoreWorker_35submit_task()
@ 0x7f1f1dd2b90c _PyMethodDef_RawFastCallKeywords
@ 0x7f1f1dd3135d _PyMethodDescr_FastCallKeywords
@ 0x7f1f1dda0cc5 _PyEval_EvalFrameDefault
@ 0x7f1f1dd9bfa1 _PyEval_EvalCodeWithName
@ 0x7f1f1dd2c8a2 _PyFunction_FastCallKeywords
@ 0x7f1f1dd9d331 _PyEval_EvalFrameDefault
@ 0x7f1f1dd9bfa1 _PyEval_EvalCodeWithName
@ 0x7f1f1dd2c8a2 _PyFunction_FastCallKeywords
@ 0x7f1f1dd9dc0b _PyEval_EvalFrameDefault
@ 0x7f1f1dd9bfa1 _PyEval_EvalCodeWithName
@ 0x7f1f1dd2c8a2 _PyFunction_FastCallKeywords
@ 0x7f1f1dda0b02 _PyEval_EvalFrameDefault
@ 0x7f1f1dd9bfa1 _PyEval_EvalCodeWithName
@ 0x7f1f1dd2c8a2 _PyFunction_FastCallKeywords
@ 0x7f1f1dd9d331 _PyEval_EvalFrameDefault
@ 0x7f1f1dd9bfa1 _PyEval_EvalCodeWithName
@ 0x7f1f1dd2c8a2 _PyFunction_FastCallKeywords
@ 0x7f1f1dd9d331 _PyEval_EvalFrameDefault
@ 0x7f1f1dd2c7fa _PyFunction_FastCallKeywords
@ 0x7f1f1dd9d331 _PyEval_EvalFrameDefault
(pid=122) 2020-08-18 15:17:16,183 - PID:122 - master_builder - [ INFO] - [Ptn 9/31, file 14/1000, date 1/1]: Active records: 1,449, inactive: 67
A new run was triggered using the same scenario as above (32 CPUs, 120 GB memory); it failed, but with significant differences in the logs.
If we look at PID 111 in the log below, the process clearly ran out of device space but then managed to fail gracefully. In this case (and many other PIDs showed the same behaviour), all the messages related to logging failures; however, they all failed outside the actor code.
(pid=111) OSError: [Errno 28] No space left on device
(pid=111) Arguments: ()
If we now look at PID 119, we see the same failure mode as before, again with no context for the failure, though in this case it looks like the failure happened after invocation (while storing the task outputs; see the sketch after this trace) rather than during preparation for invocation:
(pid=119) *** SIGBUS (@0x7f1ca25a2fe0) received by PID 119 (TID 0x7f23680c9740) from PID 18446744072138403808; stack trace: ***
(pid=119) @ 0x7f236842e730 (unknown)
(pid=119) @ 0x7f2368228930 (unknown)
(pid=119) @ 0x7f2367036c56 __pyx_f_3ray_7_raylet_13Pickle5Writer_write_to()
(pid=119) @ 0x7f236703b02b __pyx_f_3ray_7_raylet_23Pickle5SerializedObject_write_to()
(pid=119) @ 0x7f236704144b __pyx_f_3ray_7_raylet_27MessagePackSerializedObject_write_to()
(pid=119) @ 0x7f2367082dc5 __pyx_f_3ray_7_raylet_10CoreWorker_store_task_outputs()
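For completeness, the same applies on the way out (again a hedged sketch with a made-up task, not our code): a large return value is written into the object store when the task finishes, which matches the store_task_outputs frame above, so an exhausted /tmp-backed store can fail after invocation as well as during argument preparation.

import numpy as np
import ray

ray.init()

@ray.remote
def builder_step():                    # stand-in for our build task
    # A large return value is written into the plasma store when the task
    # finishes (the store_task_outputs frame in the trace above).
    return np.zeros((50_000, 1_000))   # illustrative size only

frame = ray.get(builder_step.remote())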
We also see various RayletErrors raised throughout:
(pid=3376) ray.exceptions.RayletError: The Raylet died with this message: Connection reset by peer
(pid=3376) ray.exceptions.RayletError: The Raylet died with this message: Broken pipe
(pid=3376) ray.exceptions.RayletError: The Raylet died with this message: Broken pipe
(pid=3376) ray.exceptions.RayletError: The Raylet died with this message: Broken pipe
(pid=3476) ray.exceptions.RayletError: The Raylet died with this message: Connection reset by peer
(pid=3476) ray.exceptions.RayletError: The Raylet died with this message: Broken pipe
(pid=3476) ray.exceptions.RayletError: The Raylet died with this message: Broken pipe
(pid=3476) ray.exceptions.RayletError: The Raylet died with this message: Broken pipe
ray.exceptions.RayTaskError: ray::master.coordination.build_coordinator.builder() (pid=109, ip=...)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Log extract from the later run (restricted to common logging and PIDs 111 and 119):
2020-08-19 10:44:07,336 main INFO Starting coordinated master build
2020-08-19 10:44:07,343 INFO resource_spec.py:231 -- Starting Ray with 72.61 GiB memory available for workers and up to 35.12 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-08-19 10:44:07,825 INFO services.py:1193 -- View the Ray dashboard at localhost:8265
2020-08-19 10:44:07,827 WARNING services.py:1567 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2020-08-19 10:44:07,835 WARNING services.py:1567 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2020-08-19 10:44:10,001 main INFO Locating data on S3
2020-08-19 10:44:12,951 main INFO Discovered 7,530 files to process
2020-08-19 10:44:15,998 master.coordination.build_coordinator INFO Coordination started
2020-08-19 10:44:15,998 master.coordination.build_coordinator DEBUG Building id generator and asset master config
2020-08-19 10:44:16,021 master.coordination.build_coordinator DEBUG Initialising partition manager
2020-08-19 10:44:16,057 master.coordination.build_coordinator DEBUG Launching builders
...
2020-08-19 10:44:41,628 WARNING worker.py:1001 -- The driver may not be able to keep up with the stdout/stderr of the workers. To avoid forwarding logs to the driver, use 'ray.init(log_to_driver=False)'.
...
2020-08-19 10:44:42,875 master.coordination.build_coordinator DEBUG Waiting for builders to complete...
...
(pid=111) 2020-08-19 10:45:11,701 - PID:111 - master_builder
...
(pid=119) *** Aborted at 1597833911 (unix time) try date -d @1597833911 if you are using GNU date ***
...
(pid=111) --- Logging error ---
...
(pid=119) PC: @ 0x0 (unknown)
...
(pid=111) Traceback (most recent call last):
(pid=111) File /usr/local/lib/python3.7/logging/__init__.py, line 1028, in emit
(pid=111) stream.write(msg + self.terminator)
(pid=111) File /usr/local/lib/python3.7/site-packages/ray/utils.py, line 408, in write
(pid=111) self.stream.write(data)
(pid=111) OSError: [Errno 28] No space left on device
(pid=111) Call stack:
(pid=111) File /usr/local/lib/python3.7/site-packages/ray/workers/default_worker.py, line 123, in <module>
(pid=111) ray.worker.global_worker.main_loop()
(pid=111) File /usr/local/lib/python3.7/site-packages/ray/worker.py, line 371, in main_loop
(pid=111) self.core_worker.run_task_loop()
(pid=111) File /builder/master/coordination/build_coordinator.py, line 177, in builder
(pid=111) desc
(pid=111) File /builder/master/master_builder.py, line 635, in build_master_at_time
(pid=111) logger.info(f'{description}Date cut-off is: {cut_off}')
...
(pid=111) Arguments: ()
...
(pid=119) *** SIGBUS (@0x7f1ca25a2fe0) received by PID 119 (TID 0x7f23680c9740) from PID 18446744072138403808; stack trace: ***
(pid=119) @ 0x7f236842e730 (unknown)
(pid=119) @ 0x7f2368228930 (unknown)
(pid=119) @ 0x7f2367036c56 __pyx_f_3ray_7_raylet_13Pickle5Writer_write_to()
(pid=119) @ 0x7f236703b02b __pyx_f_3ray_7_raylet_23Pickle5SerializedObject_write_to()
(pid=119) @ 0x7f236704144b __pyx_f_3ray_7_raylet_27MessagePackSerializedObject_write_to()
(pid=119) @ 0x7f2367082dc5 __pyx_f_3ray_7_raylet_10CoreWorker_store_task_outputs()
(pid=119) @ 0x7f236705e3e2 __pyx_f_3ray_7_raylet_task_execution_handler()
(pid=119) @ 0x7f2366ff4944 std::_Function_handler<>::_M_invoke()
(pid=119) @ 0x7f23670f96ab ray::CoreWorker::ExecuteTask()
(pid=119) @ 0x7f236709660b std::_Function_handler<>::_M_invoke()
(pid=119) @ 0x7f236711735d _ZZN3ray28CoreWorkerDirectTaskReceiver14HandlePushTaskERKNS_3rpc15PushTaskRequestEPNS1_13PushTaskReplyESt8functionIFvNS_6StatusES7_IFvvEESA_EEENKUlvE_clEv
(pid=119) @ 0x7f2367115603 ray::SchedulingQueue::Add()
(pid=119) @ 0x7f2367118684 ray::CoreWorkerDirectTaskReceiver::HandlePushTask()
(pid=119) @ 0x7f23670db2ab _ZN5boost4asio6detail18completion_handlerIZN3ray10CoreWorker14HandlePushTaskERKNS3_3rpc15PushTaskRequestEPNS5_13PushTaskReplyESt8functionIFvNS3_6StatusESB_IFvvEESE_EEEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
(pid=119) @ 0x7f236754d86f boost::asio::detail::scheduler::do_run_one()
(pid=119) @ 0x7f236754e631 boost::asio::detail::scheduler::run()
(pid=119) @ 0x7f236754f582 boost::asio::io_context::run()
(pid=119) @ 0x7f23670f5b78 ray::CoreWorkerProcess::RunTaskExecutionLoop()
(pid=119) @ 0x7f2366ffac06 __pyx_pw_3ray_7_raylet_10CoreWorker_5run_task_loop()
(pid=119) @ 0x7f236858e8d9 _PyMethodDef_RawFastCallKeywords
(pid=119) @ 0x7f236859435d _PyMethodDescr_FastCallKeywords
(pid=119) @ 0x7f2368603cc5 _PyEval_EvalFrameDefault
(pid=119) @ 0x7f236858f7fa _PyFunction_FastCallKeywords
(pid=119) @ 0x7f2368600112 _PyEval_EvalFrameDefault
(pid=119) @ 0x7f23685fefa1 _PyEval_EvalCodeWithName
(pid=119) @ 0x7f23685feca9 PyEval_EvalCodeEx
(pid=119) @ 0x7f23685fec6b PyEval_EvalCode
(pid=119) @ 0x7f236868284e (unknown)
(pid=119) @ 0x7f2368681c07 PyRun_FileExFlags
(pid=119) @ 0x7f2368681a2e PyRun_SimpleFileExFlags
(pid=119) @ 0x7f2368686fd8 (unknown)
(pid=119) @ 0x7f2368686cce _Py_UnixMain
...
(pid=111) --- Logging error ---
(pid=111) Traceback (most recent call last):
(pid=111) File /usr/local/lib/python3.7/logging/__init__.py, line 1028, in emit
(pid=111) stream.write(msg + self.terminator)
(pid=111) File /usr/local/lib/python3.7/site-packages/ray/utils.py, line 408, in write
(pid=111) self.stream.write(data)
(pid=111) OSError: [Errno 28] No space left on device
(pid=111) Call stack:
(pid=111) File /usr/local/lib/python3.7/site-packages/ray/workers/default_worker.py, line 123, in <module>
(pid=111) ray.worker.global_worker.main_loop()
(pid=111) File /usr/local/lib/python3.7/site-packages/ray/worker.py, line 371, in main_loop
(pid=111) self.core_worker.run_task_loop()
(pid=111) File /builder/master/coordination/build_coordinator.py, line 177, in builder
(pid=111) desc
(pid=111) File /builder/master/master_builder.py, line 676, in build_master_at_time
(pid=111) logger.info(f'{description}Discarded: {pre - post:,} rows of old data')
...
(pid=111) Arguments: ()
...
Traceback (most recent call last):
File master-build.py, line 190, in <module>
main(**args)
File master-build.py, line 58, in main
master = parse_args_and_init_build(build_args)
File /builder/master/coordination/main.py, line 300, in parse_args_and_init_build
build_args.data_preload_size,
File /builder/master/coordination/build_coordinator.py, line 115, in coordinate_build_master
final_masters = ray.get(final_master_refs)
File /usr/local/lib/python3.7/site-packages/ray/worker.py, line 1538, in get
raise value.as_instanceof_cause()
...
File python/ray/_raylet.pyx, line 479, in ray._raylet.execute_task
File /builder/master/coordination/build_coordinator.py, line 154, in builder
new_data_ref = ray.get(partition_manager.get_frame.remote(path))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
(pid=111) --- Logging error ---
(pid=111) Traceback (most recent call last):
(pid=111) File /usr/local/lib/python3.7/logging/__init__.py, line 1028, in emit
(pid=111) stream.write(msg + self.terminator)
(pid=111) File /usr/local/lib/python3.7/site-packages/ray/utils.py, line 408, in write
(pid=111) self.stream.write(data)
(pid=111) OSError: [Errno 28] No space left on device
(pid=111) Call stack:
(pid=111) File /usr/local/lib/python3.7/site-packages/ray/workers/default_worker.py, line 123, in <module>
(pid=111) ray.worker.global_worker.main_loop()
(pid=111) File /usr/local/lib/python3.7/site-packages/ray/worker.py, line 371, in main_loop
(pid=111) self.core_worker.run_task_loop()
(pid=111) File /builder/master/coordination/build_coordinator.py, line 177, in builder
(pid=111) desc
(pid=111) File /builder/master/master_builder.py, line 728, in build_master_at_time
(pid=111) + f'rows in master: {len(new_master):,}'
...
(pid=111) Arguments: ()
...
(pid=111) E0819 10:45:35.523273 111 465 task_manager.cc:323] Task failed: IOError: 14: failed to connect to all addresses: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=master.coordination.partition_manager, class_name=PartitionManager, function_name=mark_processed, function_hash=}, task_id=c1eb92e346d4d5baef0a6c220100, job_id=0100, num_args=2, num_returns=2, actor_task_spec={actor_id=ef0a6c220100, actor_caller_id=9d28cb176c7f7501ffffffff0100, actor_counter=57}
(pid=111) E0819 10:45:35.523409 111 465 task_manager.cc:323] Task failed: IOError: 14: failed to connect to all addresses: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=master.coordination.partition_manager, class_name=PartitionManager, function_name=get_frame, function_hash=}, task_id=0437d18e6aa6920fef0a6c220100, job_id=0100, num_args=2, num_returns=2, actor_task_spec={actor_id=ef0a6c220100, actor_caller_id=9d28cb176c7f7501ffffffff0100, actor_counter=58}
Did you guys figure out this issue?
@rkooo567 Sorry, no. We just had to manually control the number of processes in flight.
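Roughly along these lines (a simplified sketch; builder, paths_to_process and the limit are stand-ins for our real task, inputs and tuning): cap the number of submitted tasks and block on ray.wait() before submitting more, so only a bounded number of results sit in the object store at any time.

import ray

ray.init()

@ray.remote
def builder(path):                     # stand-in for the real build task
    return path

paths_to_process = [f"part-{i}" for i in range(1_000)]  # stand-in inputs
MAX_IN_FLIGHT = 8                      # tuned by hand; illustrative value

pending = []
for path in paths_to_process:
    if len(pending) >= MAX_IN_FLIGHT:
        # Wait for at least one task to finish before submitting more.
        done, pending = ray.wait(pending, num_returns=1)
        ray.get(done)
    pending.append(builder.remote(path))

results = ray.get(pending)             # drain the remainder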
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS):
pytorch:latest Docker image, Ray 0.8.6.
Everything runs normally on fewer cores; the code was rewritten in Ray specifically to take advantage of multi-core systems. Currently, however, it runs into this after ~1000 tasks on a 60-core machine (EC2 c5a.16xlarge). There is plenty of RAM available (16% used), one worker per process, etc.
Any chance you can point me in the direction of where the problem might even be?