ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray issue with serializing pytorch objects only when running on 40+ cores #9752

Open Tenoke opened 4 years ago

Tenoke commented 4 years ago

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS):

pytorch/pytorch:latest Docker image, Ray 0.8.6

*** SIGBUS (@0x7f92c504b000) received by PID 105 (TID 0x7f99b1015740) from PID 18446744072720003072; stack trace: ***
    @     0x7f99b0bec890 (unknown)
    @     0x7f99b0957fb9 (unknown)
    @     0x7f99af04c426 __pyx_f_3ray_7_raylet_13Pickle5Writer_write_to()
    @     0x7f99af050b4b __pyx_f_3ray_7_raylet_23Pickle5SerializedObject_write_to()
    @     0x7f99af06255b __pyx_f_3ray_7_raylet_27MessagePackSerializedObject_write_to()
    @     0x7f99af09dc05 __pyx_f_3ray_7_raylet_10CoreWorker_store_task_outputs()
    @     0x7f99af07c8a5 __pyx_f_3ray_7_raylet_task_execution_handler()
    @     0x7f99af01be54 std::_Function_handler<>::_M_invoke()
    @     0x7f99af10555a ray::CoreWorker::ExecuteTask()
    @     0x7f99af0af8cb std::_Function_handler<>::_M_invoke()
    @     0x7f99af121c11 _ZZN3ray28CoreWorkerDirectTaskReceiver14HandlePushTaskERKNS_3rpc15PushTaskRequestEPNS1_13PushTaskReplyESt8functionIFvNS_6StatusES7_IFvvEESA_EEENKUlvE_clEv
    @     0x7f99af120bf3 ray::SchedulingQueue::Add()
    @     0x7f99af123f07 ray::CoreWorkerDirectTaskReceiver::HandlePushTask()
    @     0x7f99af0ea8db _ZN5boost4asio6detail18completion_handlerIZN3ray10CoreWorker14HandlePushTaskERKNS3_3rpc15PushTaskRequestEPNS5_13PushTaskReplyESt8functionIFvNS3_6StatusESB_IFvvEESE_EEEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
    @     0x7f99af5e80df boost::asio::detail::scheduler::do_run_one()
    @     0x7f99af5e8cf1 boost::asio::detail::scheduler::run()
    @     0x7f99af5e9c42 boost::asio::io_context::run()
    @     0x7f99af102ea0 ray::CoreWorkerProcess::RunTaskExecutionLoop()
    @     0x7f99af022186 __pyx_pw_3ray_7_raylet_10CoreWorker_5run_task_loop()
    @     0x55c1df998631 _PyMethodDef_RawFastCallKeywords
    @     0x55c1df9a03cf _PyMethodDescr_FastCallKeywords
    @     0x55c1dfa0407c _PyEval_EvalFrameDefault

Everything runs normally on fewer cores, but the code was rewritten in Ray precisely to take advantage of multi-core systems. Currently my code runs into this after ~1000 tasks on a 60-core machine (EC2 c5a.16xlarge). There is plenty of RAM available (16% used), one worker per process, etc.

Any chance you can point me in the direction of where the problem might even be?
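For reference, the workload is shaped roughly like the sketch below. It is heavily simplified and not a verified repro; the function name and tensor shape are placeholders. The key point is that many tasks each return a PyTorch tensor, so the results get serialized into the object store:

    import ray
    import torch

    ray.init()

    @ray.remote
    def build_tensor(i):
        # Placeholder for the real work; what matters is that a PyTorch tensor
        # is returned and therefore serialized as a task output.
        return torch.randn(1000, 1000)

    # Roughly a thousand tasks fanned out across all cores; the SIGBUS only
    # shows up on machines with 40+ cores.
    results = ray.get([build_tensor.remote(i) for i in range(1000)])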

richardliaw commented 4 years ago

cc @suquark? Looks like maybe there's some race condition?

Also @Tenoke can you provide a repro script?

ChristianECooper commented 4 years ago

I have run into a similar problem in a project I'm working on. I'm not able to provide a script to reproduce the problem (sadly the data is not ours, and the code is proprietary), but I have done some analysis, and can provide some background on the data we are processing.

Items of note:

  1. Using Python 3.7.7, Pandas 0.25.3, and Ray 0.8.7
  2. PyTorch is not used in my project. The processing is mostly Pandas manipulation of DataFrames partitioned for parallel processing with Ray (partitioned into CPU count minus one partitions; roughly the pattern sketched after this list).
  3. We are seeing the problem occur on a 32-core box, inside a Docker container running on an EC2 instance launched by AWS Batch.
  4. The log below came from an m5a.8xlarge instance (with 32 virtual CPUs & 120 GB RAM). More testing will be going on today where we vary the memory and CPU availability to see what reliably triggers the problem; if I can find anything conclusive I'll add the details later.
  5. AWS Batch does not provide a way to increase the limited /dev/shm memory-backed storage (and some is already claimed by Docker by the time we spin Ray up). So, as you can see from the head of the log below, Ray reverts to using /tmp. (Details from Amazon on the subject of /dev/shm management: https://forums.aws.amazon.com/thread.jspa?messageID=879337)
  6. I have an option to preload DataFrames into our application prior to Ray orchestrating the processing. With a preload of 1,000 DataFrames the process fails, but with 50 preloaded on the same box it succeeds. The preloaded 1,000 DataFrames take up approx. 6.6 GB on disk in Parquet format, and around 120 GB once expanded and loaded into memory. Note: there are no memory allocation warnings in the logs.
  7. Skipping Docker/AWS and running the process on a MacBook Pro (12 cores and 32 GB RAM) works fine with a reduced preload due to memory constraints.
  8. Skipping Docker/AWS and running the process directly on Arch Linux (64 cores and 0.5 TB RAM) works fine irrespective of preload.
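For context, the processing described in point 2 is shaped roughly like the following sketch. It is heavily simplified and purely illustrative (the real code is proprietary); process_partition, the input path, and the per-partition work are placeholders:

    import multiprocessing

    import numpy as np
    import pandas as pd
    import ray

    ray.init()

    @ray.remote
    def process_partition(partition: pd.DataFrame) -> pd.DataFrame:
        # Placeholder for the real per-partition work: lots of Pandas manipulation.
        return partition

    # Illustrative input; in reality the DataFrames come from Parquet files on S3.
    df = pd.read_parquet("some_input.parquet")

    # Partition into (number of CPUs - 1) chunks, as described in point 2.
    num_partitions = multiprocessing.cpu_count() - 1
    partitions = np.array_split(df, num_partitions)

    # Each DataFrame argument is serialized into the shared-memory object store
    # when the task is submitted.
    results = ray.get([process_partition.remote(p) for p in partitions])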

From point 2 above:

I'm no expert on reading Ray logs, but I think the issue is occurring when function arguments are being placed into shared memory prior to execution of the function.
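Reusing the placeholder names from the sketch above: passing a large DataFrame as a task argument is, as I understand it, roughly equivalent to putting it into the object store explicitly, which appears to be the path the prepare_args and put_serialized_object frames in the trace below are on:

    # Implicit: the argument is serialized into shared memory at submission time.
    ref = process_partition.remote(partition)

    # Roughly equivalent explicit form.
    partition_ref = ray.put(partition)
    ref = process_partition.remote(partition_ref)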

Subset of log from m5a.8xlarge instance (32 virtual CPUs & 120 GB RAM):

2020-08-18 15:16:44,972 main INFO Starting coordinated build
2020-08-18 15:16:44,978 INFO resource_spec.py:231 -- Starting Ray with 72.61 GiB memory available for workers and up to 35.12 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-08-18 15:16:45,468 WARNING services.py:1567 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.

...

(pid=124) 2020-08-18 15:17:16,258 - PID:124 - master_builder - [ INFO] - [Ptn 19/31, file 8/1000, date 1/1]: Processing 1,620 new observations
(pid=128) 2020-08-18 15:17:16,253 - PID:128 - master_builder - [ INFO] - [Ptn 23/31, file 3/1000, date 1/1]: Discarded: 1,486 repetitions of old data
*** Aborted at 1597763881 (unix time) try "date -d @1597763881" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGBUS (@0x7f16ec84cfe0) received by PID 1 (TID 0x7f1f1d866740) from PID 18446744073382711264; stack trace: ***
    @     0x7f1f1dbcb730 (unknown)
    @     0x7f1f1d9c5925 (unknown)
    @     0x7f1ebee85c56 __pyx_f_3ray_7_raylet_13Pickle5Writer_write_to()
    @     0x7f1ebee8a02b __pyx_f_3ray_7_raylet_23Pickle5SerializedObject_write_to()
    @     0x7f1ebee9044b __pyx_f_3ray_7_raylet_27MessagePackSerializedObject_write_to()
    @     0x7f1ebeedd142 __pyx_pf_3ray_7_raylet_10CoreWorker_22put_serialized_object()
    @     0x7f1ebeedd7ea __pyx_pw_3ray_7_raylet_10CoreWorker_23put_serialized_object()
    @     0x7f1f1dd2d855 PyCFunction_Call
    @     0x7f1ebee45204 __Pyx_PyObject_Call()
    @     0x7f1ebee4f9b0 __Pyx_PyObject_CallOneArg()
    @     0x7f1ebeed5844 __pyx_f_3ray_7_raylet_prepare_args()
    @     0x7f1ebeeda523 __pyx_pw_3ray_7_raylet_10CoreWorker_35submit_task()
    @     0x7f1f1dd2b90c _PyMethodDef_RawFastCallKeywords
    @     0x7f1f1dd3135d _PyMethodDescr_FastCallKeywords
    @     0x7f1f1dda0cc5 _PyEval_EvalFrameDefault
    @     0x7f1f1dd9bfa1 _PyEval_EvalCodeWithName
    @     0x7f1f1dd2c8a2 _PyFunction_FastCallKeywords
    @     0x7f1f1dd9d331 _PyEval_EvalFrameDefault
    @     0x7f1f1dd9bfa1 _PyEval_EvalCodeWithName
    @     0x7f1f1dd2c8a2 _PyFunction_FastCallKeywords
    @     0x7f1f1dd9dc0b _PyEval_EvalFrameDefault
    @     0x7f1f1dd9bfa1 _PyEval_EvalCodeWithName
    @     0x7f1f1dd2c8a2 _PyFunction_FastCallKeywords
    @     0x7f1f1dda0b02 _PyEval_EvalFrameDefault
    @     0x7f1f1dd9bfa1 _PyEval_EvalCodeWithName
    @     0x7f1f1dd2c8a2 _PyFunction_FastCallKeywords
    @     0x7f1f1dd9d331 _PyEval_EvalFrameDefault
    @     0x7f1f1dd9bfa1 _PyEval_EvalCodeWithName
    @     0x7f1f1dd2c8a2 _PyFunction_FastCallKeywords
    @     0x7f1f1dd9d331 _PyEval_EvalFrameDefault
    @     0x7f1f1dd2c7fa _PyFunction_FastCallKeywords
    @     0x7f1f1dd9d331 _PyEval_EvalFrameDefault
(pid=122) 2020-08-18 15:17:16,183 - PID:122 - master_builder - [ INFO] - [Ptn 9/31, file 14/1000, date 1/1]: Active records: 1,449, inactive: 67

ChristianECooper commented 4 years ago

A new run was triggered using the same scenario as above (32 CPUs, 120 GB memory); it failed, but with significant differences in the logs.

If we look at PID 111 in the log below, the process clearly ran out of device space but then managed to fail gracefully. In this case (and many other PIDs showed the same behaviour) all of the messages were related to logging failures, and all of those failures occurred outside actor code.

(pid=111) OSError: [Errno 28] No space left on device

If we now look at PID 119, we see the same failure mode as before, again with no context for the failure, though in this case it looks like the failure occurred after invocation (while storing task outputs) rather than during preparation for invocation:

(pid=111) Arguments: ()
(pid=119) *** SIGBUS (@0x7f1ca25a2fe0) received by PID 119 (TID 0x7f23680c9740) from PID 18446744072138403808; stack trace: ***
(pid=119)     @     0x7f236842e730 (unknown)
(pid=119)     @     0x7f2368228930 (unknown)
(pid=119)     @     0x7f2367036c56 __pyx_f_3ray_7_raylet_13Pickle5Writer_write_to()
(pid=119)     @     0x7f236703b02b __pyx_f_3ray_7_raylet_23Pickle5SerializedObject_write_to()
(pid=119)     @     0x7f236704144b __pyx_f_3ray_7_raylet_27MessagePackSerializedObject_write_to()
(pid=119)     @     0x7f2367082dc5 __pyx_f_3ray_7_raylet_10CoreWorker_store_task_outputs()

We also see various RayletErrors raised throughout:

(pid=3376) ray.exceptions.RayletError: The Raylet died with this message: Connection reset by peer
(pid=3376) ray.exceptions.RayletError: The Raylet died with this message: Broken pipe
(pid=3376) ray.exceptions.RayletError: The Raylet died with this message: Broken pipe
(pid=3376) ray.exceptions.RayletError: The Raylet died with this message: Broken pipe
(pid=3476) ray.exceptions.RayletError: The Raylet died with this message: Connection reset by peer
(pid=3476) ray.exceptions.RayletError: The Raylet died with this message: Broken pipe
(pid=3476) ray.exceptions.RayletError: The Raylet died with this message: Broken pipe
(pid=3476) ray.exceptions.RayletError: The Raylet died with this message: Broken pipe
ray.exceptions.RayTaskError: ray::master.coordination.build_coordinator.builder() (pid=109, ip=...)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Log extract from later run (restricted to common logging and PIDs 111 and 119):

2020-08-19 10:44:07,336 main INFO Starting coordinated master build
2020-08-19 10:44:07,343 INFO resource_spec.py:231 -- Starting Ray with 72.61 GiB memory available for workers and up to 35.12 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-08-19 10:44:07,825 INFO services.py:1193 -- View the Ray dashboard at localhost:8265
2020-08-19 10:44:07,827 WARNING services.py:1567 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2020-08-19 10:44:07,835 WARNING services.py:1567 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2020-08-19 10:44:10,001 main INFO Locating data on S3
2020-08-19 10:44:12,951 main INFO Discovered 7,530 files to process
2020-08-19 10:44:15,998 master.coordination.build_coordinator INFO Coordination started
2020-08-19 10:44:15,998 master.coordination.build_coordinator DEBUG Building id generator and asset master config
2020-08-19 10:44:16,021 master.coordination.build_coordinator DEBUG Initialising partition manager
2020-08-19 10:44:16,057 master.coordination.build_coordinator DEBUG Launching builders
...
2020-08-19 10:44:41,628 WARNING worker.py:1001 -- The driver may not be able to keep up with the stdout/stderr of the workers. To avoid forwarding logs to the driver, use 'ray.init(log_to_driver=False)'.
...
2020-08-19 10:44:42,875 master.coordination.build_coordinator DEBUG Waiting for builders to complete...
...
(pid=111) 2020-08-19 10:45:11,701 - PID:111 - master_builder 
...
(pid=119) *** Aborted at 1597833911 (unix time) try date -d @1597833911 if you are using GNU date ***
...
(pid=111) --- Logging error ---
...
(pid=119) PC: @                0x0 (unknown)
...
(pid=111) Traceback (most recent call last):
(pid=111)   File /usr/local/lib/python3.7/logging/__init__.py, line 1028, in emit
(pid=111)     stream.write(msg + self.terminator)
(pid=111)   File /usr/local/lib/python3.7/site-packages/ray/utils.py, line 408, in write
(pid=111)     self.stream.write(data)
(pid=111) OSError: [Errno 28] No space left on device
(pid=111) Call stack:
(pid=111)   File /usr/local/lib/python3.7/site-packages/ray/workers/default_worker.py, line 123, in <module>
(pid=111)     ray.worker.global_worker.main_loop()
(pid=111)   File /usr/local/lib/python3.7/site-packages/ray/worker.py, line 371, in main_loop
(pid=111)     self.core_worker.run_task_loop()
(pid=111)   File /builder/master/coordination/build_coordinator.py, line 177, in builder
(pid=111)     desc
(pid=111)   File /builder/master/master_builder.py, line 635, in build_master_at_time
(pid=111)     logger.info(f'{description}Date cut-off is: {cut_off}')
...
(pid=111) Arguments: ()
...
(pid=119) *** SIGBUS (@0x7f1ca25a2fe0) received by PID 119 (TID 0x7f23680c9740) from PID 18446744072138403808; stack trace: ***
(pid=119)     @     0x7f236842e730 (unknown)
(pid=119)     @     0x7f2368228930 (unknown)
(pid=119)     @     0x7f2367036c56 __pyx_f_3ray_7_raylet_13Pickle5Writer_write_to()
(pid=119)     @     0x7f236703b02b __pyx_f_3ray_7_raylet_23Pickle5SerializedObject_write_to()
(pid=119)     @     0x7f236704144b __pyx_f_3ray_7_raylet_27MessagePackSerializedObject_write_to()
(pid=119)     @     0x7f2367082dc5 __pyx_f_3ray_7_raylet_10CoreWorker_store_task_outputs()
(pid=119)     @     0x7f236705e3e2 __pyx_f_3ray_7_raylet_task_execution_handler()
(pid=119)     @     0x7f2366ff4944 std::_Function_handler<>::_M_invoke()
(pid=119)     @     0x7f23670f96ab ray::CoreWorker::ExecuteTask()
(pid=119)     @     0x7f236709660b std::_Function_handler<>::_M_invoke()
(pid=119)     @     0x7f236711735d _ZZN3ray28CoreWorkerDirectTaskReceiver14HandlePushTaskERKNS_3rpc15PushTaskRequestEPNS1_13PushTaskReplyESt8functionIFvNS_6StatusES7_IFvvEESA_EEENKUlvE_clEv
(pid=119)     @     0x7f2367115603 ray::SchedulingQueue::Add()
(pid=119)     @     0x7f2367118684 ray::CoreWorkerDirectTaskReceiver::HandlePushTask()
(pid=119)     @     0x7f23670db2ab _ZN5boost4asio6detail18completion_handlerIZN3ray10CoreWorker14HandlePushTaskERKNS3_3rpc15PushTaskRequestEPNS5_13PushTaskReplyESt8functionIFvNS3_6StatusESB_IFvvEESE_EEEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
(pid=119)     @     0x7f236754d86f boost::asio::detail::scheduler::do_run_one()
(pid=119)     @     0x7f236754e631 boost::asio::detail::scheduler::run()
(pid=119)     @     0x7f236754f582 boost::asio::io_context::run()
(pid=119)     @     0x7f23670f5b78 ray::CoreWorkerProcess::RunTaskExecutionLoop()
(pid=119)     @     0x7f2366ffac06 __pyx_pw_3ray_7_raylet_10CoreWorker_5run_task_loop()
(pid=119)     @     0x7f236858e8d9 _PyMethodDef_RawFastCallKeywords
(pid=119)     @     0x7f236859435d _PyMethodDescr_FastCallKeywords
(pid=119)     @     0x7f2368603cc5 _PyEval_EvalFrameDefault
(pid=119)     @     0x7f236858f7fa _PyFunction_FastCallKeywords
(pid=119)     @     0x7f2368600112 _PyEval_EvalFrameDefault
(pid=119)     @     0x7f23685fefa1 _PyEval_EvalCodeWithName
(pid=119)     @     0x7f23685feca9 PyEval_EvalCodeEx
(pid=119)     @     0x7f23685fec6b PyEval_EvalCode
(pid=119)     @     0x7f236868284e (unknown)
(pid=119)     @     0x7f2368681c07 PyRun_FileExFlags
(pid=119)     @     0x7f2368681a2e PyRun_SimpleFileExFlags
(pid=119)     @     0x7f2368686fd8 (unknown)
(pid=119)     @     0x7f2368686cce _Py_UnixMain
...
(pid=111) --- Logging error ---
(pid=111) Traceback (most recent call last):
(pid=111)   File /usr/local/lib/python3.7/logging/__init__.py, line 1028, in emit
(pid=111)     stream.write(msg + self.terminator)
(pid=111)   File /usr/local/lib/python3.7/site-packages/ray/utils.py, line 408, in write
(pid=111)     self.stream.write(data)
(pid=111) OSError: [Errno 28] No space left on device
(pid=111) Call stack:
(pid=111)   File /usr/local/lib/python3.7/site-packages/ray/workers/default_worker.py, line 123, in <module>
(pid=111)     ray.worker.global_worker.main_loop()
(pid=111)   File /usr/local/lib/python3.7/site-packages/ray/worker.py, line 371, in main_loop
(pid=111)     self.core_worker.run_task_loop()
(pid=111)   File /builder/master/coordination/build_coordinator.py, line 177, in builder
(pid=111)     desc
(pid=111)   File /builder/master/master_builder.py, line 676, in build_master_at_time
(pid=111)     logger.info(f'{description}Discarded: {pre - post:,} rows of old data')
...
(pid=111) Arguments: ()
...
Traceback (most recent call last):
  File master-build.py, line 190, in <module>
    main(**args)
  File master-build.py, line 58, in main
    master = parse_args_and_init_build(build_args)
  File /builder/master/coordination/main.py, line 300, in parse_args_and_init_build
    build_args.data_preload_size,
  File /builder/master/coordination/build_coordinator.py, line 115, in coordinate_build_master
    final_masters = ray.get(final_master_refs)
  File /usr/local/lib/python3.7/site-packages/ray/worker.py, line 1538, in get
    raise value.as_instanceof_cause()
...
  File python/ray/_raylet.pyx, line 479, in ray._raylet.execute_task
  File /builder/master/coordination/build_coordinator.py, line 154, in builder
    new_data_ref = ray.get(partition_manager.get_frame.remote(path))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
(pid=111) --- Logging error ---
(pid=111) Traceback (most recent call last):
(pid=111)   File /usr/local/lib/python3.7/logging/__init__.py, line 1028, in emit
(pid=111)     stream.write(msg + self.terminator)
(pid=111)   File /usr/local/lib/python3.7/site-packages/ray/utils.py, line 408, in write
(pid=111)     self.stream.write(data)
(pid=111) OSError: [Errno 28] No space left on device
(pid=111) Call stack:
(pid=111)   File /usr/local/lib/python3.7/site-packages/ray/workers/default_worker.py, line 123, in <module>
(pid=111)     ray.worker.global_worker.main_loop()
(pid=111)   File /usr/local/lib/python3.7/site-packages/ray/worker.py, line 371, in main_loop
(pid=111)     self.core_worker.run_task_loop()
(pid=111)   File /builder/master/coordination/build_coordinator.py, line 177, in builder
(pid=111)     desc
(pid=111)   File /builder/master/master_builder.py, line 728, in build_master_at_time
(pid=111)     + f'rows in master: {len(new_master):,}'
...
(pid=111) Arguments: ()
...
(pid=111) E0819 10:45:35.523273   111   465 task_manager.cc:323] Task failed: IOError: 14: failed to connect to all addresses: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=master.coordination.partition_manager, class_name=PartitionManager, function_name=mark_processed, function_hash=}, task_id=c1eb92e346d4d5baef0a6c220100, job_id=0100, num_args=2, num_returns=2, actor_task_spec={actor_id=ef0a6c220100, actor_caller_id=9d28cb176c7f7501ffffffff0100, actor_counter=57}
(pid=111) E0819 10:45:35.523409   111   465 task_manager.cc:323] Task failed: IOError: 14: failed to connect to all addresses: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=master.coordination.partition_manager, class_name=PartitionManager, function_name=get_frame, function_hash=}, task_id=0437d18e6aa6920fef0a6c220100, job_id=0100, num_args=2, num_returns=2, actor_task_spec={actor_id=ef0a6c220100, actor_caller_id=9d28cb176c7f7501ffffffff0100, actor_counter=58}
rkooo567 commented 4 years ago

Did you guys figure out this issue?

ChristianECooper commented 4 years ago

@rkooo567 Sorry no, we just had to manually control the number of processes in flight.
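
For anyone hitting the same thing, the throttling we ended up with is shaped roughly like the sketch below. It is illustrative only; process_partition, partitions, and the in-flight limit are placeholders (the limit was tuned by hand for our workload):

    import ray

    ray.init()

    @ray.remote
    def process_partition(partition):
        return partition  # placeholder for the real per-partition work

    partitions = [...]   # placeholder: the pre-partitioned DataFrames
    MAX_IN_FLIGHT = 8    # hand-tuned for our workload; illustrative value

    pending, results = [], []
    for partition in partitions:
        if len(pending) >= MAX_IN_FLIGHT:
            # Block until at least one in-flight task finishes before submitting more.
            done, pending = ray.wait(pending, num_returns=1)
            results.extend(ray.get(done))
        pending.append(process_partition.remote(partition))

    # Drain whatever is still in flight.
    results.extend(ray.get(pending))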