ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Segmentation Fault when using multiprocessing.Queue #7793

Open qywu opened 4 years ago

qywu commented 4 years ago

What is the problem?

I am trying to use Ray to build a data processing pipeline in which I need a multiprocessing.Queue to store the object IDs of the results. (Previously, I used plasma directly and had no problems.)

However, when using Ray, a segmentation fault occurs after some iterations. I have attached code to reproduce this error.

Ray version and other system information (Python version, TensorFlow version, OS):

Ray version: 0.9.0dev
OS: Ubuntu 18.04

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

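import ray
import tqdm
from multiprocessing import Process, Queue

def _process(q):
    # Runs in the child process: create objects and pass their IDs to the parent.
    while True:
        obj_id = ray.put("123")
        q.put(obj_id)

if __name__ == "__main__":
    # Ray is initialized in the parent only, before the child process is started.
    ray.init()

    queue = Queue(maxsize=5)

    p = Process(target=_process, args=(queue, ))
    p.start()

    # The segmentation fault occurs after some iterations of this loop.
    for i in tqdm.trange(1000000):
        obj_id = queue.get()
        print(obj_id)
        item = ray.get(obj_id)
        print(item)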

If we cannot run your script, we cannot fix your issue.

qywu commented 4 years ago

Also, here is the backtrace:

Thread 14 "grpc_global_tim" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffb47fff700 (LWP 7009)]
0x00007ffff5c2490a in grpc_core::LockfreeEvent::SetReady() ()
   from /home/wuqy1203/anaconda3/lib/python3.7/site-packages/ray/_raylet.so
(gdb) bt
#0  0x00007ffff5c2490a in grpc_core::LockfreeEvent::SetReady() ()
   from /home/wuqy1203/anaconda3/lib/python3.7/site-packages/ray/_raylet.so
#1  0x00007ffff5c200ec in pollable_process_events(grpc_pollset*, pollable*, bool) () from /home/wuqy1203/anaconda3/lib/python3.7/site-packages/ray/_raylet.so
#2  0x00007ffff5c2149e in pollset_work ()
   from /home/wuqy1203/anaconda3/lib/python3.7/site-packages/ray/_raylet.so
#3  0x00007ffff5bcdb02 in run_poller(void*, grpc_error*) ()
   from /home/wuqy1203/anaconda3/lib/python3.7/site-packages/ray/_raylet.so
#4  0x00007ffff5c1cda6 in exec_ctx_run(grpc_closure*, grpc_error*) ()
   from /home/wuqy1203/anaconda3/lib/python3.7/site-packages/ray/_raylet.so
#5  0x00007ffff5c1d0a7 in grpc_core::ExecCtx::Flush() ()
   from /home/wuqy1203/anaconda3/lib/python3.7/site-packages/ray/_raylet.so
#6  0x00007ffff5c0f62f in timer_thread(void*) ()
   from /home/wuqy1203/anaconda3/lib/python3.7/site-packages/ray/_raylet.so
#7  0x00007ffff5c2c562 in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) ()
   from /home/wuqy1203/anaconda3/lib/python3.7/site-packages/ray/_raylet.so
#8  0x00007ffff7bbd6db in start_thread (arg=0x7ffb47fff700)
    at pthread_create.c:463
#9  0x00007ffff78e688f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

alasla commented 4 years ago

I didn't really dig into why, but I've also run across this. I don't get the segfault, but it doesn't work properly either: ray.get() just hangs. The brief testing I did makes me think that Ray does not work well after the fork() that happens when spawning the Process.
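To illustrate the general fork() hazard suspected above, here is a minimal standard-library sketch. It does not use Ray and does not confirm the cause of this issue. The names fake_init, fake_shutdown, probe_forked_child and the _runtime dict are hypothetical stand-ins for a runtime that starts background threads during initialization. The sketch assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp
import threading

# Hypothetical stand-in for a runtime that keeps a helper thread alive
# after initialization (this is NOT Ray code).
_runtime = {"initialized": False, "thread": None, "stop": threading.Event()}


def fake_init():
    """Mark the runtime as initialized and start one background thread."""
    _runtime["stop"].clear()
    t = threading.Thread(target=_runtime["stop"].wait, daemon=True)
    t.start()
    _runtime["thread"] = t
    _runtime["initialized"] = True


def fake_shutdown():
    """Stop the background thread and reset the flag."""
    _runtime["stop"].set()
    _runtime["thread"].join()
    _runtime["initialized"] = False


def _report(conn):
    """Runs in the child: report what the inherited state looks like."""
    conn.send({
        "initialized": _runtime["initialized"],
        "thread_alive": _runtime["thread"].is_alive(),
    })
    conn.close()


def probe_forked_child():
    """Fork a child and return its view of the parent's runtime state."""
    ctx = mp.get_context("fork")  # assumption: POSIX, "fork" is available
    parent_conn, child_conn = ctx.Pipe()
    p = ctx.Process(target=_report, args=(child_conn,))
    p.start()
    result = parent_conn.recv()
    p.join()
    return result


if __name__ == "__main__":
    fake_init()
    print("parent thread alive:", _runtime["thread"].is_alive())
    print("forked child sees:", probe_forked_child())
    fake_shutdown()
```

The forked child inherits the parent's memory, so it still sees the "initialized" flag, but only the thread that called fork() exists in the child, so the helper thread is reported as not alive. A library whose state depends on such threads could plausibly hang or crash in the child for this reason; whether that is what happens inside Ray here is an assumption, not something this thread establishes.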