google-deepmind / acme

A library of reinforcement learning components and agents
Apache License 2.0

Segmentation fault in using the new version of LocalLayout #235

Closed ethanluoyc closed 2 years ago

ethanluoyc commented 2 years ago

I managed to avoid blocking by setting a smaller value for the max in-flight samples (see https://github.com/deepmind/acme/issues/233), but now I get a new error: a segmentation fault.


Thread 0x00007f9bfe7fc700 (most recent call first):
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 78 in _worker
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f9c9673e700 (most recent call first):
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 78 in _worker
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f9cca7fc700 (most recent call first):
  File "/home/yicheng/virtualenvs/ot/lib/python3.8/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2918 in iterator_get_next
  File "/home/yicheng/virtualenvs/ot/lib/python3.8/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 819 in _next_internal
  File "/home/yicheng/virtualenvs/ot/lib/python3.8/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 836 in __next__
  File "/home/yicheng/virtualenvs/ot/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4407 in __next__
  File "/home/yicheng/projects/optimal_transport/scratch/acme/acme/jax/utils.py", line 247 in __next__
  File "/home/yicheng/projects/optimal_transport/scratch/acme/acme/jax/utils.py", line 539 in producer
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00007fa830a6d740 (most recent call first):
  File "/home/yicheng/virtualenvs/ot/lib/python3.8/site-packages/reverb/reverb_types.py", line 80 in from_serialized_proto
  File "/home/yicheng/virtualenvs/ot/lib/python3.8/site-packages/reverb/server.py", line 229 in info
  File "/home/yicheng/projects/optimal_transport/scratch/acme/acme/agents/agent.py", line 107 in <listcomp>
  File "/home/yicheng/projects/optimal_transport/scratch/acme/acme/agents/agent.py", line 106 in update
  File "/home/yicheng/projects/optimal_transport/scratch/acme/acme/jax/layouts/local_layout.py", line 140 in update
  File "/home/yicheng/projects/optimal_transport/scratch/acme/acme/environment_loop.py", line 115 in run_episode
  File "/home/yicheng/projects/optimal_transport/scratch/acme/acme/environment_loop.py", line 176 in run
  File "/home/yicheng/projects/optimal_transport/ilax/experiments/run_drq_v2.py", line 138 in main
  File "/home/yicheng/virtualenvs/ot/lib/python3.8/site-packages/absl/app.py", line 258 in _run_main
  File "/home/yicheng/virtualenvs/ot/lib/python3.8/site-packages/absl/app.py", line 312 in run
  File "/home/yicheng/projects/optimal_transport/ilax/experiments/run_drq_v2.py", line 152 in <module>
  File "/usr/lib/python3.8/runpy.py", line 87 in _run_code
  File "/usr/lib/python3.8/runpy.py", line 194 in _run_module_as_main
zsh: segmentation fault (core dumped)  CUDA_VISIBLE_DEVICES=0 WANDB_NAME="drq_fix_deadlock" =0.80 MUJOCO_GL="egl"  -

Looks like the issue comes from https://github.com/deepmind/acme/blob/2871e3216d2ffc2bc0ffea8b6a0e3071897608b9/acme/agents/agent.py#L105-L108. A segmentation fault occurs while deserializing the protobuf string from the internal table. I also occasionally get an error saying 'INVALID_ARGUMENT: Python buffer protocol is only defined for CPU buffers', but I haven't been able to reproduce that consistently.
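For anyone trying to capture a similar dump: per-thread output in the shape shown above is what Python's standard `faulthandler` module writes on a fatal signal. A minimal sketch of enabling it by hand follows; this is an assumption about how to get such output in general, not a statement about how the run above was configured (passing `-X faulthandler` or setting `PYTHONFAULTHANDLER=1` has the same effect).

```python
import faulthandler
import sys

# Dump the Python traceback of every thread to stderr if the process
# receives a fatal signal such as SIGSEGV or SIGABRT.
faulthandler.enable(all_threads=True)

# Optional: dump all thread stacks on demand, which helps when the
# process appears to be blocked rather than crashed.
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```

The dump only covers Python frames; a crash inside a C++ extension still needs a native debugger such as gdb to see the native stack.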

qstanczyk commented 2 years ago

Does it happen at the beginning of the training or at some point later on? Can this issue be reproduced with the example you shared in #233?

ethanluoyc commented 2 years ago

Yes, it does happen with that example; I only changed the num_parallel samples to 1. It happens towards the end of training, e.g. my most recent run failed at 1484000 steps. There appear to be two ways of crashing: one is a segmentation fault, and the other is the buffer protocol error. In my experience, disabling the device put when creating the dataset iterator seems to help. Another data point: running distributed works fine. I tried to debug this a bit more but haven't set up building reverb from source. The debug build seems out of sync.
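To make the "disable the device put in the iterator" workaround concrete, here is a rough sketch of a prefetching iterator where the background thread only reads host data and any transfer hook runs on the consuming thread. This is not acme's implementation; `prefetch`, `buffer_size`, and `to_device` are hypothetical names used purely for illustration.

```python
import queue
import threading

_END = object()  # sentinel marking exhaustion of the source iterator


def prefetch(iterable, buffer_size=2, to_device=None):
    """Yield items from `iterable`, read ahead by a background thread.

    `to_device` is a hypothetical hook (for example a device-put function).
    It is applied on the consumer thread, never on the producer thread.
    """
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        # The producer only moves host-side items into the queue.
        try:
            for item in iterable:
                buffer.put(item)
        finally:
            buffer.put(_END)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is _END:
            return
        yield to_device(item) if to_device is not None else item


# Stand-in data and hook; a real run would pass the dataset iterator.
doubled = list(prefetch(range(5), to_device=lambda x: x * 2))
print(doubled)
```

The sketch ignores error propagation from the producer and early termination by the consumer, both of which a real implementation would need to handle.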

qstanczyk commented 2 years ago

Can't repro it so far ;-(

ethanluoyc commented 2 years ago

Oh no... It does take a really long time for me to hit this (normally near the end of 1.5M steps). It could also be a difference in dependency versions (TensorFlow, protobuf, etc.). I will share here if I have more leads.

ethanluoyc commented 2 years ago

I updated my code to be compatible with d303127 (the changes to the builder interface), set up a gdb session to see what was going on, and caught the issue with gdb.

Here's the backtrace

#1  0x00000000005a9c18 in PyType_GenericAlloc ()
#2  0x00007fff7165a5bd in xla::PyBuffer::Make(std::shared_ptr<xla::PyClient>, std::shared_ptr<xla::PjRtBuffer>, std::shared_ptr<xla::Traceback>) ()
   from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#3  0x00007fff71666228 in xla::PyClient::BufferFromPyval(pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics) ()
   from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#4  0x00007fff713f608d in pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<tensorflow::StatusOr<pybind11::object>, xla::PyClient, pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(tensorflow::StatusOr<pybind11::object> (xla::PyClient::*)(pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(xla::PyClient*, pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics)#1}, tensorflow::StatusOr<pybind11::object>, xla::PyClient*, pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(pybind11::cpp_function::initialize<tensorflow::StatusOr<pybind11::object>, xla::PyClient, pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(tensorflow::StatusOr<pybind11::object> (xla::PyClient::*)(pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(xla::PyClient*, pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics)#1}&&, tensorflow::StatusOr<pybind11::object> (*)(xla::PyClient*, pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v 
const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call) const ()
   from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#5  0x00007fff713edc7b in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
   from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#6  0x00000000005f3989 in PyCFunction_Call ()
#7  0x00000000005f3e1e in _PyObject_MakeTpCall ()
#8  0x000000000050b183 in ?? ()
#9  0x0000000000570035 in _PyEval_EvalFrameDefault ()
#10 0x00000000005f6836 in _PyFunction_Vectorcall ()
#11 0x000000000056b0ae in _PyEval_EvalFrameDefault ()
#12 0x000000000056939a in _PyEval_EvalCodeWithName ()
#13 0x00000000005f6a13 in _PyFunction_Vectorcall ()
#14 0x00000000005f3547 in PyObject_Call ()
#15 0x000000000056c8cd in _PyEval_EvalFrameDefault ()
#16 0x00000000005f6836 in _PyFunction_Vectorcall ()
#17 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#18 0x00000000005f6836 in _PyFunction_Vectorcall ()
#19 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#20 0x000000000056939a in _PyEval_EvalCodeWithName ()
#21 0x000000000050aaa0 in ?? ()
#22 0x000000000056c28c in _PyEval_EvalFrameDefault ()
#23 0x000000000056939a in _PyEval_EvalCodeWithName ()
#24 0x00000000005f6a13 in _PyFunction_Vectorcall ()
#25 0x00000000005f3547 in PyObject_Call ()
#26 0x000000000056c8cd in _PyEval_EvalFrameDefault ()
#27 0x00000000005006d4 in ?? ()
#28 0x0000000000510b02 in PyIter_Next ()
#29 0x00007fff7146c6a6 in pybind11::iterator::advance() () from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#30 0x00007fff7160c4d5 in pybind11::object xla::PyTreeDef::UnflattenImpl<pybind11::iterable>(pybind11::iterable) const ()
   from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#31 0x00007fff7160c98d in xla::PyTreeDef::Unflatten(pybind11::iterable) const ()
   from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#32 0x00007fff7160854b in pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<pybind11::object, xla::PyTreeDef, pybind11::iterable, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::object (xla::PyTreeDef::*)(pybind11::iterable) const, pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(xla::PyTreeDef const*, pybind11::iterable)#1}, pybind11::object, xla::PyTreeDef const*, pybind11::iterable, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<pybind11::object, xla::PyTreeDef, pybind11::iterable, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::object (xla::PyTreeDef::*)(pybind11::iterable) const, pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(xla::PyTreeDef const*, pybind11::iterable)#1}&&, pybind11::object (*)(xla::PyTreeDef const*, pybind11::iterable), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#33 0x00007fff713edc7b in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
   from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#34 0x00000000005f3989 in PyCFunction_Call ()
#35 0x00000000005f3e1e in _PyObject_MakeTpCall ()
#36 0x000000000050b183 in ?? ()
#37 0x0000000000570035 in _PyEval_EvalFrameDefault ()
#38 0x000000000056939a in _PyEval_EvalCodeWithName ()
#39 0x00000000005f6a13 in _PyFunction_Vectorcall ()
#40 0x000000000056b0ae in _PyEval_EvalFrameDefault ()
#41 0x000000000056939a in _PyEval_EvalCodeWithName ()
#42 0x00000000005f6a13 in _PyFunction_Vectorcall ()
#43 0x0000000000570035 in _PyEval_EvalFrameDefault ()
#44 0x00000000005f6836 in _PyFunction_Vectorcall ()
#45 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#46 0x000000000056939a in _PyEval_EvalCodeWithName ()
#47 0x00000000005f6a13 in _PyFunction_Vectorcall ()
#48 0x000000000056b0ae in _PyEval_EvalFrameDefault ()
#49 0x00000000005f6836 in _PyFunction_Vectorcall ()
#50 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#51 0x00000000005f6836 in _PyFunction_Vectorcall ()
#52 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#53 0x00000000005f6836 in _PyFunction_Vectorcall ()
#54 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#55 0x00000000005f6836 in _PyFunction_Vectorcall ()
#56 0x00000000005f3547 in PyObject_Call ()
#57 0x000000000056c8cd in _PyEval_EvalFrameDefault ()
#58 0x00000000005f6836 in _PyFunction_Vectorcall ()
#59 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#60 0x00000000005f6836 in _PyFunction_Vectorcall ()
#61 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#62 0x00000000005f6836 in _PyFunction_Vectorcall ()
#63 0x000000000050aa2c in ?? ()
#64 0x00000000005f3547 in PyObject_Call ()
#65 0x0000000000655a9c in ?? ()
#66 0x0000000000675738 in ?? ()
#67 0x00007ffff7da0609 in start_thread (arg=<optimised out>) at pthread_create.c:477
#68 0x00007ffff7eda163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

It looks like there's an issue with jaxlib xla_extension, so maybe this is not an acme issue.

ethanluoyc commented 2 years ago

OK, I have at least managed to find a way to deterministically trigger the memory allocation issue by calling table.info.

Use the acme master branch with the dependencies installed, set PYTHONMALLOC=malloc_debug, then run the SAC example.

venv ❯ python run_sac.py --env_name control:cheetah:run
I0612 04:34:16.762904 140433669838656 __init__.py:69] MUJOCO_GL is not set, so an OpenGL backend will be chosen automatically.
/home/yicheng/projects/acme/venv/lib/python3.8/site-packages/glfw/__init__.py:906: GLFWError: (65544) b'X11: The DISPLAY environment variable is missing'
  warnings.warn(message, GLFWError)
I0612 04:34:16.835770 140433669838656 __init__.py:77] Successfully imported OpenGL backend: glfw
I0612 04:34:16.915179 140433669838656 __init__.py:31] MuJoCo library version is: 200
I0612 04:34:17.078244 140433669838656 xla_bridge.py:260] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0612 04:34:17.078425 140433669838656 xla_bridge.py:260] Unable to initialize backend 'gpu': NOT_FOUND: Could not find registered platform with name: "cuda". Available platform names are: Host Interpreter
I0612 04:34:17.078679 140433669838656 xla_bridge.py:260] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
W0612 04:34:17.078754 140433669838656 xla_bridge.py:265] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Fatal Python error: Python memory allocator called without holding the GIL
Python runtime state: initialized

Current thread 0x00007fb943015740 (most recent call first):
  File "/home/yicheng/projects/acme/venv/lib/python3.8/site-packages/reverb/server.py", line 228 in info
  File "/home/yicheng/projects/acme/acme/jax/experiments/run_experiment.py", line 93 in _disable_insert_blocking
  File "/home/yicheng/projects/acme/acme/jax/experiments/run_experiment.py", line 136 in <listcomp>
  File "/home/yicheng/projects/acme/acme/jax/experiments/run_experiment.py", line 136 in run_experiment
  File "run_sac.py", line 82 in main
  File "/home/yicheng/projects/acme/venv/lib/python3.8/site-packages/absl/app.py", line 258 in _run_main
  File "/home/yicheng/projects/acme/venv/lib/python3.8/site-packages/absl/app.py", line 312 in run
  File "run_sac.py", line 89 in <module>
zsh: abort (core dumped)  python run_sac.py --env_name control:cheetah:run
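The recipe above can also be scripted. `PYTHONMALLOC=malloc_debug` makes CPython use the system `malloc()` with debug hooks installed, and those hooks check that the GIL is held when the object and memory allocators are called, which is what produces the "called without holding the GIL" abort. A small sketch, where `run_with_debug_malloc` is a hypothetical helper and the child command is a stand-in that only echoes the setting:

```python
import os
import subprocess
import sys


def run_with_debug_malloc(argv):
    """Run a Python command with debug hooks installed on the allocators."""
    env = dict(os.environ, PYTHONMALLOC="malloc_debug")
    return subprocess.run(
        [sys.executable, *argv], env=env, capture_output=True, text=True
    )


# Stand-in child that only reports the allocator setting it sees.
result = run_with_debug_malloc(
    ["-c", "import os; print(os.environ['PYTHONMALLOC'])"]
)
print(result.stdout.strip())
```

To reproduce the abort itself, the argument list would instead be `["run_sac.py", "--env_name", "control:cheetah:run"]`, run from the examples directory as in the log above.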

I have created a fix for this issue.

qstanczyk commented 2 years ago

Thanks for the great investigation of the problem and for the fix. Your pull request is now merged in Reverb. I also have another change to run_experiment.py in flight, which eliminates the use of Table's info (it should run faster).

ethanluoyc commented 2 years ago

Sounds good. Eliminating table.info altogether would be nice. Right now, in order to pick up my fix, I have to build launchpad and reverb myself, and I think the launchpad nightly has not been built lately.