Closed ethanluoyc closed 2 years ago
Does it happen at the beginning of the training or at some point later on? Can this issue be reproduced with the example you shared in #233?
Yes, it can be reproduced with that example; I only changed the num_parallel samples to 1. It happens towards the end of training, e.g. my most recent run failed at 1484000 steps. There appear to be two ways of crashing: one is a segmentation fault, and the other is the buffer protocol error. In my experience, disabling the device put when creating the dataset iterator seems to help. Another data point: running distributed works fine. I tried to debug this a bit more but haven't set up building reverb from source, and the debug build seems out of sync.
Can't repro it so far ;-(
Oh no... It does take a really long time for me to hit this (normally near the end of 1.5M steps). It could also be a difference in the versions of the dependencies (TensorFlow, protobuf, etc.). I will share here if I have more leads.
I updated my code to be compatible with d303127 (the changes to the builder interface), set up a gdb session, and caught the issue with gdb.
Here's the backtrace
#1 0x00000000005a9c18 in PyType_GenericAlloc ()
#2 0x00007fff7165a5bd in xla::PyBuffer::Make(std::shared_ptr<xla::PyClient>, std::shared_ptr<xla::PjRtBuffer>, std::shared_ptr<xla::Traceback>) ()
from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#3 0x00007fff71666228 in xla::PyClient::BufferFromPyval(pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics) ()
from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#4 0x00007fff713f608d in pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<tensorflow::StatusOr<pybind11::object>, xla::PyClient, pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(tensorflow::StatusOr<pybind11::object> (xla::PyClient::*)(pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(xla::PyClient*, pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics)#1}, tensorflow::StatusOr<pybind11::object>, xla::PyClient*, pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(pybind11::cpp_function::initialize<tensorflow::StatusOr<pybind11::object>, xla::PyClient, pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(tensorflow::StatusOr<pybind11::object> (xla::PyClient::*)(pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(xla::PyClient*, pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics)#1}&&, tensorflow::StatusOr<pybind11::object> (*)(xla::PyClient*, pybind11::handle, xla::PjRtDevice*, bool, xla::PjRtClient::HostBufferSemantics), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v 
const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call) const ()
from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#5 0x00007fff713edc7b in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#6 0x00000000005f3989 in PyCFunction_Call ()
#7 0x00000000005f3e1e in _PyObject_MakeTpCall ()
#8 0x000000000050b183 in ?? ()
#9 0x0000000000570035 in _PyEval_EvalFrameDefault ()
#10 0x00000000005f6836 in _PyFunction_Vectorcall ()
#11 0x000000000056b0ae in _PyEval_EvalFrameDefault ()
#12 0x000000000056939a in _PyEval_EvalCodeWithName ()
#13 0x00000000005f6a13 in _PyFunction_Vectorcall ()
#14 0x00000000005f3547 in PyObject_Call ()
#15 0x000000000056c8cd in _PyEval_EvalFrameDefault ()
#16 0x00000000005f6836 in _PyFunction_Vectorcall ()
#17 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#18 0x00000000005f6836 in _PyFunction_Vectorcall ()
#19 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#20 0x000000000056939a in _PyEval_EvalCodeWithName ()
#21 0x000000000050aaa0 in ?? ()
#22 0x000000000056c28c in _PyEval_EvalFrameDefault ()
#23 0x000000000056939a in _PyEval_EvalCodeWithName ()
#24 0x00000000005f6a13 in _PyFunction_Vectorcall ()
#25 0x00000000005f3547 in PyObject_Call ()
#26 0x000000000056c8cd in _PyEval_EvalFrameDefault ()
#27 0x00000000005006d4 in ?? ()
#28 0x0000000000510b02 in PyIter_Next ()
#29 0x00007fff7146c6a6 in pybind11::iterator::advance() () from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#30 0x00007fff7160c4d5 in pybind11::object xla::PyTreeDef::UnflattenImpl<pybind11::iterable>(pybind11::iterable) const ()
from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#31 0x00007fff7160c98d in xla::PyTreeDef::Unflatten(pybind11::iterable) const ()
from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#32 0x00007fff7160854b in pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<pybind11::object, xla::PyTreeDef, pybind11::iterable, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::object (xla::PyTreeDef::*)(pybind11::iterable) const, pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(xla::PyTreeDef const*, pybind11::iterable)#1}, pybind11::object, xla::PyTreeDef const*, pybind11::iterable, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<pybind11::object, xla::PyTreeDef, pybind11::iterable, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::object (xla::PyTreeDef::*)(pybind11::iterable) const, pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(xla::PyTreeDef const*, pybind11::iterable)#1}&&, pybind11::object (*)(xla::PyTreeDef const*, pybind11::iterable), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#33 0x00007fff713edc7b in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
from /home/yicheng/virtualenvs/orlb/lib/python3.8/site-packages/jaxlib/xla_extension.so
#34 0x00000000005f3989 in PyCFunction_Call ()
#35 0x00000000005f3e1e in _PyObject_MakeTpCall ()
#36 0x000000000050b183 in ?? ()
#37 0x0000000000570035 in _PyEval_EvalFrameDefault ()
#38 0x000000000056939a in _PyEval_EvalCodeWithName ()
#39 0x00000000005f6a13 in _PyFunction_Vectorcall ()
#40 0x000000000056b0ae in _PyEval_EvalFrameDefault ()
#41 0x000000000056939a in _PyEval_EvalCodeWithName ()
#42 0x00000000005f6a13 in _PyFunction_Vectorcall ()
#43 0x0000000000570035 in _PyEval_EvalFrameDefault ()
#44 0x00000000005f6836 in _PyFunction_Vectorcall ()
#45 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#46 0x000000000056939a in _PyEval_EvalCodeWithName ()
#47 0x00000000005f6a13 in _PyFunction_Vectorcall ()
#48 0x000000000056b0ae in _PyEval_EvalFrameDefault ()
#49 0x00000000005f6836 in _PyFunction_Vectorcall ()
#50 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#51 0x00000000005f6836 in _PyFunction_Vectorcall ()
#52 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#53 0x00000000005f6836 in _PyFunction_Vectorcall ()
#54 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#55 0x00000000005f6836 in _PyFunction_Vectorcall ()
#56 0x00000000005f3547 in PyObject_Call ()
#57 0x000000000056c8cd in _PyEval_EvalFrameDefault ()
#58 0x00000000005f6836 in _PyFunction_Vectorcall ()
#59 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#60 0x00000000005f6836 in _PyFunction_Vectorcall ()
#61 0x000000000056b1da in _PyEval_EvalFrameDefault ()
#62 0x00000000005f6836 in _PyFunction_Vectorcall ()
#63 0x000000000050aa2c in ?? ()
#64 0x00000000005f3547 in PyObject_Call ()
#65 0x0000000000655a9c in ?? ()
#66 0x0000000000675738 in ?? ()
#67 0x00007ffff7da0609 in start_thread (arg=<optimised out>) at pthread_create.c:477
#68 0x00007ffff7eda163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
It looks like there's an issue with jaxlib xla_extension, so maybe this is not an acme issue.
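On the Python side, the gdb backtrace above only shows interpreter frames such as `_PyEval_EvalFrameDefault`. As a complementary aid (not something used in this thread), Python's standard `faulthandler` module can dump the Python-level traceback of every thread when the process receives a fatal signal such as SIGSEGV. A minimal sketch, where the helper name `enable_crash_tracebacks` and the log path are hypothetical:

```python
import faulthandler


def enable_crash_tracebacks(log_path):
    """Dump Python tracebacks for all threads to log_path on a fatal signal.

    Hypothetical helper for illustration; call it once at the top of the
    training script, before any worker threads are started.
    """
    # The file object must stay open for as long as faulthandler is enabled,
    # so it is returned to the caller instead of being closed here.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_tracebacks("crash_traceback.log")` at startup should leave a Python traceback in that file if the process later segfaults, which may help map the anonymous `??` frames back to Python source lines.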
OK, I have at least managed to find a way to deterministically trigger the memory allocation issue by calling table.info.
Use the acme master branch with the dependencies installed, set PYTHONMALLOC=malloc_debug, then run the SAC example.
venv ❯ python run_sac.py --env_name control:cheetah:run
I0612 04:34:16.762904 140433669838656 __init__.py:69] MUJOCO_GL is not set, so an OpenGL backend will be chosen automatically.
/home/yicheng/projects/acme/venv/lib/python3.8/site-packages/glfw/__init__.py:906: GLFWError: (65544) b'X11: The DISPLAY environment variable is missing'
warnings.warn(message, GLFWError)
I0612 04:34:16.835770 140433669838656 __init__.py:77] Successfully imported OpenGL backend: glfw
I0612 04:34:16.915179 140433669838656 __init__.py:31] MuJoCo library version is: 200
I0612 04:34:17.078244 140433669838656 xla_bridge.py:260] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0612 04:34:17.078425 140433669838656 xla_bridge.py:260] Unable to initialize backend 'gpu': NOT_FOUND: Could not find registered platform with name: "cuda". Available platform names are: Host Interpreter
I0612 04:34:17.078679 140433669838656 xla_bridge.py:260] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
W0612 04:34:17.078754 140433669838656 xla_bridge.py:265] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Fatal Python error: Python memory allocator called without holding the GIL
Python runtime state: initialized
Current thread 0x00007fb943015740 (most recent call first):
File "/home/yicheng/projects/acme/venv/lib/python3.8/site-packages/reverb/server.py", line 228 in info
File "/home/yicheng/projects/acme/acme/jax/experiments/run_experiment.py", line 93 in _disable_insert_blocking
File "/home/yicheng/projects/acme/acme/jax/experiments/run_experiment.py", line 136 in <listcomp>
File "/home/yicheng/projects/acme/acme/jax/experiments/run_experiment.py", line 136 in run_experiment
File "run_sac.py", line 82 in main
File "/home/yicheng/projects/acme/venv/lib/python3.8/site-packages/absl/app.py", line 258 in _run_main
File "/home/yicheng/projects/acme/venv/lib/python3.8/site-packages/absl/app.py", line 312 in run
File "run_sac.py", line 89 in <module>
zsh: abort (core dumped) python run_sac.py --env_name control:cheetah:run
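The reproduction steps above can also be scripted. This is a minimal sketch, assuming the script name and flag from the log (`run_sac.py`, `--env_name control:cheetah:run`); the helper name `run_with_malloc_debug` is hypothetical. Setting `PYTHONMALLOC=malloc_debug` installs CPython's allocator debug hooks, which check that the GIL is held when the object and memory allocators are called. That check is what produces the immediate fatal error shown in the log.

```python
import os
import subprocess
import sys


def run_with_malloc_debug(args):
    """Run a Python command line with CPython's allocator debug hooks enabled.

    Hypothetical helper for illustration: `args` are the arguments passed to
    the current Python interpreter (script path followed by its flags).
    """
    env = dict(os.environ)
    env["PYTHONMALLOC"] = "malloc_debug"
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Script name and flag are taken from the log above.
    result = run_with_malloc_debug(
        ["run_sac.py", "--env_name", "control:cheetah:run"]
    )
    print("return code:", result.returncode)
    print(result.stderr[-2000:])
```

On POSIX systems a negative return code means the child was killed by a signal, so an abort like the one in the log would be expected to show up as a negative value rather than a normal exit status.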
I have created a fix for this issue.
Thanks for the great investigation of the problem and for the fix. Your pull request is now merged into Reverb, and I have another change to run_experiment.py in flight that eliminates the use of Table's info (it should also run faster).
Sounds good. Eliminating table.info altogether would be nice. Right now, in order to pick up my fix, I have to build launchpad and reverb myself, and I think the launchpad nightly has not been built lately.
I managed to avoid blocking by setting a smaller value for the max in-flight samples in https://github.com/deepmind/acme/issues/233, but now I get a new segmentation fault error.
Looks like the issue comes from https://github.com/deepmind/acme/blob/2871e3216d2ffc2bc0ffea8b6a0e3071897608b9/acme/agents/agent.py#L105-L108. When deserializing the protobuf string from the internal table, a segmentation fault happened. I can occasionally get an error saying that 'INVALID_ARGUMENT: Python buffer protocol is only defined for CPU buffers' but I haven't been able to consistently reproduce that.