jax-ml / jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
http://jax.readthedocs.io/
Apache License 2.0

LLVM ERROR: pthread_create failed: Resource temporarily unavailable only when running inside a container #24631

Open Dekermanjian opened 2 weeks ago

Dekermanjian commented 2 weeks ago

Description

I am working on a Linux server where I need to run a NumPyro model in parallel. If I run my model directly on the server, everything works fine. However, when I run it inside a rootless Podman container, I get the following error message:

Check failed: ret == 0 (11 vs. 0)Thread tf_XLAEigen creation via pthread_create() failed.
LLVM ERROR: pthread_create failed: Resource temporarily unavailable

I am using the following ENV variables to try to limit jax's threading:

Since it runs perfectly fine outside the container, I am ruling out the following:

I don't know what else to try, and any help would be greatly appreciated.

System info (python version, jaxlib version, accelerator, etc.)

>>> jax.print_environment_info()
jax:    0.4.34
jaxlib: 0.4.34
numpy:  1.26.4
python: 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0]
jax.devices (1 total, 1 local): [CpuDevice(id=0)]
process_count: 1
platform: uname_result(system='Linux', node='209c45534d47', release='4.18.0-553.16.1.el8_10.x86_64', version='#1 SMP Thu Aug 8 17:47:08 UTC 2024', machine='x86_64')

hawkinsp commented 2 weeks ago

I don't suppose you can determine how many threads JAX is using here? Is there a simple repro I can try?
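One way to answer the thread-count question on Linux is to count the entries under /proc/<pid>/task, since each entry is one OS-level thread. The sketch below is only an illustration: the helper name native_thread_count is hypothetical and is not part of JAX. Python's threading.active_count() would not help here, because it does not see threads created by native code such as XLA.

```python
import os


def native_thread_count(pid="self"):
    """Count OS-level threads (tasks) of a process. Linux only.

    Each directory under /proc/<pid>/task corresponds to one thread,
    including threads started by native libraries.
    """
    return len(os.listdir(f"/proc/{pid}/task"))
```

Calling native_thread_count() from the same Python process after the model has started running should show how many threads exist at that moment.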

Dekermanjian commented 2 weeks ago

Hey @hawkinsp, thank you for the very fast response. After tearing my hair out for 2 days, I finally figured out what was happening. It turns out that Podman limits the number of PIDs that can be created by a container. I was able to override the limit by adding the following to $HOME/.config/containers/containers.conf:

[containers]
pids_limit=0

Note: Maximum number of processes allowed in a container. 0 indicates that no limit is imposed.

This fixes the problem.
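For anyone hitting the same error, a quick check from inside the container is to read the cgroup pids controller files. The pids controller counts threads as well as processes, which is why a PID limit can make pthread_create fail. This is a minimal sketch, not a definitive diagnostic: the helper names are hypothetical, and the two candidate directories are assumptions about where the cgroup filesystem is mounted (the unified cgroup v2 layout and the cgroup v1 pids hierarchy).

```python
from pathlib import Path

# Assumed mount points; which one exists depends on the cgroup version
# and on how the container runtime exposes the cgroup filesystem.
CANDIDATE_DIRS = [Path("/sys/fs/cgroup"), Path("/sys/fs/cgroup/pids")]


def parse_pids_value(text):
    """Parse pids.max or pids.current; the literal 'max' means no limit."""
    text = text.strip()
    return None if text == "max" else int(text)


def read_pids_limit(dirs=CANDIDATE_DIRS):
    """Return (limit, current) from the first directory that has pids.max.

    limit is None when unlimited; current is None when pids.current is
    missing. Returns None if no candidate directory has a pids.max file.
    """
    for d in dirs:
        max_file = d / "pids.max"
        cur_file = d / "pids.current"
        if max_file.is_file():
            limit = parse_pids_value(max_file.read_text())
            current = (
                parse_pids_value(cur_file.read_text())
                if cur_file.is_file()
                else None
            )
            return limit, current
    return None
```

If read_pids_limit() reports a current value close to the limit while the model is running, the container's PID limit is a likely cause of the failure.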