Open ayaka14732 opened 1 week ago
How did you install JAX?
@yashk2810 I ran these commands on all hosts:
python3.12 -m venv ~/venv
. ~/venv/bin/activate
pip install -U pip
pip install -U wheel
pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
Can you show me your pip freeze
?
@yashk2810
$ pip freeze
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jax==0.4.30
jaxlib==0.4.30
libtpu-nightly==0.1.dev20240617
ml-dtypes==0.4.0
numpy==2.0.0
opt-einsum==3.3.0
requests==2.32.3
scipy==1.14.0
urllib3==2.2.2
wheel==0.43.0
Can you try running JAX_DEBUG_LOG_MODULES=jax._src.xla_bridge python -c 'import jax; print(jax.devices())'
and paste the output here?
@skye
DEBUG:2024-06-25 00:34:55,039:jax._src.xla_bridge:575: No jax_plugins namespace packages available
DEBUG:2024-06-25 00:34:55,049:jax._src.xla_bridge:969: Initializing backend 'cpu'
DEBUG:2024-06-25 00:34:55,109:jax._src.xla_bridge:981: Backend 'cpu' initialized
DEBUG:2024-06-25 00:34:55,109:jax._src.xla_bridge:969: Initializing backend 'cuda'
INFO:2024-06-25 00:34:55,109:jax._src.xla_bridge:889: Unable to initialize backend 'cuda':
DEBUG:2024-06-25 00:34:55,109:jax._src.xla_bridge:969: Initializing backend 'rocm'
INFO:2024-06-25 00:34:55,109:jax._src.xla_bridge:889: Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
DEBUG:2024-06-25 00:34:55,109:jax._src.xla_bridge:969: Initializing backend 'tpu'
INFO:2024-06-25 00:34:55,142:jax._src.xla_bridge:889: Unable to initialize backend 'tpu': ABORTED: The TPU is already in use by process with pid 176028. Not attempting to load libtpu.so in this process.
WARNING:2024-06-25 00:34:55,143:jax._src.xla_bridge:940: A Google TPU may be present on this machine, but either a TPU-enabled jaxlib or libtpu is not installed. Falling back to cpu.
[CpuDevice(id=0)]
DEBUG:2024-06-25 00:34:55,812:jax._src.xla_bridge:575: No jax_plugins namespace packages available
DEBUG:2024-06-25 00:34:55,813:jax._src.xla_bridge:575: No jax_plugins namespace packages available
DEBUG:2024-06-25 00:34:55,836:jax._src.xla_bridge:575: No jax_plugins namespace packages available
DEBUG:2024-06-25 00:34:55,866:jax._src.xla_bridge:969: Initializing backend 'cpu'
DEBUG:2024-06-25 00:34:55,875:jax._src.xla_bridge:969: Initializing backend 'cpu'
DEBUG:2024-06-25 00:34:55,919:jax._src.xla_bridge:969: Initializing backend 'cpu'
DEBUG:2024-06-25 00:34:55,928:jax._src.xla_bridge:981: Backend 'cpu' initialized
DEBUG:2024-06-25 00:34:55,928:jax._src.xla_bridge:969: Initializing backend 'cuda'
INFO:2024-06-25 00:34:55,928:jax._src.xla_bridge:889: Unable to initialize backend 'cuda':
DEBUG:2024-06-25 00:34:55,928:jax._src.xla_bridge:969: Initializing backend 'rocm'
INFO:2024-06-25 00:34:55,928:jax._src.xla_bridge:889: Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
DEBUG:2024-06-25 00:34:55,928:jax._src.xla_bridge:969: Initializing backend 'tpu'
DEBUG:2024-06-25 00:34:55,936:jax._src.xla_bridge:981: Backend 'cpu' initialized
DEBUG:2024-06-25 00:34:55,936:jax._src.xla_bridge:969: Initializing backend 'cuda'
INFO:2024-06-25 00:34:55,936:jax._src.xla_bridge:889: Unable to initialize backend 'cuda':
DEBUG:2024-06-25 00:34:55,936:jax._src.xla_bridge:969: Initializing backend 'rocm'
INFO:2024-06-25 00:34:55,936:jax._src.xla_bridge:889: Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
DEBUG:2024-06-25 00:34:55,936:jax._src.xla_bridge:969: Initializing backend 'tpu'
INFO:2024-06-25 00:34:55,960:jax._src.xla_bridge:889: Unable to initialize backend 'tpu': ABORTED: The TPU is already in use by process with pid 463521. Not attempting to load libtpu.so in this process.
WARNING:2024-06-25 00:34:55,961:jax._src.xla_bridge:940: A Google TPU may be present on this machine, but either a TPU-enabled jaxlib or libtpu is not installed. Falling back to cpu.
[CpuDevice(id=0)]
INFO:2024-06-25 00:34:55,970:jax._src.xla_bridge:889: Unable to initialize backend 'tpu': ABORTED: The TPU is already in use by process with pid 482056. Not attempting to load libtpu.so in this process.
WARNING:2024-06-25 00:34:55,971:jax._src.xla_bridge:940: A Google TPU may be present on this machine, but either a TPU-enabled jaxlib or libtpu is not installed. Falling back to cpu.
[CpuDevice(id=0)]
DEBUG:2024-06-25 00:34:55,982:jax._src.xla_bridge:981: Backend 'cpu' initialized
DEBUG:2024-06-25 00:34:55,982:jax._src.xla_bridge:969: Initializing backend 'cuda'
INFO:2024-06-25 00:34:55,982:jax._src.xla_bridge:889: Unable to initialize backend 'cuda':
DEBUG:2024-06-25 00:34:55,982:jax._src.xla_bridge:969: Initializing backend 'rocm'
INFO:2024-06-25 00:34:55,982:jax._src.xla_bridge:889: Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
DEBUG:2024-06-25 00:34:55,982:jax._src.xla_bridge:969: Initializing backend 'tpu'
INFO:2024-06-25 00:34:56,015:jax._src.xla_bridge:889: Unable to initialize backend 'tpu': ABORTED: The TPU is already in use by process with pid 149052. Not attempting to load libtpu.so in this process.
WARNING:2024-06-25 00:34:56,016:jax._src.xla_bridge:940: A Google TPU may be present on this machine, but either a TPU-enabled jaxlib or libtpu is not installed. Falling back to cpu.
[CpuDevice(id=0)]
From the logs I realised the actual reason is that the TPU is used by another process. It works after the process is killed.
Ah. This is supposed to be raised as an exception instead of falling back to CPU. That functionality must have regressed. Now to figure out why...
Description
I have installed the TPU version of JAX (including jaxlib and libtpu) on all hosts of a TPU Pod inside a venv. Then, I run the following command on all hosts:
I got this error:
System info (python version, jaxlib version, accelerator, etc.)