jax-ml / jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
http://jax.readthedocs.io/
Apache License 2.0
30.59k stars 2.82k forks source link

'libtpu.so already in use' but actually not used #10192

Open ayaka14732 opened 2 years ago

ayaka14732 commented 2 years ago

Sep 2022 Update

Solution: Run

rm -rf /tmp/libtpu_lockfile /tmp/tpu_logs

before running Python.


Original Post

We can test if TPU is being used by this command:

python -c 'import jax; print(jax.devices())'

In theory, if the TPU is not in use, it will print TpuDevice; otherwise, it will print CpuDevice, and a warning will be shown:

I0000 00:00:1649423660.053391 1924758 f236.cc:165] libtpu.so already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. Not attempting to load libtpu.so in this process.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

However, sometimes the command shows that TPU is being used, but I can be sure that the TPU is not being used. Besides, sudo lsof -w /dev/accel0 shows no process is using TPU.

In order to rule out the possibility that another process that was using TPU just exited, I reran the command for several times and the results are the same.

This bug even happens when I created multiple users on the TPU VM. I login in as one user and it shows that the TPU is in used, but then I immediately log in as another user and it works fine.

I want to help to debug this issue but I don't know where to start.

ayaka14732 commented 2 years ago

@skye I found at least one of the cause of the problem:

$ export TF_CPP_MIN_LOG_LEVEL=0
$ python3 -c 'import jax; print(jax.devices())'
2022-04-09 23:15:46.156187: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.156229: I external/org_tensorflow/tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
WARNING: Logging before InitGoogle() is written to STDERR
I0409 23:15:46.181567 1339488 tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.195009: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.198059: I external/org_tensorflow/tensorflow/core/tpu/tpu_executor_dlsym_initializer.cc:68] Libtpu path is: libtpu.so
2022-04-09 23:15:46.198096: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.833145: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:171] XLA service 0x31c7860 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2022-04-09 23:15:46.833194: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:179]   StreamExecutor device (0): Interpreter, <undefined>
2022-04-09 23:15:46.850903: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/tfrt_cpu_pjrt_client.cc:163] TfrtCpuClient created.
2022-04-09 23:15:46.851741: I external/org_tensorflow/tensorflow/stream_executor/tpu/tpu_platform_interface.cc:74] No TPU platform found.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

So I opened tensorflow/core/tpu/tpu_initializer_helper.cc, it shows that the program checks a lock file /tmp/libtpu_lockfile.

So I:

$ sudo rm -f /tmp/libtpu_lockfile
$ python3 -c 'import jax; print(jax.devices())'

Then it complains about /tmp/tpu_logs, so I removed the directory.

After that the command works:

$ python3 -c 'import jax; print(jax.devices())'
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(0,0,0), core_on_chip=1), TpuDevice(id=2, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,0,0), core_on_chip=1), TpuDevice(id=4, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=5, process_index=0, coords=(0,1,0), core_on_chip=1), TpuDevice(id=6, process_index=0, coords=(1,1,0), core_on_chip=0), TpuDevice(id=7, process_index=0, coords=(1,1,0), core_on_chip=1)]
ayaka14732 commented 2 years ago

@skye This is indeed the cause of the problem! And this also explains why TPU is available for one user but not for others:

(user1) $ ls -l /tmp/libtpu_lockfile
-rw-r--r-- 1 user1 user1 0 Mar 30 10:32 /tmp/libtpu_lockfile
(user1) $ python -c 'import jax; print(jax.devices())'  # works
(user1) $ sudo -H -- su -
(root) $ python -c 'import jax; print(jax.devices())'  # not working
(root) $ chown root:root /tmp/libtpu_lockfile
(root) $ python -c 'import jax; print(jax.devices())'  # works
(root) $ exit
(user1) $ python -c 'import jax; print(jax.devices())'  # not working
(user1) $ sudo chmod go+rw /tmp/libtpu_lockfile
(user1) $ python -c 'import jax; print(jax.devices())'  # works
skye commented 2 years ago

Hey @ayaka14732, sorry for the delay! I was hiking in Nepal :mountain_snow:

Just to make sure I understand, is the issue that /tmp/libtpu_lockfile sometimes exists even when no process is using the TPU? I'm not sure what "works" and "not works" means in your comment above.

ayaka14732 commented 2 years ago

@skye Haven't seen you on GitHub for a long time. I miss you so much!

Just to make sure I understand, is the issue that /tmp/libtpu_lockfile sometimes exists even when no process is using the TPU?

Yes!

I'm not sure what "works" and "not works" means in your comment above.

"works" means can access TPU (prints TpuDevice), otherwise it prints CpuDevice.

skye commented 2 years ago

This is strange. The lockfile should be automatically deleted when the Python process exits. Can you rerun the same set of commands again, but make sure the lockfile doesn't exist at the beginning? (i.e. ls -l /tmp/libtpu_lockfile should say the file doesn't exist)

ayaka14732 commented 2 years ago

Sometimes the Python process was aborted, and therefore the file is not removed. For example,

import jax
devices = jax.devices()
import sys
sys.exit(-1)

The set of commands above was an explanation that /tmp/libtpu_lockfile is indeed the cause of the problem.

shashank2000 commented 2 years ago

I'm having the exact same issue - but deleting the lockfile doesn't seem to fix it. Any pointers would be super appreciated! @skye @ayaka14732

skye commented 2 years ago

@shashank2000 that error message is only triggered by the presence of the lockfile. Please doublecheck that the lockfile isn't there and that you're getting the exact error message?

shashank2000 commented 2 years ago

@skye I can verify I still see the message even after I delete both tpu_logs and the lockfile. I've been running processes as Python subprocesses through a script - is there a reason to believe this could be an issue?

EDIT: can confirm that things work 'fine' when I run my commands standalone as opposed to a subprocess or os.system call which is what used to be the case; is there a reason my parent script (which doesn't even have a jax dependency, all I'm doing is parsing args and running the appropriate training job) is the process holding the lock?

janimo commented 1 year ago

The underlying lock file (/tmp/libtpu_lockfile) is not removed on program exit but reused across runs and users.

So the open() permissions for it should be 0666 in order to work with multiple users (different logins on the same TPU VMs for example). The lockfile will only be openable in write mode by the first user that creates it.

Setting permissions to 0666 instead of 0644 will allow others to open it and thus have a valid fd for the lockf() call below.

https://github.com/tensorflow/tensorflow/blob/68c69e710f95f9a5a8fa91c1fbcb642166176944/tensorflow/core/tpu/tpu_initializer_helper.cc#L159

ayaka14732 commented 1 year ago

@shashank2000

I'm having the exact same issue - but deleting the lockfile doesn't seem to fix it. Any pointers would be super appreciated!

This would happen if the Python program is still running.

I use ps aux | grep python to view all existing Python programs. If the program is still running, I use kill -9 <PID> to kill it.

JadM133 commented 2 weeks ago

It's been a while since someone commented on this.

But if anyone is having the same issue, in my case ps aux | grep python didn't show my processes when connected by ssh to the TPU node, but top did.

It also shows the PID, which can then be used to kill the process as mentioned above by @ayaka14732 .