Open ayaka14732 opened 2 years ago
@skye I found at least one of the cause of the problem:
$ export TF_CPP_MIN_LOG_LEVEL=0
$ python3 -c 'import jax; print(jax.devices())'
2022-04-09 23:15:46.156187: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.156229: I external/org_tensorflow/tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
WARNING: Logging before InitGoogle() is written to STDERR
I0409 23:15:46.181567 1339488 tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.195009: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.198059: I external/org_tensorflow/tensorflow/core/tpu/tpu_executor_dlsym_initializer.cc:68] Libtpu path is: libtpu.so
2022-04-09 23:15:46.198096: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.833145: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:171] XLA service 0x31c7860 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2022-04-09 23:15:46.833194: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:179] StreamExecutor device (0): Interpreter, <undefined>
2022-04-09 23:15:46.850903: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/tfrt_cpu_pjrt_client.cc:163] TfrtCpuClient created.
2022-04-09 23:15:46.851741: I external/org_tensorflow/tensorflow/stream_executor/tpu/tpu_platform_interface.cc:74] No TPU platform found.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
So I opened tensorflow/core/tpu/tpu_initializer_helper.cc, it shows that the program checks a lock file /tmp/libtpu_lockfile
.
So I:
$ sudo rm -f /tmp/libtpu_lockfile
$ python3 -c 'import jax; print(jax.devices())'
Then it complains about /tmp/tpu_logs
, so I removed the directory.
After that the command works:
$ python3 -c 'import jax; print(jax.devices())'
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(0,0,0), core_on_chip=1), TpuDevice(id=2, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,0,0), core_on_chip=1), TpuDevice(id=4, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=5, process_index=0, coords=(0,1,0), core_on_chip=1), TpuDevice(id=6, process_index=0, coords=(1,1,0), core_on_chip=0), TpuDevice(id=7, process_index=0, coords=(1,1,0), core_on_chip=1)]
@skye This is indeed the cause of the problem! And this also explains why TPU is available for one user but not for others:
(user1) $ ls -l /tmp/libtpu_lockfile
-rw-r--r-- 1 user1 user1 0 Mar 30 10:32 /tmp/libtpu_lockfile
(user1) $ python -c 'import jax; print(jax.devices())' # works
(user1) $ sudo -H -- su -
(root) $ python -c 'import jax; print(jax.devices())' # not working
(root) $ chown root:root /tmp/libtpu_lockfile
(root) $ python -c 'import jax; print(jax.devices())' # works
(root) $ exit
(user1) $ python -c 'import jax; print(jax.devices())' # not working
(user1) $ sudo chmod go+rw /tmp/libtpu_lockfile
(user1) $ python -c 'import jax; print(jax.devices())' # works
Hey @ayaka14732, sorry for the delay! I was hiking in Nepal :mountain_snow:
Just to make sure I understand, is the issue that /tmp/libtpu_lockfile
sometimes exists even when no process is using the TPU? I'm not sure what "works" and "not works" means in your comment above.
@skye Haven't seen you on GitHub for a long time. I miss you so much!
Just to make sure I understand, is the issue that /tmp/libtpu_lockfile sometimes exists even when no process is using the TPU?
Yes!
I'm not sure what "works" and "not works" means in your comment above.
"works" means can access TPU (prints TpuDevice
), otherwise it prints CpuDevice
.
This is strange. The lockfile should be automatically deleted when the Python process exits. Can you rerun the same set of commands again, but make sure the lockfile doesn't exist at the beginning? (i.e. ls -l /tmp/libtpu_lockfile
should say the file doesn't exist)
Sometimes the Python process was aborted, and therefore the file is not removed. For example,
import jax
devices = jax.devices()
import sys
sys.exit(-1)
The set of commands above was an explanation that /tmp/libtpu_lockfile
is indeed the cause of the problem.
I'm having the exact same issue - but deleting the lockfile doesn't seem to fix it. Any pointers would be super appreciated! @skye @ayaka14732
@shashank2000 that error message is only triggered by the presence of the lockfile. Please doublecheck that the lockfile isn't there and that you're getting the exact error message?
@skye I can verify I still see the message even after I delete both tpu_logs
and the lockfile. I've been running processes as Python subprocesses through a script - is there a reason to believe this could be an issue?
EDIT: can confirm that things work 'fine' when I run my commands standalone as opposed to a subprocess
or os.system
call which is what used to be the case; is there a reason my parent script (which doesn't even have a jax dependency, all I'm doing is parsing args and running the appropriate training job) is the process holding the lock?
The underlying lock file (/tmp/libtpu_lockfile) is not removed on program exit but reused across runs and users.
So the open() permissions for it should be 0666 in order to work with multiple users (different logins on the same TPU VMs for example). The lockfile will only be openable in write mode by the first user that creates it.
Setting permissions to 0666 instead of 0644 will allow others to open it and thus have a valid fd for the lockf() call below.
@shashank2000
I'm having the exact same issue - but deleting the lockfile doesn't seem to fix it. Any pointers would be super appreciated!
This would happen if the Python program is still running.
I use ps aux | grep python
to view all existing Python programs. If the program is still running, I use kill -9 <PID>
to kill it.
It's been a while since someone commented on this.
But if anyone is having the same issue, in my case ps aux | grep python
didn't show my processes when connected by ssh to the TPU node, but top
did.
It also shows the PID, which can then be used to kill the process as mentioned above by @ayaka14732 .
Sep 2022 Update
Solution: Run
before running Python.
Original Post
We can test if TPU is being used by this command:
In theory, if the TPU is not in use, it will print
TpuDevice
; otherwise, it will printCpuDevice
, and a warning will be shown:However, sometimes the command shows that TPU is being used, but I can be sure that the TPU is not being used. Besides,
sudo lsof -w /dev/accel0
shows no process is using TPU.In order to rule out the possibility that another process that was using TPU just exited, I reran the command for several times and the results are the same.
This bug even happens when I created multiple users on the TPU VM. I login in as one user and it shows that the TPU is in used, but then I immediately log in as another user and it works fine.
I want to help to debug this issue but I don't know where to start.