Open pawalt opened 4 days ago
None of these env variables are present on the host - it populates them dynamically
Seems like the sandbox cannot reach the GCE metadata server for some reason. I'm surprised that --network=host does not fix this. I will investigate and try to repro myself. If you can share your strace logs, that will help me debug the issue as well.
runsc.log.20240918-170153.986977.boot.txt Here are the strace logs - had to upload a file as it's quite a lot of logs.
It looks like --network=host is not actually in effect in the logs you sent.
From runsc.log.20240918-170153.986977.boot.txt:
D0918 17:01:54.094908 1 config.go:439] Config.Network (--network): sandbox
You'll also want to run runsc in the host's network namespace to allow proper host network access. Right now it looks like you're running in a separate network namespace.
From runsc.log.20240918-170153.986977.boot.txt:
"linux": {
  "namespaces": [
    {
      "type": "pid"
    },
    {
      "type": "network"
    },
    {
      "type": "ipc"
    },
    {
      "type": "uts"
    },
    {
      "type": "mount"
    }
  ]
}
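As a rough illustration of the namespace point above, here is a minimal Python sketch that removes the "network" entry from linux.namespaces in an OCI config.json. It relies on the OCI runtime spec rule that a namespace type missing from the array is inherited from the runtime, so the sandbox would share the host's network namespace when runsc itself is started there. The helper names drop_network_namespace and rewrite_config are made up for this example and are not part of runsc.

```python
import copy
import json


def drop_network_namespace(spec):
    """Return a copy of an OCI spec with no "network" entry in linux.namespaces.

    Per the OCI runtime spec, a namespace type that is not listed is
    inherited from the runtime, i.e. no new network namespace is created.
    """
    new_spec = copy.deepcopy(spec)
    linux = new_spec.get("linux")
    if linux and "namespaces" in linux:
        linux["namespaces"] = [
            ns for ns in linux["namespaces"] if ns.get("type") != "network"
        ]
    return new_spec


def rewrite_config(path):
    """Hypothetical helper: apply the change to a config.json in place."""
    with open(path) as f:
        spec = json.load(f)
    with open(path, "w") as f:
        json.dump(drop_network_namespace(spec), f, indent=2)
```

Calling rewrite_config("config.json") in the bundle directory before running runsc with --network=host would apply the change. Whether that alone is enough for the metadata lookup to succeed is not confirmed in this thread.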
FWIW I tested this on my own V5 GCE VM and it worked.
Here was my /dev/vfio directory:
$ stat /dev/vfio/*
File: /dev/vfio/0
Size: 0 Blocks: 0 IO Block: 4096 character special file
Device: 5h/5d Inode: 334 Links: 1 Device type: ef,0
Access: (0666/crw-rw-rw-) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2024-09-18 21:11:48.884000071 +0000
Modify: 2024-09-18 21:11:48.884000071 +0000
Change: 2024-09-18 21:11:48.888000071 +0000
Birth: 2024-09-18 21:11:48.884000071 +0000
File: /dev/vfio/vfio
Size: 0 Blocks: 0 IO Block: 4096 character special file
Device: 5h/5d Inode: 141 Links: 1 Device type: a,c4
Access: (0666/crw-rw-rw-) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2024-09-18 21:11:48.832000070 +0000
Modify: 2024-09-18 21:11:48.832000070 +0000
Change: 2024-09-18 21:11:48.832000070 +0000
Birth: 2024-09-18 21:11:45.235863372 +0000
Here was my config.json: https://gist.github.com/manninglucas/14de68aab7abaab02cf41553f900782e
My command was: sudo ./runsc --debug --debug-log=debug.txt --network=host --tpuproxy run bash
In the container, I ran:
$ pip install jax[tpu] -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
$ python
>>> import jax
>>> jax.device_count()
I'll just add as a note that these environment variables are piped into the configs automatically in GKE. In GCE you'll either have to add the environment variables to your spec yourself (maybe by fetching them from the metadata server before starting a sandbox) or give the sandbox at least some host network access.
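For the first option (adding the variables to the spec yourself), a minimal Python sketch might look like the following. It assumes the spec is a standard OCI config.json, where process.env is a list of "KEY=value" strings, and that values are read from the GCE metadata server with the Metadata-Flavor: Google header. The thread does not name the variables JAX needs or the metadata paths they come from, so both are left to the caller. fetch_metadata and merge_env are hypothetical helper names.

```python
import copy
import urllib.request

# Base URL of the GCE metadata server (reachable from the host VM).
METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1/"


def fetch_metadata(path, timeout=5.0):
    """Fetch one metadata value on the host. `path` is caller-supplied;
    which paths hold the TPU settings is NOT specified in this thread."""
    req = urllib.request.Request(
        METADATA_ROOT + path, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def merge_env(spec, extra):
    """Return a copy of an OCI spec with `extra` merged into process.env.

    Existing entries with the same key are replaced; others are kept.
    """
    new_spec = copy.deepcopy(spec)
    process = new_spec.setdefault("process", {})
    env = process.get("env") or []
    kept = [e for e in env if e.split("=", 1)[0] not in extra]
    process["env"] = kept + [f"{k}={v}" for k, v in extra.items()]
    return new_spec
```

The idea is to call fetch_metadata on the host for each required value, pass the results to merge_env, and write the returned spec back to config.json before starting the sandbox.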
Description
When a TPU container is initialized, it's missing some environment variables that JAX needs in order to initialize. In the absence of these variables, JAX attempts to look up their values over the network. This fails as the container may not have direct access to the network.
I have also tried this with network=host to no avail.
Steps to reproduce
Run a jax image with --tpuproxy:
Runsc command:
Start the container:
I've built this image by exporting the following dockerfile:
runsc version
docker version (if using docker)
uname
uname -a Linux t1v-n-1f714773-w-0 5.19.0-1022-gcp #24~22.04.1-Ubuntu SMP Sun Apr 23 09:51:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
git describe release-20240826.0-81-g4bcbb55fc
runsc debug logs (if available)
No response