google-research / scenic

Scenic: A Jax Library for Computer Vision Research and Beyond
Apache License 2.0
3.34k stars 441 forks

owl_vit training problem: unable to connect to Google during training, so the data cannot be retrieved. How can the data be stored locally and read from there? #1091

Open lxyzler opened 3 months ago

lxyzler commented 3 months ago

python -m scenic.projects.owl_vit.main --alsologtostderr=true --workdir=/tmp/training --config=scenic/projects/owl_vit/configs/clip_b32_finetune.py

2024-08-08 01:14:33.266603: W external/xla/xla/service/gpu/nvptx_compiler.cc:836] The NVIDIA driver's CUDA version is 12.2 which is older than the PTX compiler version (12.5.82). Because the driver is older than the PTX compiler version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
I0808 01:14:37.952140 140605538547520 app.py:92] JAX host: 0 / 1
I0808 01:14:37.952368 140605538547520 app.py:93] JAX devices: [CudaDevice(id=0), CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3), CudaDevice(id=4), CudaDevice(id=5), CudaDevice(id=6), CudaDevice(id=7)]
I0808 01:14:37.952456 140605538547520 local.py:45] Setting task status: host_id: 0, hostcount: 1
I0808 01:14:37.952512 140605538547520 local.py:50] Created artifact Workdir of type ArtifactType.DIRECTORY and value /tmp/training.
I0808 01:14:37.954501 140605538547520 app.py:104] RNG: [0 0]
I0808 01:14:38.603692 140605538547520 checkpoints.py:1101] Found no checkpoint files in /tmp/training with prefix checkpoint
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
I0808 01:14:38.604115 140605538547520 train_utils.py:380] device_count: 8
I0808 01:14:38.604308 140605538547520 train_utils.py:381] num_hosts : 1
I0808 01:14:38.604445 140605538547520 train_utils.py:382] host_id : 0
I0808 01:14:38.605386 140605538547520 train_utils.py:405] local_batch_size : 256
I0808 01:14:38.605548 140605538547520 train_utils.py:406] device_batch_size : 32
2024-08-08 01:14:38.973571: W external/local_tsl/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".
2024-08-08 01:15:39.983812: E external/local_tsl/tsl/platform/cloud/curl_http_request.cc:610] The transmission of request 0xdc1a0d0 (URI: https://www.googleapis.com/storage/v1/b/tfds-data/o/dataset_info%2Flvis%2F1.3.0?fields=size%2Cgeneration%2Cupdated) has been stuck at 0 of 0 bytes for 61 seconds and will be aborted. CURL timing information: lookup time: 0.010952 (No error), connect time: 0 (No error), pre-transfer time: 0 (No error), start-transfer time: 0 (No error)
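
For anyone reading the log above, the request that times out is a Google Cloud Storage lookup, and its URI already says which dataset is being requested. The following is only a minimal sketch using the Python standard library to decode that URI and to guess where a locally prepared copy might be expected. The helper names `gcs_object_from_uri` and `expected_local_dir` are hypothetical, the path `/data/tfds` is a placeholder, and the `<data_dir>/<name>/<version>` layout is an assumption about how TensorFlow Datasets usually arranges prepared data (it ignores builder configs), not something confirmed in this issue.

```python
import os
from urllib.parse import unquote, urlsplit

# URI copied from the curl_http_request.cc error line in the log above.
FAILED_URI = (
    "https://www.googleapis.com/storage/v1/b/tfds-data/o/"
    "dataset_info%2Flvis%2F1.3.0?fields=size%2Cgeneration%2Cupdated"
)


def gcs_object_from_uri(uri):
    """Split a GCS JSON-API URI into (bucket, decoded object path).

    Hypothetical helper. The path is expected to look like
    /storage/v1/b/<bucket>/o/<percent-encoded object name>.
    """
    parts = urlsplit(uri).path.split("/")
    bucket = parts[parts.index("b") + 1]
    obj = unquote(parts[parts.index("o") + 1])
    return bucket, obj


def expected_local_dir(obj, data_dir):
    """Guess the local directory for a dataset_info/<name>/<version> object.

    Hypothetical helper. Assumes the usual <data_dir>/<name>/<version>
    layout with no builder config in between.
    """
    _, name, version = obj.split("/")
    return os.path.join(os.path.expanduser(data_dir), name, version)


bucket, obj = gcs_object_from_uri(FAILED_URI)
print(bucket, obj)  # tfds-data dataset_info/lvis/1.3.0

# "/data/tfds" is a placeholder for wherever the dataset would be stored.
local_dir = expected_local_dir(obj, "/data/tfds")
print(local_dir, "exists" if os.path.isdir(local_dir) else "missing")
```

So the run is asking the `tfds-data` bucket for the metadata of `lvis` version `1.3.0`. If the second line prints "missing", no prepared copy was found at the guessed location on this machine.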