google-research / t5x


Error when saving the checkpoint #452

Open kunfang98927 opened 2 years ago

kunfang98927 commented 2 years ago

Hi! I ran into an issue when saving the checkpoint. I also described it in a comment under #446

The issue occurred after training for 100 steps, when the checkpoint was saved to the absolute path '/gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/'

My TensorStore version is 0.1.19.

As that solution mentioned, a relative path may cause problems, so I changed it to an absolute path.

When I used a relative path at the beginning, I got an error similar to the one in that comment. The error message:

ValueError: Error opening "zarr" driver: Error reading local file "./pretrain_model/checkpoint_5000.tmp-1650694933/state.param_states.decoder.decoder_norm.scale.v/.zarray": Invalid key: "./pretrain_model/checkpoint_5000.tmp-1650694933/state.param_states.decoder.decoder_norm.scale.v/.zarray" In call to configurable 'train' (<function train at 0x7fa6818348c0>))

Then I changed the path to an absolute one and the issue above was solved, but a new issue occurred.

ValueError: Error opening "zarr" driver: Error writing local file "/gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/checkpoint_100.tmp-1650949633/state.param_states.decoder.layers_0.pre_cross_attention_layer_norm.scale.v/.zarray": Failed to acquire lock on file: /gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/checkpoint_100.tmp-1650949633/state.param_states.decoder.layers_0.pre_cross_attention_layer_norm.scale.v/.zarray.__lock [OS error: Invalid argument] In call to configurable 'train' (<function train at 0x7f651e1e78c0>)

I tried deleting all the files in '/gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/' and training again, but the issue persisted.

The detailed error message:

I0426 13:05:36.808195 140074531202880 train.py:516] Epoch 0 of 10000
I0426 13:05:36.808564 140055117031168 logging_writer.py:48] [0] collection=train timing/compilation_seconds=160.272345
I0426 13:05:36.828166 140074531202880 train.py:522] BEGIN Train loop.
I0426 13:05:36.828350 140074531202880 train.py:527] Training for 100 steps.
I0426 13:05:36.833504 140074531202880 trainer.py:517] Training: step 0
I0426 13:05:47.585027 140074531202880 trainer.py:517] Training: step 12
I0426 13:05:58.556400 140074531202880 trainer.py:517] Training: step 23
I0426 13:06:09.237899 140074531202880 trainer.py:517] Training: step 34
I0426 13:06:19.734536 140074531202880 trainer.py:517] Training: step 45
I0426 13:06:30.668152 140074531202880 trainer.py:517] Training: step 56
I0426 13:06:41.496444 140074531202880 trainer.py:517] Training: step 67
I0426 13:06:52.412244 140074531202880 trainer.py:517] Training: step 78
I0426 13:07:03.236425 140074531202880 trainer.py:517] Training: step 89
I0426 13:07:13.692245 140074531202880 train.py:550] END Train loop.
I0426 13:07:13.727353 140055117031168 logging_writer.py:48] [100] collection=train accuracy=0.12926435470581055, cross_ent_loss=3456.254063, cross_ent_loss_per_all_target_tokens=0.337525, learning_rate=0.001000000280328095, learning_rate/current=0.0010000000474974513, loss=3460.679688, loss_per_all_target_tokens=0.337957, loss_per_nonpadding_target_token=5.071336, nonpadding_fraction=0.066641, timing/seconds=96.861853, timing/seqs=1000, timing/seqs_per_second=10.323982, timing/seqs_per_second_per_core=10.323982, timing/steps_per_second=1.032398, timing/target_tokens_per_second=10571.757297, timing/target_tokens_per_second_per_core=10571.757297, z_loss=4.426097, z_loss_per_all_target_tokens=0.000432
I0426 13:07:13.728666 140074531202880 train.py:565] Saving checkpoint.
I0426 13:07:13.730171 140074531202880 checkpoints.py:631] Saving checkpoint for step 100 to /gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/checkpoint_100.tmp-1650949633
Traceback (most recent call last):
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/gpfsnyu/scratch/kf2395/jukemir_t5/t5x/train.py", line 663, in <module>
    gin_utils.run(main)
  File "/gpfsnyu/scratch/kf2395/jukemir_t5/t5x/gin_utils.py", line 107, in run
    flags_parser=lambda a: app.parse_flags_with_usage(rewrite_gin_args(a)))
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/gpfsnyu/scratch/kf2395/jukemir_t5/t5x/train.py", line 641, in main
    _main(argv)
  File "/gpfsnyu/scratch/kf2395/jukemir_t5/t5x/train.py", line 661, in _main
    train_using_gin()
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/gpfsnyu/scratch/kf2395/jukemir_t5/t5x/train.py", line 568, in train
    checkpoint_cfg.save.state_transformation_fns)
  File "/gpfsnyu/scratch/kf2395/jukemir_t5/t5x/checkpoints.py", line 639, in save
    tmp_dir, train_state, concurrent_gb, state_transformation_fns)
  File "/gpfsnyu/scratch/kf2395/jukemir_t5/t5x/checkpoints.py", line 806, in _write_state_to_tensorstore
    written_state_dict = _run_future_tree(future_written_state)
  File "/gpfsnyu/scratch/kf2395/jukemir_t5/t5x/checkpoints.py", line 167, in _run_future_tree
    leaves = loop.run_until_complete(asyncio.gather(future_leaves))
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
    return future.result()
  File "/gpfsnyu/scratch/kf2395/jukemir_t5/t5x/checkpoints.py", line 770, in _write_array
    'limit': 128
ValueError: Error opening "zarr" driver: Error writing local file "/gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/checkpoint_100.tmp-1650949633/state.param_states.decoder.layers_0.pre_cross_attention_layer_norm.scale.v/.zarray": Failed to acquire lock on file: /gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/checkpoint_100.tmp-1650949633/state.param_states.decoder.layers_0.pre_cross_attention_layer_norm.scale.v/.zarray.__lock [OS error: Invalid argument] In call to configurable 'train' (<function train at 0x7f651e1e78c0>)

Thank you for your kind help!

adarob commented 2 years ago

@jbms would you mind taking a look at this?

jbms commented 2 years ago

TensorStore, which is used by t5x for checkpointing, uses POSIX file locks to coordinate concurrent access from multiple machines.

From the error message it looks like your filesystem may not support POSIX file locks. From reading the documentation of GPFS, it appears to support POSIX file locking but I think there may be an option to enable or disable it. Can you check with your cluster administrator to see whether POSIX file locking is supported?
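
In the meantime, one quick way to test the filesystem directly is the following minimal sketch (plain Python fcntl, not TensorStore code; it exercises classic POSIX record locks, which is a necessary but not sufficient condition for TensorStore's locking to work):

# lock_test.py -- hypothetical helper, run as: python lock_test.py /path/on/the/filesystem/lock_test_file
import fcntl
import sys

path = sys.argv[1]
with open(path, "w") as f:
    try:
        # Non-blocking exclusive POSIX record lock on the whole file.
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("POSIX record locking works on", path)
        fcntl.lockf(f, fcntl.LOCK_UN)
    except OSError as e:
        print("POSIX record locking failed:", e)

If this fails with a similar "Invalid argument" error, the filesystem (or kernel) is rejecting the lock request itself.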

kunfang98927 commented 2 years ago

> TensorStore, which is used by t5x for checkpointing, uses POSIX file locks to coordinate concurrent access from multiple machines.
>
> From the error message it looks like your filesystem may not support POSIX file locks. From reading the documentation of GPFS, it appears to support POSIX file locking but I think there may be an option to enable or disable it. Can you check with your cluster administrator to see whether POSIX file locking is supported?

Thank you for your kind reply. I'm checking this with the administrator and am still waiting for a response. By the way, I found that the filenames in the error message differ between training runs; I'm not sure whether this helps identify the cause of the problem...

The file paths in the error message below are different each time.

ValueError: Error opening "zarr" driver: Error writing local file "/gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/checkpoint_100.tmp-1650949633/state.param_states.decoder.layers_0.pre_cross_attention_layer_norm.scale.v/.zarray": Failed to acquire lock on file: /gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/checkpoint_100.tmp-1650949633/state.param_states.decoder.layers_0.pre_cross_attention_layer_norm.scale.v/.zarray.__lock [OS error: Invalid argument]

For example, the following file paths have appeared:

"/gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/checkpoint_100.tmp-xxxxxxxxxxxx/state.param_states.decoder.layers_0.pre_self_attention_layer_norm.scale.v/.zarray" "/gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/checkpoint_100.tmp-xxxxxxxxxxxx/state.param_states.decoder.layers_0.pre_mlp_attention_layer_norm.scale.v/.zarray" "/gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/checkpoint_100.tmp-xxxxxxxxxxxx/state.param_states.decoder.layers_1.pre_cross_attention_layer_norm.scale.v/.zarray" "/gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/checkpoint_100.tmp-xxxxxxxxxxxx/state.param_states.decoder.layers_1.pre_mlp_attention_layer_norm.scale.v/.zarray" "/gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/checkpoint_100.tmp-xxxxxxxxxxxx/state.param_states.decoder.decoder_norm.scale.v/.zarray"

kunfang98927 commented 2 years ago

I just got the administrator's reply, which suggested using the /dev/shm or /tmp folder to save the checkpoints because those folders support POSIX locks. I tried both folders, but the issue persisted.

ValueError: Error opening "zarr" driver: Error writing local file "/dev/shm/kf2395/pre/checkpoint_100.tmp-1651076047/state.param_states.decoder.layers_0.pre_cross_attention_layer_norm.scale.v/.zarray": Failed to acquire lock on file: /dev/shm/kf2395/pre/checkpoint_100.tmp-1651076047/state.param_states.decoder.layers_0.pre_cross_attention_layer_norm.scale.v/.zarray.__lock [OS error: Invalid argument]

jbms commented 2 years ago

What is the Linux kernel version of the machine you are using? You can determine that by running the uname -a command.

TensorStore uses the "open file descriptor locks" API, which was added in Linux kernel version 3.15 (released June 2014).

If you have an older kernel than that, it would explain the error --- and it may be that your cluster filesystem does support locking after all.

Presumably /dev/shm and /tmp are temporary storage local to a single machine, so if you are using multiple machines as part of a single training job, that won't work.
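
For reference, the kernel check can also be scripted (a minimal sketch, assuming the usual "major.minor.patch-..." release string format):

# Checks whether the running kernel is new enough for open file descriptor locks (Linux >= 3.15).
import platform

release = platform.release()  # e.g. "3.10.0-693.el7.x86_64"
major, minor = (int(x) for x in release.split(".")[:2])
print(release, "->", "OFD locks available" if (major, minor) >= (3, 15) else "kernel too old for OFD locks")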

kunfang98927 commented 2 years ago

Thanks a lot! Here is the output of uname -a:

Linux login2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Do you have any suggestions for successfully training a t5x model on this machine?

jbms commented 2 years ago

Easiest thing would be to upgrade the kernel, but I imagine that isn't going to happen, though you could always check with the cluster administrator. Is there a different cluster with a newer kernel that you could use instead?

Unfortunately there is no option at the moment to disable locking or use a different lock strategy, so some patching of the tensorstore code will be required.

To clarify, is a given model being trained using just a single machine, or with multiple machines in parallel (all writing to the same checkpoint)?

If the /gpfsnyu filesystem actually does support POSIX locking, then there is a fairly simple change to the tensorstore code that should allow it to work, by using the older file locking API that has some problems but will probably be okay for your use case.

laramiel commented 2 years ago

Based on the filename, I assume that this is using GPFS on a NYU high-performance cluster. GPFS does, I believe, support POSIX locks. I don't know if it has a separate API/library from the OS-provided filesystem.

You could see if there's a cloud storage option available for use.

kunfang98927 commented 2 years ago

> Easiest thing would be to upgrade the kernel, but I imagine that isn't going to happen, though you could always check with the cluster administrator. Is there a different cluster with a newer kernel that you could use instead?
>
> Unfortunately there is no option at the moment to disable locking or use a different lock strategy, so some patching of the tensorstore code will be required.
>
> To clarify, is a given model being trained using just a single machine, or with multiple machines in parallel (all writing to the same checkpoint)?
>
> If the /gpfsnyu filesystem actually does support POSIX locking, then there is a fairly simple change to the tensorstore code that should allow it to work, by using the older file locking API that has some problems but will probably be okay for your use case.

The model is trained using a single machine. Thanks for the suggestions!

jbms commented 2 years ago

You can work around this issue by checking out this repository:

git clone https://github.com/google/tensorstore

Then change the following two lines:

https://github.com/google/tensorstore/blob/fe7aae44a788dbdfc731ea1c00b83f0d51e4f2f4/tensorstore/kvstore/file/posix_file_util.cc#L66
https://github.com/google/tensorstore/blob/fe7aae44a788dbdfc731ea1c00b83f0d51e4f2f4/tensorstore/kvstore/file/posix_file_util.cc#L90

to

#if 0

in order to force it to use the flock API, which is currently used only on BSD/macOS. The flock API does not work correctly on Linux with network filesystems, but that isn't an issue if you are only using one machine.

Then you can install the Python package by going to the root directory of the checkout and running:

pip install .

jbms commented 2 years ago

Note: Building TensorStore requires GCC version 9 or newer. If your cluster only has an old version of GCC, you can instead build wheel packages for all supported Python versions using docker on another Linux machine as follows:

./tools/ci/cibuildwheel.py -- --platform linux

That will write the wheels to the dist/ directory. You can copy the package for the version of Python you are using to your cluster machine and install it there with pip install xxx.whl

kunfang98927 commented 2 years ago

> Based on the filename, I assume that this is using GPFS on a NYU high-performance cluster. GPFS does, I believe, support POSIX locks. I don't know if it has a separate API/library from the OS-provided filesystem.
>
> You could see if there's a cloud storage option available for use.

Thanks for your suggestion! I tried to use Google Cloud Storage but still ran into some problems... The bucket I created is called jukemir-t5, and I tried to save my checkpoints at gs://jukemir-t5/pretrain-model/pre/. I put the following environment variables in my ~/.bashrc file:

export TENSORSTORE_GCS_HTTP_URL='https://storage.googleapis.com/jukemir-t5/'
export GOOGLE_APPLICATION_CREDENTIALS='/gpfsnyu/scratch/kf2395/impressive-hull-347212-9ed6d5acf8ff.json'
export TENSORSTORE_CA_BUNDLE="/etc/ssl/certs/ca-bundle.crt"

And I got these error messages:

2022-04-28 20:13:46.246569: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: www.googleapis.com". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
Traceback (most recent call last):
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/gpfsnyu/scratch/kf2395/jukemir_t5/t5x/train.py", line 665, in <module>
    gin_utils.run(main)
  File "/gpfsnyu/scratch/kf2395/jukemir_t5/t5x/gin_utils.py", line 107, in run
    flags_parser=lambda a: app.parse_flags_with_usage(rewrite_gin_args(a)))
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/gpfsnyu/scratch/kf2395/jukemir_t5/t5x/train.py", line 643, in main
    _main(argv)
  File "/gpfsnyu/scratch/kf2395/jukemir_t5/t5x/train.py", line 663, in _main
    train_using_gin()
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/gpfsnyu/scratch/kf2395/jukemir_t5/t5x/train.py", line 172, in train
    tf.io.gfile.makedirs(model_dir)
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/tensorflow/python/lib/io/file_io.py", line 511, in recursive_create_dir_v2
    _pywrap_file_io.RecursivelyCreateDir(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.FailedPreconditionError: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: www.googleapis.com when reading metadata of gs://jukemir-t5/pretrain-model/pre/ In call to configurable 'train' (<function train at 0x7fc93bd1fdd0>)

I'm not familiar with GCS, and I think this error may be related to an incorrect environment variable setting. For example, I'm not sure whether I set TENSORSTORE_GCS_HTTP_URL correctly, because I have no idea how to find the required URL...

jbms commented 2 years ago

You don't need to set TENSORSTORE_GCS_HTTP_URL --- that is only for special cases if you don't want to use the normal GCS server.

The error messages shown are from tensorflow, not tensorstore --- t5x uses the tensorflow filesystem API for some purposes.

Edited: I originally missed the full text of the error message and thought it was DNS related --- it appears that tensorflow is not using the credentials specified by GOOGLE_APPLICATION_CREDENTIALS and is instead trying other methods of finding credentials. I'm not sure why that is.
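
One thing worth checking from inside the exact environment that launches training is whether the variable is even visible to the process and whether the key file parses. A minimal sketch, assuming the google-auth package that ships with google-cloud-storage is installed:

# Hypothetical sanity check for the credentials environment variable.
import os
from google.oauth2 import service_account

key_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
print("GOOGLE_APPLICATION_CREDENTIALS =", key_path)
if key_path is None:
    raise SystemExit("GOOGLE_APPLICATION_CREDENTIALS is not set in this environment")
creds = service_account.Credentials.from_service_account_file(key_path)
print("service account:", creds.service_account_email)

If the first line prints None under your job launcher, the variable simply isn't reaching the process.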

adarob commented 2 years ago

The problem seems to be that you don't have the auth tokens for the GCS bucket.

kunfang98927 commented 2 years ago

Thanks for your kind help! I read through the Cloud Storage authentication documentation but still feel confused. I tried running the following commands to get the token, but I'm not sure if this is exactly what I need...

gcloud auth activate-service-account --key-file KEY_FILE
gcloud auth print-access-token

My questions are: a) Did I get the correct auth token for the GCS bucket through the commands above? If not, how do I get the correct one? b) Once I have the correct token, where should I put it?

I gave allUsers the following permissions:

Storage Legacy Bucket Reader
Storage Legacy Bucket Writer
Storage Legacy Object Owner
Storage Object Viewer

And the following is the content of the GOOGLE_APPLICATION_CREDENTIALS file at /gpfsnyu/scratch/kf2395/impressive-hull-347212-9ed6d5acf8ff.json:

{ "type": "service_account", "project_id": "impressive-hull-347212", "private_key_id": "9ed6d5acf8ff4d840e4e912bfd266f05ef5dda36", "private_key": "-----BEGIN PRIVATE KEY-----{The private key information is omitted here}-----END PRIVATE KEY-----\n", "client_email": "939896784468-compute@developer.gserviceaccount.com", "client_id": "114542517032254052958", "auth_uri": "https://accounts.google.com/o/oauth2/auth", "token_uri": "https://oauth2.googleapis.com/token", "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/939896784468-compute%40developer.gserviceaccount.com" }

I ran the authentication verification code as follows...

>>> from google.cloud import storage
>>> storage_client = storage.Client()
>>> buckets = list(storage_client.list_buckets())
>>> print(buckets)
[<Bucket: jukemir-t5>, <Bucket: jukemir_t5_pretrain_model>]
>>> 
kunfang98927 commented 2 years ago

To simplify this issue, I did a small experiment to reproduce the error. First, I ran sbatch test_error.sh.

Here is the code of test_error.sh:

#!/bin/bash
#SBATCH -p aquila,gpu                # Partition to submit to
#SBATCH -n 1                         # Number of cores
#SBATCH -N 1                         # Number of nodes
#SBATCH --mem=100G                   # Memory pool for all cores, MB
#SBATCH -t 0-8:00
#SBATCH -o myjob.o                   # File to which STDOUT will be written
#SBATCH -e myjob.e                   # File to which STDERR will be written
#SBATCH --mail-type=ALL              # Type of email notification: BEGIN,END,FAIL,ALL
#SBATCH --mail-user=kf2395@nyu.edu   # Email to which notifications will be sent
#SBATCH --gres=gpu:4                 # Number of GPUs needed

module load cuda/11.2.2
python -c "import tensorflow as tf; tf.io.gfile.makedirs('gs://jukemir-t5/pretrain-model/pre/')"

And the error message in myjob.e is as follows, which is the same as the issue I posted before... (I think the UserWarning about cloud_tpu_init has nothing to do with this issue, so we can ignore it here...)

2022-04-29 17:33:22.818645: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: www.googleapis.com". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/jax/__init__.py:27: UserWarning: cloud_tpu_init failed: ConnectionError(MaxRetryError("HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/attributes/agent-worker-number (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f30dab30190>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))"))
 This a JAX bug; please report an issue at https://github.com/google/jax/issues
  _warn(f"cloud_tpu_init failed: {repr(exc)}\n This a JAX bug; please report "
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/tensorflow/python/lib/io/file_io.py", line 511, in recursive_create_dir_v2
    _pywrap_file_io.RecursivelyCreateDir(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.FailedPreconditionError: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: www.googleapis.com
     when reading metadata of gs://jukemir-t5/pretrain-model/pre/

An interesting thing I discovered is that if I don't submit test_error.sh to any host (such as aquila or gpu, listed in test_error.sh) and just run python -c "import tensorflow as tf; tf.io.gfile.makedirs('gs://jukemir-t5/pretrain-model/pre/')" on the default host (login2), this error disappears. (However, the login2 host doesn't have enough computing resources, so I have to use another host such as aquila...)

Although I found the source of the problem, I have no idea how to fix it...
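
To confirm this, a quick check like the following can be submitted with sbatch (a minimal sketch, separate from t5x; if name resolution fails on the compute node, any gs:// access will fail in the same way):

# dns_check.py -- hypothetical helper to compare login2 vs. aquila connectivity.
import socket

for host in ("www.googleapis.com", "storage.googleapis.com", "oauth2.googleapis.com"):
    try:
        print(host, "->", socket.gethostbyname(host))
    except OSError as e:
        print(host, "-> resolution failed:", e)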

kunfang98927 commented 2 years ago

Here is another experiment I did to confirm the source of this issue... I ran tf.io.gfile.listdir("gs://jukemir-t5/pretrain-model/") locally on the login node (host name: login2) and also submitted the same code to the aquila host, to compare the results.

This is the result of running it locally (host name: login2):

(/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7) [kf2395@login2 jukemir_t5]$ python
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/jax/__init__.py:27: UserWarning: cloud_tpu_init failed: ConnectionError(MaxRetryError("HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/attributes/agent-worker-number (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff8f32db510>: Failed to establish a new connection: [Errno -2] Name or service not known'))"))
 This a JAX bug; please report an issue at https://github.com/google/jax/issues
  _warn(f"cloud_tpu_init failed: {repr(exc)}\n This a JAX bug; please report "
>>> tf.io.gfile.listdir("gs://jukemir-t5/pretrain-model/")
['pre/', 'pretrain-model/', 'pretrain/']

The above shows that the gs://jukemir-t5/pretrain-model/ path does contain three folders. The following is the error message from running on the aquila host via the sbatch command:

2022-04-29 22:42:22.642630: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: www.googleapis.com". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/jax/__init__.py:27: UserWarning: cloud_tpu_init failed: ConnectionError(MaxRetryError("HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/attributes/agent-worker-number (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbeee9bf4d0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))"))
 This a JAX bug; please report an issue at https://github.com/google/jax/issues
  _warn(f"cloud_tpu_init failed: {repr(exc)}\n This a JAX bug; please report "
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/gpfsnyu/scratch/kf2395/.cache/env/tf2-gpu-py3.7/lib/python3.7/site-packages/tensorflow/python/lib/io/file_io.py", line 769, in list_directory_v2
    message="Could not find directory {}".format(path))
tensorflow.python.framework.errors_impl.NotFoundError: Could not find directory gs://jukemir-t5/pretrain-model

jbms commented 2 years ago

.bashrc is only evaluated for interactive shell sessions. You need to arrange for the GOOGLE_APPLICATION_CREDENTIALS environment variable to be set when running as a batch job.
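
For example, you could export it in the sbatch script itself (an export line before the python command), or set it in the training process before TensorFlow makes its first GCS request. A minimal sketch of the latter, assuming the key path from your earlier comment:

# Hypothetical workaround: set the variable in-process, before the first gs:// access.
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = (
    "/gpfsnyu/scratch/kf2395/impressive-hull-347212-9ed6d5acf8ff.json"
)

import tensorflow as tf  # imported after the variable is set
tf.io.gfile.makedirs("gs://jukemir-t5/pretrain-model/pre/")

Note that even with the credentials set, the compute node still needs DNS/network access to reach storage.googleapis.com.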

kunfang98927 commented 2 years ago

I added #SBATCH --export=ALL but it still didn't work...

ibulu commented 2 years ago

experiencing the same issue on an Azure VM

ibulu commented 2 years ago

> experiencing the same issue on an Azure VM

The suggestion to use the /tmp folder as model_dir fixes this issue... at least for Azure VMs.

jbms commented 2 years ago

What is the filesystem you are using on the Azure VMs?

ibulu commented 2 years ago

> What is the filesystem you are using on the Azure VMs?

I believe it is ext4.

laramiel commented 2 years ago

It could be that the GCS filesystem is not installed in tensorflow. I don't know how that would happen, but see:

https://pypi.org/project/tensorflow-io-gcs-filesystem/

jbms commented 2 years ago

Oh yes, sorry I misunderstood. I thought you were referring to the locking-related error.

ibulu commented 2 years ago

> Oh yes, sorry I misunderstood. I thought you were referring to the locking-related error.

In my case, it was definitely the locking-related error... and simply using the /tmp folder (as suggested above) fixed the issue for me.

jbms commented 2 years ago

Can you provide some information then about the Azure VM --- the machine configuration, image (if public) or otherwise Linux distribution version/kernel version --- so that we can look into the locking issue?

kunfang98927 commented 2 years ago

> Oh yes, sorry I misunderstood. I thought you were referring to the locking-related error.
>
> In my case, it was definitely the locking-related error... and simply using the /tmp folder (as suggested above) fixed the issue for me.

I tried saving checkpoints to both the /dev/shm and /tmp folders but still hit the locking-related error...

jbms commented 2 years ago

@Monalisa98927 With your kernel version, it is a kernel limitation, so it won't make any difference which local filesystem you use.

ibulu commented 2 years ago

> Can you provide some information then about the Azure VM --- the machine configuration, image (if public) or otherwise Linux distribution version/kernel version --- so that we can look into the locking issue?

It is a Standard_NC12s_v3 instance that I created through Azure ML Studio. Here is all the info I could find:

Operating System: Ubuntu 18.04.6 LTS
Kernel: Linux 5.4.0-1077-azure
Architecture: x86-64