google / paxml

Pax is a Jax-based machine learning framework for training large-scale models. Pax allows for advanced and fully configurable experimentation and parallelization, and has demonstrated industry-leading model flop utilization rates.
Apache License 2.0

Error running Common Crawl example #11

Closed RobertLiJN closed 1 year ago

RobertLiJN commented 1 year ago

Sorry to interrupt! When I run

python3 .local/lib/python3.8/site-packages/paxml/main.py \
--exp=tasks.lm.params.c4.C4Spmd1BAdam4Replicas \
--job_log_dir=gs://<your-bucket>

from the examples, I get the following error, which seems to suggest that I cannot load from the bucket configured in c4.py:

Traceback (most recent call last):
  File ".local/lib/python3.8/site-packages/paxml/main.py", line 407, in <module>
    app.run(main, flags_parser=absl_flags.flags_parser)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File ".local/lib/python3.8/site-packages/paxml/main.py", line 382, in main
    run(experiment_config=experiment_config,
  File ".local/lib/python3.8/site-packages/paxml/main.py", line 336, in run
    search_space = tuning_lib.get_search_space(experiment_config)
  File "/home/robertli/.local/lib/python3.8/site-packages/paxml/tuning_lib.py", line 81, in get_search_space
    search_space = pg.hyper.trace(inspect_search_space, require_hyper_name=True)
  File "/home/robertli/.local/lib/python3.8/site-packages/pyglove/core/hyper/dynamic_evaluation.py", line 586, in trace
    fun()
  File "/home/robertli/.local/lib/python3.8/site-packages/paxml/tuning_lib.py", line 77, in inspect_search_space
    _ = instantiate(d)
  File "/home/robertli/.local/lib/python3.8/site-packages/praxis/base_hyperparams.py", line 1103, in instantiate
    return config.Instantiate(**kwargs)
  File "/home/robertli/.local/lib/python3.8/site-packages/praxis/base_hyperparams.py", line 601, in Instantiate
    return self.cls(self, **kwargs)
  File "/home/robertli/.local/lib/python3.8/site-packages/paxml/seqio_input.py", line 443, in __init__
    self._dataset = self._get_dataset()
  File "/home/robertli/.local/lib/python3.8/site-packages/paxml/seqio_input.py", line 551, in _get_dataset
    ds = self._get_backing_ds(
  File "/home/robertli/.local/lib/python3.8/site-packages/paxml/seqio_input.py", line 686, in _get_backing_ds
    ds = self.mixture_or_task.get_dataset(
  File "/home/robertli/.local/lib/python3.8/site-packages/seqio/dataset_providers.py", line 1205, in get_dataset
    len(self.source.list_shards(split=split)) >= shard_info.num_shards)
  File "/home/robertli/.local/lib/python3.8/site-packages/seqio/dataset_providers.py", line 455, in list_shards
    return [_get_filename(info) for info in self.tfds_dataset.files(split)]
  File "/home/robertli/.local/lib/python3.8/site-packages/seqio/utils.py", line 152, in files
    split_info = self.builder.info.splits[split]
  File "/home/robertli/.local/lib/python3.8/site-packages/seqio/utils.py", line 129, in builder
    LazyTfdsLoader._MEMOIZED_BUILDERS[builder_key] = tfds.builder(
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/home/robertli/.local/lib/python3.8/site-packages/tensorflow_datasets/core/logging/__init__.py", line 169, in __call__
    return function(*args, **kwargs)
  File "/home/robertli/.local/lib/python3.8/site-packages/tensorflow_datasets/core/load.py", line 202, in builder
    return read_only_builder.builder_from_files(str(name), **builder_kwargs)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/home/robertli/.local/lib/python3.8/site-packages/tensorflow_datasets/core/read_only_builder.py", line 259, in builder_from_files
    builder_dir = _find_builder_dir(name, **builder_kwargs)
  File "/home/robertli/.local/lib/python3.8/site-packages/tensorflow_datasets/core/read_only_builder.py", line 327, in _find_builder_dir
    builder_dir = _find_builder_dir_single_dir(
  File "/home/robertli/.local/lib/python3.8/site-packages/tensorflow_datasets/core/read_only_builder.py", line 417, in _find_builder_dir_single_dir
    found_version_str = _get_version_str(
  File "/home/robertli/.local/lib/python3.8/site-packages/tensorflow_datasets/core/read_only_builder.py", line 484, in _get_version_str
    all_versions = version_lib.list_all_versions(os.fspath(builder_dir))
  File "/home/robertli/.local/lib/python3.8/site-packages/tensorflow_datasets/core/utils/version.py", line 193, in list_all_versions
    if not root_dir.exists():
  File "/home/robertli/.local/lib/python3.8/site-packages/etils/epath/gpath.py", line 130, in exists
    return self._backend.exists(self._path_str)
  File "/home/robertli/.local/lib/python3.8/site-packages/etils/epath/backend.py", line 204, in exists
    return self.gfile.exists(path)
  File "/home/robertli/.local/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 288, in file_exists_v2
    _pywrap_file_io.FileExists(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.PermissionDeniedError: Error executing an HTTP request: HTTP response code 403 with body '{
  "error": {
    "code": 403,
    "message": "991053624826-compute@developer.gserviceaccount.com does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).",
    "errors": [
      {
        "message": "991053624826-compute@developer.gserviceaccount.com does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist)."'
     when reading metadata of gs://mlperf-llm-public2/c4/en

I wonder if this is because I haven't configured something correctly, since the bucket appears to be a public one.

I tried using the TFDS default bucket (gs://tfds-data/datasets) instead of gs://mlperf-llm-public2, and the permission error goes away, but then I have to choose among the available versions of c4 (which do not include 3.0.4). Even then, I cannot proceed, because it gives me a different error.
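For context, here is a small sketch (my own reconstruction, not actual Pax/TFDS code) of the directory layout TFDS probes: the traceback ends in `version_lib.list_all_versions`, which lists versions under `<data_dir>/<name>/<config>/`, so even checking whether the dataset exists requires read access to the bucket.

```python
import os

# Rough reconstruction (not TFDS internals) of the builder directory that
# tfds.builder() probes for prepared data: <data_dir>/<name>/<config>/<version>.
# Enumerating versions under this prefix needs storage.objects.get/list on the
# bucket, which is exactly where the 403 above is raised.
def builder_dir(data_dir: str, name: str, config: str, version: str) -> str:
    return os.path.join(data_dir, name, config, version)

print(builder_dir("gs://mlperf-llm-public2", "c4", "en", "3.0.4"))
# gs://mlperf-llm-public2/c4/en/3.0.4
```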

Thanks in advance for your attention and help!

mathemakitten commented 1 year ago

Just a quick note that I don't think the perms on gs://mlperf-llm-public2 are configured properly for public access — I can access buckets like gs://t5-data/vocabs/ with no problem, but not this one. I get a similar error to the one above when trying to grab the spm file per the README (gs://mlperf-llm-public2/vocab/c4_en_301_5Mexp2_spm.model).
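One way to confirm this is an IAM/ACL problem on the bucket rather than a local misconfiguration is to look at the principal and permission named in the 403 body. A hedged sketch, parsing a minimal re-completed copy of the (truncated) JSON error body quoted above:

```python
import json

# Minimal copy of the 403 body from the traceback above, re-completed into
# valid JSON for illustration (the original is cut off mid-string).
body = json.loads("""{
  "error": {
    "code": 403,
    "message": "991053624826-compute@developer.gserviceaccount.com does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist)."
  }
}""")

# The message names both the denied principal (the VM's default compute
# service account) and the missing permission, so the fix has to happen on
# the bucket's IAM policy, not on the client.
msg = body["error"]["message"]
principal, _, rest = msg.partition(" does not have ")
permission = rest.split()[0]
print(principal)   # 991053624826-compute@developer.gserviceaccount.com
print(permission)  # storage.objects.get
```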

zhangqiaorjc commented 1 year ago

@mathemakitten Sorry for the late reply; the mlperf-llm-public2 bucket isn't supposed to be public yet.