oracle / ocifs

ocifs provides a POSIX-compatible API wrapping Oracle Cloud Infrastructure's (OCI) Object Storage. ocifs is a Python library that relies on the fsspec framework.
https://ocifs.readthedocs.io/en/latest/
Universal Permissive License v1.0

OCIFile.read() AttributeError #39

Open Skylion007 opened 5 months ago

Skylion007 commented 5 months ago

This one is really confusing to me. When using PyTorch Lightning and loading checkpoints from OCI, I occasionally hit this weird error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/text-diffusion/main.py", line 155, in <module>
    main()
  File "/usr/lib/python3/dist-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/usr/lib/python3/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/lib/python3/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/lib/python3/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/lib/python3/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/usr/lib/python3/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/lib/python3/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/text-diffusion/main.py", line 151, in main
    _train(config, logger, tokenizer)
  File "/text-diffusion/main.py", line 135, in _train
    trainer.fit(model, train_ds, valid_ds, ckpt_path=ckpt_path)
  File "/usr/lib/python3/dist-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/usr/lib/python3/dist-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/lib/python3/dist-packages/lightning/pytorch/trainer/trainer.py", line 956, in _run
    self._checkpoint_connector._restore_modules_and_callbacks(ckpt_path)
  File "/usr/lib/python3/dist-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 397, in _restore_modules_and_callbacks
    self.resume_start(checkpoint_path)
  File "/usr/lib/python3/dist-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 79, in resume_start
    loaded_checkpoint = self.trainer.strategy.load_checkpoint(checkpoint_path)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/lightning/pytorch/strategies/strategy.py", line 368, in load_checkpoint
    return self.checkpoint_io.load_checkpoint(checkpoint_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/lightning/fabric/plugins/io/torch_io.py", line 83, in load_checkpoint
    return pl_load(path, map_location=map_location)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/lightning/fabric/utilities/cloud_io.py", line 57, in _load
    return torch.load(f, map_location=map_location)  # type: ignore[arg-type]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/torch/serialization.py", line 1026, in load
    return _load(opened_zipfile,
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/torch/serialization.py", line 1438, in _load
    result = unpickler.load()
             ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/pickle.py", line 1213, in load
    dispatch[key[0]](self)
  File "/usr/lib/python3.11/pickle.py", line 1254, in load_binpersid
    self.append(self.persistent_load(pid))
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/torch/serialization.py", line 1408, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/torch/serialization.py", line 1373, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage)._typed_storage()._untyped_storage
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'OCIFile' object has no attribute 'read'

This confuses me, since read() should be provided by fsspec. That makes me think some underlying failure is being handled incorrectly and surfacing as an AttributeError. I only hit this on certain runs, which might indicate that the file or file path is otherwise corrupted. Should this be a FileNotFoundError or some other error instead?
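One way to narrow this down is to record a few facts about the file object at the point of failure, for example in an `except AttributeError` block around the checkpoint load. The following is a minimal, standard-library-only sketch. `describe_file_object` is a hypothetical helper name, not part of ocifs, fsspec, or PyTorch, and it is exercised here with `io.BytesIO` as a stand-in for an `OCIFile`.

```python
import io


def describe_file_object(f):
    """Collect basic facts about a file-like object for a bug report.

    Hypothetical diagnostic helper; it only inspects the object and
    never reads from it.
    """
    cls = type(f)
    return {
        # Fully qualified class name, e.g. to confirm which class was opened.
        "class": f"{cls.__module__}.{cls.__qualname__}",
        # Inheritance chain, to see which base class should supply read().
        "mro": [c.__qualname__ for c in cls.__mro__],
        # Is read() defined anywhere in the class hierarchy?
        "has_read_on_class": any("read" in vars(c) for c in cls.__mro__),
        # Does a normal attribute lookup on the instance succeed right now?
        "has_read_on_instance": hasattr(f, "read"),
        # Closed state, if the object exposes one.
        "closed": getattr(f, "closed", None),
    }


if __name__ == "__main__":
    # io.BytesIO stands in for the remote file object in this sketch.
    buf = io.BytesIO(b"checkpoint bytes")
    print(describe_file_object(buf))
```

If `has_read_on_class` comes back true for the failing object, the method was not actually missing from the class, which would point away from a packaging problem and toward something going wrong during the attribute lookup itself.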

Here are my installed packages in this run:

Requirement already satisfied: certifi in /usr/lib/python3/dist-packages (from oci>=2.43.1->ocifs) (2024.2.2)
Installing collected packages: webencodings, redo, pytz, pathtools, nvidia-ml-py, incremental, git-lfs, fastjsonschema, docopt, circuitbreaker, argparse, antlr4-python3-runtime, zope-interface, xxhash, websocket-client, webcolors, uri-template, tzdata, types-python-dateutil, tqdm, tornado, tinycss2, threadpoolctl, termcolor, soupsieve, sniffio, smmap, simplejson, shortuuid, setproctitle, sentry-sdk, send2trash, scipy, safetensors, rpds-py, rfc3986-validator, rfc3339-validator, regex, pyzmq, pyyaml, python-json-logger, python-dateutil, pyparsing, pycparser, pyarrow-hotfix, pyarrow, psutil, protobuf, promise, prometheus-client, platformdirs, pandocfilters, packaging, overrides, orderedmultidict, nest-asyncio, multidict, mistune, mdurl, kiwisolver, jupyterlab-pygments, jsonpointer, json5, joblib, hyperlink, hf_transfer, h5py, h11, greenlet, fsspec, frozenlist, fqdn, fonttools, docker-pycreds, dill, defusedxml, debugpy, cycler, contourpy, constantly, comm, Click, charset-normalizer, cachetools, bleach, babel, attrs, async-lru, yarl, terminado, sqlalchemy, scikit-learn, requests, referencing, pandas, omegaconf, nvitop, multiprocess, matplotlib, markdown-it-py, lightning-utilities, jupyter-core, httpcore, gitdb, furl, cffi, beautifulsoup4, automat, arrow, anyio, aiosignal, twisted, seaborn, rich, jupyter-server-terminals, jupyter-client, jsonschema-specifications, isoduration, hydra-core, huggingface-hub, httpx, GitPython, cryptography, argon2-cffi-bindings, aiohttp, wandb, torchmetrics, tokenizers, pyOpenSSL, jsonschema, ipykernel, ipdb, flash-attn, buildtools, argon2-cffi, transformers, pytorch-lightning, oci, nbformat, datasets, causal-conv1d, ocifs, nbclient, mamba-ssm, lightning, jupyter-events, nbconvert, jupyter-server, notebook-shim, jupyterlab-server, jupyter-lsp, jupyterlab, notebook
Successfully installed Click-8.1.7 GitPython-3.1.43 aiohttp-3.9.4 aiosignal-1.3.1 antlr4-python3-runtime-4.9.3 anyio-4.3.0 argon2-cffi-23.1.0 argon2-cffi-bindings-21.2.0 argparse-1.4.0 arrow-1.3.0 async-lru-2.0.4 attrs-23.2.0 automat-22.10.0 babel-2.14.0 beautifulsoup4-4.12.3 bleach-6.1.0 buildtools-1.0.6 cachetools-5.3.3 causal-conv1d-1.1.3.post1 cffi-1.16.0 charset-normalizer-3.3.2 circuitbreaker-1.4.0 comm-0.2.2 constantly-23.10.4 contourpy-1.2.1 cryptography-42.0.5 cycler-0.12.1 datasets-2.18.0 debugpy-1.8.1 defusedxml-0.7.1 dill-0.3.8 docker-pycreds-0.4.0 docopt-0.6.2 fastjsonschema-2.19.1 flash-attn-2.5.6 fonttools-4.51.0 fqdn-1.5.1 frozenlist-1.4.1 fsspec-2024.2.0 furl-2.1.3 git-lfs-1.6 gitdb-4.0.11 greenlet-3.0.3 h11-0.14.0 h5py-3.10.0 hf_transfer-0.1.6 httpcore-1.0.5 httpx-0.27.0 huggingface-hub-0.22.2 hydra-core-1.3.2 hyperlink-21.0.0 incremental-22.10.0 ipdb-0.13.13 ipykernel-6.29.4 isoduration-20.11.0 joblib-1.4.0 json5-0.9.24 jsonpointer-2.4 jsonschema-4.21.1 jsonschema-specifications-2023.12.1 jupyter-client-8.6.1 jupyter-core-5.7.2 jupyter-events-0.10.0 jupyter-lsp-2.2.5 jupyter-server-2.14.0 jupyter-server-terminals-0.5.3 jupyterlab-4.1.6 jupyterlab-pygments-0.3.0 jupyterlab-server-2.26.0 kiwisolver-1.4.5 lightning-2.2.1 lightning-utilities-0.11.2 mamba-ssm-1.1.4 markdown-it-py-3.0.0 matplotlib-3.8.4 mdurl-0.1.2 mistune-3.0.2 multidict-6.0.5 multiprocess-0.70.16 nbclient-0.10.0 nbconvert-7.16.3 nbformat-5.10.4 nest-asyncio-1.6.0 notebook-7.1.1 notebook-shim-0.2.4 nvidia-ml-py-12.535.133 nvitop-1.3.2 oci-2.125.2 ocifs-1.3.1 omegaconf-2.3.0 orderedmultidict-1.0.1 overrides-7.7.0 packaging-23.2 pandas-2.2.1 pandocfilters-1.5.1 pathtools-0.1.2 platformdirs-4.2.0 prometheus-client-0.20.0 promise-2.3 protobuf-4.25.3 psutil-5.9.8 pyOpenSSL-24.1.0 pyarrow-15.0.2 pyarrow-hotfix-0.6 pycparser-2.22 pyparsing-3.1.2 python-dateutil-2.9.0.post0 python-json-logger-2.0.7 pytorch-lightning-2.2.1 pytz-2024.1 pyyaml-6.0.1 pyzmq-25.1.2 redo-2.0.4 referencing-0.34.0 
regex-2023.12.25 requests-2.31.0 rfc3339-validator-0.1.4 rfc3986-validator-0.1.1 rich-13.7.1 rpds-py-0.18.0 safetensors-0.4.2 scikit-learn-1.4.0 scipy-1.13.0 seaborn-0.13.2 send2trash-1.8.3 sentry-sdk-1.45.0 setproctitle-1.3.3 shortuuid-1.0.13 simplejson-3.19.2 smmap-5.0.1 sniffio-1.3.1 soupsieve-2.5 sqlalchemy-2.0.29 termcolor-2.4.0 terminado-0.18.1 threadpoolctl-3.4.0 tinycss2-1.2.1 tokenizers-0.15.2 torchmetrics-1.3.2 tornado-6.4 tqdm-4.66.2 transformers-4.38.2 twisted-24.3.0 types-python-dateutil-2.9.0.20240316 tzdata-2024.1 uri-template-1.3.0 wandb-0.13.5 webcolors-1.13 webencodings-0.5.1 websocket-client-1.7.0 xxhash-3.4.1 yarl-1.9.4 zope-interface-6.2

ahosler commented 5 months ago

Hey @Skylion007, thanks for raising this issue! Do you have a reproducible code snippet I could use?

Skylion007 commented 4 months ago

It appears as though this might be a transient error that occurs when an interrupt is sent. Looking into it further.