pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

`v2.1.2+cu118` and `v2.1.1+cu118` run into torchdata `ImportError: libssl.so.3: cannot open shared object file: No such file or directory`, that `v2.1.0+cu118` doesn't have an issue with #1220

Open justinxzhao opened 9 months ago

justinxzhao commented 9 months ago

🐛 Describe the bug

We are seeing a strange error specifically with torch 2.1.1+cu118 and torch 2.1.2+cu118 that does not occur with torch 2.1.0+cu118.

The error looks like this:

Traceback (most recent call last):
    from ludwig.api import LudwigModel
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/api.py", line 41, in <module>
    from ludwig.backend import Backend, initialize_backend, provision_preprocessing_workers
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/backend/__init__.py", line 22, in <module>
    from ludwig.backend.base import Backend, LocalBackend
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/backend/base.py", line 34, in <module>
    from ludwig.data.cache.manager import CacheManager
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/data/cache/manager.py", line 8, in <module>
    from ludwig.data.dataset.base import DatasetManager
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/data/dataset/base.py", line 24, in <module>
    from ludwig.distributed import DistributedStrategy
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/distributed/__init__.py", line 3, in <module>
    from ludwig.distributed.base import DistributedStrategy, LocalStrategy
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/distributed/base.py", line 11, in <module>
    from ludwig.modules.optimization_modules import create_optimizer
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/modules/optimization_modules.py", line 21, in <module>
    from ludwig.utils.torch_utils import LudwigModule
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/utils/torch_utils.py", line 14, in <module>
    from ludwig.utils.strings_utils import SpecialSymbol
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/utils/strings_utils.py", line 33, in <module>
    from ludwig.utils.tokenizers import get_tokenizer_from_registry
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/utils/tokenizers.py", line 21, in <module>
    import torchtext
  File "/home/ray/anaconda3/lib/python3.8/site-packages/torchtext/__init__.py", line 12, in <module>
    from . import data, datasets, prototype, functional, models, nn, transforms, utils, vocab, experimental
  File "/home/ray/anaconda3/lib/python3.8/site-packages/torchtext/datasets/__init__.py", line 3, in <module>
    from .ag_news import AG_NEWS
  File "/home/ray/anaconda3/lib/python3.8/site-packages/torchtext/datasets/ag_news.py", line 5, in <module>
    from torchdata.datapipes.iter import FileOpener, IterableWrapper
  File "/home/ray/anaconda3/lib/python3.8/site-packages/torchdata/__init__.py", line 7, in <module>
    from torchdata import _extension  # noqa: F401
  File "/home/ray/anaconda3/lib/python3.8/site-packages/torchdata/_extension.py", line 34, in <module>
    _init_extension()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/torchdata/_extension.py", line 31, in _init_extension
    from torchdata import _torchdata as _torchdata
ImportError: libssl.so.3: cannot open shared object file: No such file or directory

The failure comes from importing torchdata's compiled extension (`torchdata._torchdata`), which cannot find `libssl.so.3`. torchdata also seems to install with urllib3>2.0.
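
As a quick way to confirm whether the dynamic loader can find the library named in the error, a check along these lines may help. This is only a diagnostic sketch: the helper name `can_load_shared_library` is made up for illustration, and `libssl.so.1.1` is listed purely as an assumed alternative that some older images ship.

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        # CDLL raises OSError when the library cannot be found or opened.
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # libssl.so.3 is the library from the traceback; libssl.so.1.1 is an
    # assumed alternative, included only for comparison.
    for lib in ("libssl.so.3", "libssl.so.1.1"):
        print(f"{lib}: loadable={can_load_shared_library(lib)}")
```

If `libssl.so.3` is reported as not loadable, the environment likely lacks OpenSSL 3 on the loader path, which would be consistent with the `ImportError` above.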

When we install urllib3==1.26.16 to try to mitigate the libssl.so error, we get a different error:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1382, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/home/ray/anaconda3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/ray/anaconda3/lib/python3.8/site-packages/transformers/generation/utils.py", line 28, in <module>
    from ..integrations.deepspeed import is_deepspeed_zero3_enabled
  File "/home/ray/anaconda3/lib/python3.8/site-packages/transformers/integrations/deepspeed.py", line 49, in <module>
    from accelerate.utils.deepspeed import HfDeepSpeedConfig as DeepSpeedConfig
  File "/home/ray/anaconda3/lib/python3.8/site-packages/accelerate/__init__.py", line 3, in <module>
    from .accelerator import Accelerator
  File "/home/ray/anaconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 35, in <module>
    from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
  File "/home/ray/anaconda3/lib/python3.8/site-packages/accelerate/checkpointing.py", line 24, in <module>
    from .utils import (
  File "/home/ray/anaconda3/lib/python3.8/site-packages/accelerate/utils/__init__.py", line 153, in <module>
    from .launch import (
  File "/home/ray/anaconda3/lib/python3.8/site-packages/accelerate/utils/launch.py", line 24, in <module>
    from ..commands.config.config_args import SageMakerConfig
  File "/home/ray/anaconda3/lib/python3.8/site-packages/accelerate/commands/config/__init__.py", line 19, in <module>
    from .config import config_command_parser
  File "/home/ray/anaconda3/lib/python3.8/site-packages/accelerate/commands/config/config.py", line 25, in <module>
    from .sagemaker import get_sagemaker_input
  File "/home/ray/anaconda3/lib/python3.8/site-packages/accelerate/commands/config/sagemaker.py", line 35, in <module>
    import boto3  # noqa: F401
  File "/home/ray/anaconda3/lib/python3.8/site-packages/boto3/__init__.py", line 17, in <module>
    from boto3.session import Session
  File "/home/ray/anaconda3/lib/python3.8/site-packages/boto3/session.py", line 17, in <module>
    import botocore.session
  File "/home/ray/anaconda3/lib/python3.8/site-packages/botocore/session.py", line 26, in <module>
    import botocore.client
  File "/home/ray/anaconda3/lib/python3.8/site-packages/botocore/client.py", line 15, in <module>
    from botocore import waiter, xform_name
  File "/home/ray/anaconda3/lib/python3.8/site-packages/botocore/waiter.py", line 18, in <module>
    from botocore.docs.docstring import WaiterDocstring
  File "/home/ray/anaconda3/lib/python3.8/site-packages/botocore/docs/__init__.py", line 15, in <module>
    from botocore.docs.service import ServiceDocumenter
  File "/home/ray/anaconda3/lib/python3.8/site-packages/botocore/docs/service.py", line 14, in <module>
    from botocore.docs.client import ClientDocumenter, ClientExceptionsDocumenter
  File "/home/ray/anaconda3/lib/python3.8/site-packages/botocore/docs/client.py", line 14, in <module>
    from botocore.docs.example import ResponseExampleDocumenter
  File "/home/ray/anaconda3/lib/python3.8/site-packages/botocore/docs/example.py", line 13, in <module>
    from botocore.docs.shape import ShapeDocumenter
  File "/home/ray/anaconda3/lib/python3.8/site-packages/botocore/docs/shape.py", line 19, in <module>
    from botocore.utils import is_json_value_header
  File "/home/ray/anaconda3/lib/python3.8/site-packages/botocore/utils.py", line 34, in <module>
    import botocore.httpsession
  File "/home/ray/anaconda3/lib/python3.8/site-packages/botocore/httpsession.py", line 21, in <module>
    from urllib3.util.ssl_ import (
ImportError: cannot import name 'DEFAULT_CIPHERS' from 'urllib3.util.ssl_' (/home/ray/anaconda3/lib/python3.8/site-packages/urllib3/util/ssl_.py)

This suggests a different incompatibility (perhaps from deepspeed?).
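
To see whether the urllib3 that actually gets imported exposes the name botocore is asking for, a small check like the following could be run in the same environment. It is a sketch under assumptions: `module_has_attribute` is a hypothetical helper, not part of urllib3 or botocore.

```python
import importlib


def module_has_attribute(module_name: str, attribute: str) -> bool:
    """Return True if `module_name` imports cleanly and defines `attribute`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is missing or fails to import.
        return False
    return hasattr(module, attribute)


if __name__ == "__main__":
    # The module and name below are taken from the traceback above.
    found = module_has_attribute("urllib3.util.ssl_", "DEFAULT_CIPHERS")
    print(f"urllib3.util.ssl_.DEFAULT_CIPHERS available: {found}")
```

A `False` result here would point at the installed urllib3 rather than at deepspeed, since the import fails inside `botocore/httpsession.py`.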

In any case, it seems that torch 2.1.0+cu118 either doesn't require the newest version of torchdata or works with urllib3==1.26.16 (or both), which appears to mitigate our issues.

However, the errors with 2.1.1+cu118 and 2.1.2+cu118 seemed weird to me, so I'm raising them here in case anyone has any helpful tidbits!

Versions

2.1.0+cu118 (works)
2.1.1+cu118 (broken)
2.1.2+cu118 (broken)
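
For comparing a working and a broken environment, it may help to record the installed versions of the packages that appear in the tracebacks. The following is a minimal sketch using the standard-library `importlib.metadata` (available from Python 3.8); the helper name `installed_versions` and the package list are assumptions chosen for illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package list assumed from the tracebacks in this report.
    names = ["torch", "torchdata", "torchtext", "urllib3", "botocore", "boto3"]
    for package, version in installed_versions(names).items():
        print(f"{package}: {version or 'not installed'}")
```

Running this in both environments and diffing the output should show which of torchdata or urllib3 changed between the working and broken setups.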

malfet commented 9 months ago

Transferring to the torchdata project, though please note that it's not really maintained by anyone right now