microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] ImportError: /root/.cache/torch_extensions/py38_cu117/utils/utils.so: cannot open shared object file: No such file or directory #3356

Closed simajiucai closed 1 year ago

simajiucai commented 1 year ago

I am trying to use Accelerate and Deepspeed for training, but I encountered the following error:

ImportError: /root/.cache/torch_extensions/py38_cu117/utils/utils.so: cannot open shared object file: No such file or directory

My Accelerate config:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  zero3_init_flag: true
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_config: {}
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

and my ds_report:

op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch version .................... 2.0.0+cu117
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7

Here is a toy script that you can run directly with accelerate launch --mixed_precision="fp16" train_toy.py:

#!/usr/bin/env python
# coding=utf-8

from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor

import datasets as datasets_1
import torch
import torch.utils.checkpoint
import transformers
from accelerate import Accelerator
from accelerate.logging import get_logger
from accelerate.utils import ProjectConfiguration

from transformers import CLIPTextModel, CLIPTokenizer
from accelerate.utils import DummyOptim
import diffusers
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from diffusers.utils import check_min_version

# Will error if the minimal version of diffusers is not installed. Remove at your own risk.
check_min_version("0.15.0.dev0")

logger = get_logger(__name__, log_level="INFO")

dataset_name_mapping = {
    "lambdalabs/pokemon-blip-captions": ("image", "text"),
}

def main():

    accelerator_project_config = ProjectConfiguration(total_limit=None)

    accelerator = Accelerator(
        gradient_accumulation_steps=4,
        mixed_precision=None,
        project_config=accelerator_project_config,
    )

    if accelerator.is_local_main_process:
        datasets_1.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_warning()
        diffusers.utils.logging.set_verbosity_info()
    else:
        datasets_1.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()
        diffusers.utils.logging.set_verbosity_error()

    # Load scheduler, tokenizer and models.
    pretrained_model_name_or_path = "stabilityai/stable-diffusion-2-1"

    text_encoder = CLIPTextModel.from_pretrained(
        pretrained_model_name_or_path, subfolder="text_encoder"
    )

    vae = AutoencoderKL.from_pretrained(pretrained_model_name_or_path, subfolder="vae")
    unet = UNet2DConditionModel.from_pretrained(
        pretrained_model_name_or_path, subfolder="unet"
    )

    optimizer_cls = (
        torch.optim.AdamW
        if accelerator.state.deepspeed_plugin is None
           or "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config
        else DummyOptim
    )
    # optimizer_cls = deepspeed.ops.adam.DeepSpeedCPUAdam
    optimizer = optimizer_cls(
        text_encoder.parameters(),
        lr=0.0001,
        betas=(0.9, 0.999),
        weight_decay=0.0001,
        eps=0.00000001,
    )
    test_data = datasets.FashionMNIST(
        root="data",
        train=False,
        download=True,
        transform=ToTensor()
    )
    train_dataloader = torch.utils.data.DataLoader(test_data, batch_size=2)

    unet, vae, text_encoder, optimizer, train_dataloader = accelerator.prepare(
        unet, vae, text_encoder, optimizer, train_dataloader
    )

if __name__ == "__main__":
    main()

The complete error message is:

Traceback (most recent call last):
  File "train_toy.py", line 106, in <module>
    main()
  File "train_toy.py", line 101, in main
    unet, vae, text_encoder, optimizer,train_dataloader = accelerator.prepare(
  File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/accelerate/accelerator.py", line 1090, in prepare
    result = self._prepare_deepspeed(*args)
  File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/accelerate/accelerator.py", line 1367, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1167, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1398, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 154, in __init__
    util_ops = UtilsBuilder().load()
  File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
  File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py38_cu117/utils/utils.so: cannot open shared object file: No such file or directory

ryuzakace commented 1 year ago

I am facing a similar issue: ImportError: /root/.cache/torch_extensions/py38_cu102/transformer_inference/transformer_inference.so: cannot open shared object file: No such file or directory

Were you able to resolve it?

tjruwase commented 1 year ago

This problem could be because the extensions folder is located in /root, which is privileged. Can you try using /tmp instead by setting export TORCH_EXTENSIONS_DIR=/tmp
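Concretely, the suggestion above can be applied before launching; /tmp/torch_extensions here is just an example location, any directory the training processes can write to works:

```shell
# Point the torch extensions cache at a writable, non-privileged location
export TORCH_EXTENSIONS_DIR=/tmp/torch_extensions
mkdir -p "$TORCH_EXTENSIONS_DIR"
```

Then launch as usual, e.g. accelerate launch --mixed_precision="fp16" train_toy.py, and the JIT-built utils.so will land under the new cache directory.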

jockeyyan commented 1 year ago

I am facing a similar issue: ImportError: /root/.cache/torch_extensions/py38_cu102/transformer_inference/transformer_inference.so: cannot open shared object file: No such file or directory

Were you able to resolve it?

Same as the above issues. I checked DeepSpeed with ds_report; maybe you should install DeepSpeed with pre-built ops rather than relying on JIT mode.
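A sketch of what a pre-built install looks like, using DeepSpeed's DS_BUILD_* environment variables; this assumes a CUDA toolchain matching your torch build is available, since the ops are compiled at install time instead of at first use:

```shell
# Reinstall DeepSpeed with the failing op compiled ahead of time instead of JIT.
# DS_BUILD_UTILS=1 builds the utils op; DS_BUILD_OPS=1 would build all compatible ops.
DS_BUILD_UTILS=1 pip install deepspeed --no-cache-dir --force-reinstall
```

After reinstalling, ds_report should show [YES] in the "installed" column for the pre-built op, and no JIT compile happens at runtime.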

feimadecaogaozhi commented 1 year ago

I am facing a similar issue: ImportError: /root/.cache/torch_extensions/py38_cu102/transformer_inference/transformer_inference.so: cannot open shared object file: No such file or directory

Were you able to resolve it?

Same as above issues

tjruwase commented 1 year ago

@feimadecaogaozhi, did you try changing the extensions folder as suggested above?

QianMuluo commented 1 year ago

I am facing a similar issue: ImportError: /root/.cache/torch_extensions/py38_cu102/transformer_inference/transformer_inference.so: cannot open shared object file: No such file or directory

Were you able to resolve it?

Maybe you can check whether you have installed transformers with both pip and conda. For multi-GPU training, having transformers installed via both pip and conda can cause a conflict; on a single GPU you may not run into this problem.

QianMuluo commented 1 year ago

I also ran into this problem and found it was because the gcc on the machine was older than 5.0.0; after upgrading gcc, the problem was solved. An old gcc may not support the compiler flags DeepSpeed needs for JIT compilation, so upgrading gcc to version 5 or newer fixes it.

yulingao commented 1 year ago

I also ran into this problem and found it was because the gcc on the machine was older than 5.0.0; after upgrading gcc, the problem was solved.

This doesn't work for me.

zjjott commented 1 year ago

When using accelerate, it starts multiple processes, and they all trigger the JIT compile at once, which causes this issue. We can trigger the DeepSpeed JIT compile once before running the task:

python -c "from deepspeed.ops.op_builder import UtilsBuilder;UtilsBuilder().load()"
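The race described above (several ranks JIT-compiling into the same cache directory at once) can also be serialized generically with a file lock, so only one process builds and the rest wait. A minimal sketch in plain Python; build_once and the .build.done marker are illustrative names, not DeepSpeed APIs:

```python
import fcntl
import os

def build_once(cache_dir, build_fn):
    """Run build_fn exactly once per cache_dir, even when many
    processes call this concurrently (POSIX file lock)."""
    os.makedirs(cache_dir, exist_ok=True)
    done_marker = os.path.join(cache_dir, ".build.done")
    with open(os.path.join(cache_dir, ".build.lock"), "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)    # other ranks block here
        if not os.path.exists(done_marker):
            build_fn()                      # first rank does the build
            open(done_marker, "w").close()  # later ranks skip the build
```

Running the one-line python -c command above before accelerate launch achieves the same effect more simply: the extension is compiled exactly once, so the worker processes just load the finished utils.so.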

Vince-Lau commented 1 year ago

This problem could be because the extensions folder is located in /root, which is privileged. Can you try using /tmp instead by setting export TORCH_EXTENSIONS_DIR=/tmp

I tried this; it works.

jomayeri commented 1 year ago

Closing as it seems a solution was found.