microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
34.79k stars 4.05k forks

[BUG]error: can't copy 'deepspeed/accelerator': doesn't exist or not a regular file #3207

Open ucas010 opened 1 year ago

ucas010 commented 1 year ago

Describe the bug A clear and concise description of what the bug is.

To Reproduce Steps to reproduce the behavior: the install commands from the official documentation

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
pip install .

bug

      writing manifest file 'deepspeed.egg-info/SOURCES.txt'
      error: can't copy 'deepspeed/accelerator': doesn't exist or not a regular file
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for deepspeed
  Running setup.py clean for deepspeed
Failed to build deepspeed
Installing collected packages: deepspeed
  Running setup.py install for deepspeed ... error
  error: subprocess-exited-with-error

  × Running setup.py install for deepspeed did not run successfully.
  │ exit code: 1
  ╰─> [355 lines of output]
      DS_BUILD_OPS=0
      Install Ops={'async_io': False, 'cpu_adagrad': False, 'cpu_adam': False, 'fused_adam': False, 'fused_lamb': False, 'quantizer': False, 'random_ltd': False, 'sparse_attn': False, 'spatial_inference': False, 'transformer': False, 'stochastic_transformer': False, 'transformer_inference': False, 'utils': False}
      version=0.9.0+0b5252b, git_hash=0b5252b, git_branch=master
      install_requires=['hjson', 'ninja', 'numpy', 'packaging>=20.0', 'psutil', 'py-cpuinfo', 'pydantic', 'torch', 'tqdm']
      compatible_ops={'async_io': True, 'cpu_adagrad': True, 'cpu_adam': True, 'fused_adam': True, 'fused_lamb': True, 'quantizer': True, 'random_ltd': True, 'sparse_attn': True, 'spatial_inference': True, 'transformer': True, 'stochastic_transformer': True, 'transformer_inference': True, 'utils': True}
      ext_modules=[]

liulhdarks commented 1 year ago

I have solved it. The package handling in setup.py does not support symbolic links (soft links). You need to comment out the following code in setup.py first:

  create_dir_symlink('..\\..\\csrc', '.\\deepspeed\\ops\\csrc')
  create_dir_symlink('..\\..\\op_builder', '.\\deepspeed\\ops\\op_builder')
  create_dir_symlink('..\\accelerator', '.\\deepspeed\\accelerator')

Then manually copy csrc, op_builder and accelerator into the corresponding directories.
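
The manual copy step can also be scripted. The following is a minimal sketch, assuming the layout mentioned in this thread (top-level csrc, op_builder and accelerator directories, with links at deepspeed/ops/csrc, deepspeed/ops/op_builder and deepspeed/accelerator). The helper name replace_links_with_copies is made up for illustration and is not part of DeepSpeed.

```python
import os
import shutil

# (source directory, link location) pairs, relative to the repository root.
# The layout is taken from the paths mentioned in this thread.
LINKS = [
    ("csrc", os.path.join("deepspeed", "ops", "csrc")),
    ("op_builder", os.path.join("deepspeed", "ops", "op_builder")),
    ("accelerator", os.path.join("deepspeed", "accelerator")),
]


def replace_links_with_copies(repo_root="."):
    """Replace each symlink with a real copy of the directory it stands for.

    Hypothetical helper, not part of DeepSpeed. Returns the list of link
    locations that now hold a fresh copy.
    """
    replaced = []
    for source, link in LINKS:
        source_path = os.path.join(repo_root, source)
        link_path = os.path.join(repo_root, link)
        if not os.path.isdir(source_path):
            continue  # nothing to copy from
        if os.path.islink(link_path):
            os.unlink(link_path)  # removes the link only, never its target
        elif os.path.exists(link_path):
            continue  # already a real file or directory; leave it alone
        shutil.copytree(source_path, link_path)
        replaced.append(link)
    return replaced
```

Calling replace_links_with_copies(".") from the repository root before running pip install . should have the same effect as copying the three directories by hand.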

ldilov commented 1 year ago

@ucas010 The bug is in setup.py. Around line 270, where setup() is called, you have to add one more argument:

package_dir={"": "."},

so it should become:

setup(name='deepspeed',
      version=version_str,
      description='DeepSpeed library',
      long_description=readme_text,
      long_description_content_type='text/markdown',
      author='DeepSpeed Team',
      author_email='deepspeed-info@microsoft.com',
      url='http://deepspeed.ai',
      project_urls={
          'Documentation': 'https://deepspeed.readthedocs.io',
          'Source': 'https://github.com/microsoft/DeepSpeed',
      },
      install_requires=install_requires,
      extras_require=extras_require,
      packages=find_packages(include=['deepspeed', 'deepspeed.*']),
      package_dir={"": "."},
      include_package_data=True,
      scripts=[
          'bin/deepspeed', 'bin/deepspeed.pt', 'bin/ds', 'bin/ds_ssh', 'bin/ds_report', 'bin/ds_bench', 'bin/dsr',
          'bin/ds_elastic'
      ],
      classifiers=[
          'Programming Language :: Python :: 3.6', 'Programming Language :: Python :: 3.7',
          'Programming Language :: Python :: 3.8', 'Programming Language :: Python :: 3.9',
          'Programming Language :: Python :: 3.10'
      ],
      license='MIT',
      ext_modules=ext_modules,
      cmdclass=cmdclass)

Also, if you are running Windows and encounter compile errors like error C2398: Element '2': conversion from 'size_t' to '_Ty' requires a narrowing conversion

Consider going to deepspeed\csrc\transformer\inference\csrc\pt_binding.cpp. There you have to add two typecasts:

On prev_key:


auto prev_key = torch::from_blob(workspace + offset,
                                 {bsz, heads, all_tokens, k},
                                 {hidden_dim * InferenceContext::Instance().GetMaxTokenLenght(),
                                  k * InferenceContext::Instance().GetMaxTokenLenght(),
                                  k,
                                  1},
                                 options);

to become:


auto prev_key = torch::from_blob(workspace + offset,
                                 {bsz, heads, all_tokens, k},
                                 {static_cast<int64_t>(hidden_dim * InferenceContext::Instance().GetMaxTokenLenght()),
                                  static_cast<int64_t>(k * InferenceContext::Instance().GetMaxTokenLenght()),
                                  k,
                                  1},
                                 options);

Repeat the same typecast for prev_value. The error means that values in the list being initialized have type size_t (an unsigned 64-bit integer on 64-bit Windows), while a signed int64 is expected. We cast them to int64, which is safe here because int64's maximum positive value is far larger than any realistic value of these products.
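
To illustrate the range argument, here is a small Python sketch (not DeepSpeed code; as_int64 is a hypothetical helper). It relies on ctypes fixed-width integer types, which do no overflow checking, to show what reinterpreting an unsigned 64-bit value as signed does.

```python
import ctypes

INT64_MAX = 2**63 - 1   # largest value a signed 64-bit integer can hold
UINT64_MAX = 2**64 - 1  # largest value an unsigned 64-bit integer can hold


def as_int64(value):
    """Reinterpret an unsigned 64-bit value as signed (hypothetical helper)."""
    return ctypes.c_int64(value).value


# Values up to INT64_MAX survive the conversion unchanged.
print(as_int64(4096 * 2048))    # 8388608
# Only values above INT64_MAX wrap around to a negative number.
print(as_int64(INT64_MAX + 1))  # -9223372036854775808
```

The cast would only change a value larger than 2**63 - 1, which is not a plausible size for a tensor stride.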

Flamefire commented 1 year ago

I'm seeing the same issue: error: can't copy 'deepspeed/accelerator': doesn't exist or not a regular file

And indeed, deepspeed/accelerator has been a symlink since https://github.com/microsoft/DeepSpeed/pull/2560 was merged: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/accelerator

So the error doesn't even require the line in setup.py that creates the symlink, which now(?) runs on Windows only anyway. Hence removing that line doesn't solve the issue.

It is further complicated by the fact that this does not always happen. I haven't fully verified it, but I suspect it only appears when the setuptools_scm package is installed, which isn't always the case.

The package_dir={"": "."}, line addition from @ldilov fixes this for me. However, the package_data (such as deepspeed/ops/csrc) is then missing.

By experimenting with an environment where it works and comparing it to one where it doesn't, I found that the issue occurs when setuptools_scm is installed.
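
A quick way to check whether an environment falls into the affected case is to test for the package before building. This is a minimal sketch using only the standard library; is_installed is a hypothetical helper name, and the link between setuptools_scm and the error is the observation reported above, not something this check proves.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("setuptools_scm"):
    print("setuptools_scm is installed; the symlink error may occur")
else:
    print("setuptools_scm is not installed")
```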

Flamefire commented 1 year ago

I found the related bug #1909 and the solution there: https://github.com/microsoft/DeepSpeed/issues/1909#issuecomment-1225113348

Basically:

rm deepspeed/ops/{csrc,op_builder}
rm deepspeed/accelerator
cp -R csrc op_builder deepspeed/ops/
cp -R accelerator deepspeed/

And all works as far as I can tell.

I'd suggest not using symlinks in this repo at all, which would avoid this issue in the first place.

Or, if you have to for ease of development, do it the other way round: keep the real files where they are actually required and create the symlinks at the convenience locations (i.e. reverse source and target).
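
The reversed layout could be produced with something like the sketch below. It is only an illustration of the idea: reverse_link is a hypothetical helper, not part of DeepSpeed, and it assumes the link currently points at the real directory.

```python
import os
import shutil


def reverse_link(real_dir, link_path):
    """Swap a symlink and its target (hypothetical helper).

    Before: link_path is a symlink pointing at real_dir.
    After:  link_path holds the real files and real_dir is a symlink
            pointing at link_path.
    """
    real_dir = os.path.normpath(real_dir)
    link_path = os.path.normpath(link_path)
    if not os.path.islink(link_path):
        raise ValueError(f"{link_path} is not a symlink")
    os.unlink(link_path)              # drop the old link, not its target
    shutil.move(real_dir, link_path)  # files now live where the build needs them
    # Use a relative target so the checkout can be moved as a whole.
    target = os.path.relpath(link_path, start=os.path.dirname(real_dir) or ".")
    os.symlink(target, real_dir, target_is_directory=True)
```

For example, reverse_link("accelerator", "deepspeed/accelerator") run from the repository root would leave the files in deepspeed/accelerator and make the top-level accelerator a link to them.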

mrwyattii commented 1 year ago

Currently working on #4323 to remove the symlinks and hopefully resolve this issue. Please try that PR if you are still seeing this error.

vTuanpham commented 9 months ago

Replacing the symlinks with copies, as @Flamefire described above (the solution from #1909), worked for me, thanks!

sirus20x6 commented 2 weeks ago

This did not fix it for me on Arch Linux. It still says error: can't copy 'deepspeed/accelerator': doesn't exist or not a regular file, even though I deleted the symlinks and copied the directories over.

I tried to compile with: HIP_PLATFORM="amd" DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx1100" python setup.py install

sirus20x6 commented 2 weeks ago

Spoke too soon. Replacing the symlinks with copies didn't work, but adding package_dir={"": "."}, did fix it.