Open ucas010 opened 1 year ago
I have solved it. The packages option in setup.py does not support soft links (symlinks). You need to comment out the following code in setup.py first:
create_dir_symlink('..\\..\\csrc', '.\\deepspeed\\ops\\csrc')
create_dir_symlink('..\\..\\op_builder', '.\\deepspeed\\ops\\op_builder')
create_dir_symlink('..\\accelerator', '.\\deepspeed\\accelerator')
And then manually copy csrc, op_builder and accelerator to the corresponding directories.
@ucas010 The bug is in setup.py. On line ~270+ where the setup() call is made, you have to add one more argument:
package_dir={"": "."},
so it should become:
setup(name='deepspeed',
      version=version_str,
      description='DeepSpeed library',
      long_description=readme_text,
      long_description_content_type='text/markdown',
      author='DeepSpeed Team',
      author_email='deepspeed-info@microsoft.com',
      url='http://deepspeed.ai',
      project_urls={
          'Documentation': 'https://deepspeed.readthedocs.io',
          'Source': 'https://github.com/microsoft/DeepSpeed',
      },
      install_requires=install_requires,
      extras_require=extras_require,
      packages=find_packages(include=['deepspeed', 'deepspeed.*']),
      package_dir={"": "."},
      include_package_data=True,
      scripts=[
          'bin/deepspeed', 'bin/deepspeed.pt', 'bin/ds', 'bin/ds_ssh', 'bin/ds_report', 'bin/ds_bench', 'bin/dsr',
          'bin/ds_elastic'
      ],
      classifiers=[
          'Programming Language :: Python :: 3.6', 'Programming Language :: Python :: 3.7',
          'Programming Language :: Python :: 3.8', 'Programming Language :: Python :: 3.9',
          'Programming Language :: Python :: 3.10'
      ],
      license='MIT',
      ext_modules=ext_modules,
      cmdclass=cmdclass)
Also, if you are running Windows and you encounter compile errors like
error C2398: Element '2': conversion from 'size_t' to '_Ty' requires a narrowing conversion
go to deepspeed\csrc\transformer\inference\csrc\pt_binding.cpp. There you have to add two typecasts:
On prev_key:
auto prev_key = torch::from_blob(workspace + offset,
                                 {bsz, heads, all_tokens, k},
                                 {hidden_dim * InferenceContext::Instance().GetMaxTokenLenght(),
                                  k * InferenceContext::Instance().GetMaxTokenLenght(),
                                  k,
                                  1},
                                 options);
to become:
auto prev_key = torch::from_blob(workspace + offset,
                                 {bsz, heads, all_tokens, k},
                                 {static_cast<int64_t>(hidden_dim * InferenceContext::Instance().GetMaxTokenLenght()),
                                  static_cast<int64_t>(k * InferenceContext::Instance().GetMaxTokenLenght()),
                                  k,
                                  1},
                                 options);
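Repeat the same typecast for prev_value. What the error means is that the second braced list (the strides) contains values of type size_t, which is an unsigned 64-bit integer on a 64-bit build, while int64_t is expected, and a braced initializer list does not allow that narrowing conversion. We cast them all to int64_t since int64's max positive value is large enough to hold these strides safely.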
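To make the "large enough" argument concrete, here is a rough sketch in Python of the ranges involved. The hidden_dim and max_tokens values are made up for illustration and are not taken from DeepSpeed; the point is only that a realistic stride is far below the int64 limit, while the full size_t range is not.

```python
# Ranges of the two integer types on a 64-bit build.
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1  # largest value a 64-bit size_t can hold


def fits_in_int64(value):
    """Return True if value can be stored in a signed 64-bit integer."""
    return -(2**63) <= value <= INT64_MAX


# Hypothetical sizes, chosen only to illustrate the order of magnitude.
hidden_dim = 4096
max_tokens = 2048
stride = hidden_dim * max_tokens

print(fits_in_int64(stride))      # True: the cast loses nothing here
print(fits_in_int64(UINT64_MAX))  # False: why the compiler refuses to narrow implicitly
```

So the explicit static_cast only tells the compiler that we accept the conversion; for strides of this size no information is lost.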
I'm seeing the same issue: error: can't copy 'deepspeed/accelerator': doesn't exist or not a regular file
And indeed: this has been a symlink since https://github.com/microsoft/DeepSpeed/pull/2560 got merged: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/accelerator
So it doesn't even require the line in setup.py that creates the symlink, which now(?) is run on Windows only anyway. Hence removing that line doesn't solve the issue.
It is further complicated by this not always happening. I haven't fully verified it, but I suspect it only appears when the setuptools_scm package is installed, which isn't always the case.
The package_dir={"": "."}, line addition from @ldilov fixes this for me. However, the package_data (such as deepspeed/ops/csrc) is then missing.
By experimenting with an environment where it works and comparing it to one where it doesn't, I found the issue occurs when setuptools_scm is installed.
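For anyone who wants to check whether their environment matches this description, a small diagnostic sketch follows. The function names are made up for this example; it only reports whether setuptools_scm is importable and which of the three paths mentioned in this thread are symlinks in a checkout.

```python
import importlib.util
import os

# Paths discussed in this thread, relative to the repository root.
SUSPECT_PATHS = (
    "deepspeed/ops/csrc",
    "deepspeed/ops/op_builder",
    "deepspeed/accelerator",
)


def setuptools_scm_installed():
    """Return True if setuptools_scm can be imported in this environment."""
    return importlib.util.find_spec("setuptools_scm") is not None


def symlinked_paths(repo_root, candidates=SUSPECT_PATHS):
    """Return the candidate paths that are symlinks inside repo_root."""
    return [p for p in candidates if os.path.islink(os.path.join(repo_root, p))]


if __name__ == "__main__":
    print("setuptools_scm installed:", setuptools_scm_installed())
    print("symlinked paths:", symlinked_paths("."))
```

Run it from the repository root. If setuptools_scm is reported as installed and any of the paths are listed, the environment matches the failing case described here.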
I found the related bug #1909 and the solution there: https://github.com/microsoft/DeepSpeed/issues/1909#issuecomment-1225113348
Basically:
rm deepspeed/ops/{csrc,op_builder}
rm deepspeed/accelerator
cp -R csrc op_builder deepspeed/ops/
cp -R accelerator deepspeed/
And all works as far as I can tell.
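The same steps can be sketched in Python, which may be handy on Windows where rm and cp are not available. This is only an illustration under the assumption that it is run from the repository root; replace_symlink_with_copy is a helper name invented for this example, not something from the DeepSpeed code base.

```python
import os
import shutil


def replace_symlink_with_copy(link_path, source_dir):
    """Replace the symlink at link_path with a real copy of source_dir.

    Returns True if a copy was made, False if link_path is already a real directory.
    """
    if os.path.islink(link_path) or os.path.isfile(link_path):
        # A symlink, or a plain file left behind by a checkout
        # that could not create the symlink.
        os.remove(link_path)
    elif os.path.isdir(link_path):
        # Already a real directory, leave it alone.
        return False
    shutil.copytree(source_dir, link_path)
    return True


if __name__ == "__main__":
    # Same pairs as in the shell commands above.
    for link, source in [
        ("deepspeed/ops/csrc", "csrc"),
        ("deepspeed/ops/op_builder", "op_builder"),
        ("deepspeed/accelerator", "accelerator"),
    ]:
        if os.path.isdir(source):
            replace_symlink_with_copy(link, source)
```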
I'd suggest not using symlinks in this repo at all, which would avoid this issue in the first place.
Or, if you have to for ease of development, do it the other way round: create the symlinks at the convenience locations, not where the files are actually required (i.e. reverse source and target).
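A rough sketch of what "the other way round" could look like: the real directory lives inside the package (for example deepspeed/ops/csrc) and the top-level csrc is only a convenience symlink pointing at it. The helper name is invented for this example, and on Windows creating symlinks may need extra privileges.

```python
import os


def link_convenience_path(real_dir, link_path):
    """Create a symlink at link_path that points to real_dir.

    Returns False and does nothing if something already exists at link_path.
    """
    if os.path.lexists(link_path):
        return False
    # Use a relative target so the link survives moving the checkout.
    target = os.path.relpath(real_dir, os.path.dirname(link_path) or ".")
    os.symlink(target, link_path, target_is_directory=True)
    return True
```

For example, link_convenience_path("deepspeed/ops/csrc", "csrc") called from the repository root would leave the packaged path as a regular directory, so setup.py never has to copy a symlink.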
Currently working on #4323 to remove the symlinks and hopefully resolve this issue. Please try that PR if you are still seeing this error.
The workaround from #1909 quoted above (remove the symlinks and copy csrc, op_builder and accelerator into place) works for me, thanks!
This did not fix it for me on Arch Linux. It still says error: can't copy 'deepspeed/accelerator': doesn't exist or not a regular file, even though I deleted the symlinks and copied the directories over.
tried to compile with: HIP_PLATFORM="amd" DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx1100" python setup.py install
Spoke too soon. Replacing the symlinks with copies didn't work, but adding package_dir={"": "."}, did fix it.
Describe the bug
To Reproduce
Steps to reproduce the behavior: the official doc