Open gongjingcs opened 3 years ago
@gongjingcs, can you please share the results of running ds_report
in both the working and failing installations?
@tjruwase working installations:
failing installations:
The ds_report results of the working and failing installations are the same.
@tjruwase, would you please provide some tips for solving this problem?
@gongjingcs, apologies for the delay on this. I will take a closer look today.
Hi @gongjingcs
I think you might have an incompatibility between the torch you installed and the one you are using. Also, the CUDA version you are using with torch 1.8 is lower than what torch supports, according to their website. Could you please try resolving these and rerun the experiment?
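One way to check the versions mentioned above is to print what torch reports and compare it with the system CUDA toolkit version. The snippet below is only a sketch: `major_minor` and `cuda_versions_match` are hypothetical helper names, not part of torch or DeepSpeed, and the system CUDA version string is assumed to be copied by hand (for example from `nvcc --version`). `torch.__version__` and `torch.version.cuda` are the attributes torch itself exposes.

```python
import re


def major_minor(version):
    """Parse the leading 'major.minor' of a version string, e.g. '1.8.0+cu111' -> (1, 8)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def cuda_versions_match(torch_cuda, system_cuda):
    """Hypothetical check: True when both CUDA version strings share major.minor."""
    return major_minor(torch_cuda) == major_minor(system_cuda)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("torch version:", torch.__version__)
        # torch.version.cuda is None for CPU-only builds of torch.
        print("CUDA version torch was built with:", torch.version.cuda)
```

Running this in both the working and the failing environment would show whether the two actually pick up the same torch build.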
@gongjingcs, did you try resolving the torch incompatibility?
Hi, I tried resolving the torch incompatibility. I downgraded my torch version to 1.7.1; however, it reports the same error.
@tjruwase
@gongjingcs, got it, thanks for confirming that it is not a torch incompatibility issue. Will take a look.
@gongjingcs, can you provide the repro steps with the transformer kernel that triggers the problem?
@tjruwase, of course.
step1:
DS_BUILD_OPS=1 python setup.py build_ext -j8 bdist_wheel
step2:
pip install dist/deepspeed-0.3.13+69bca4a-cp38-cp38-linux_x86_64.whl
step3:
At this point, DeepSpeed has been installed successfully.
ds_report shows:
step4:
Run the bing_bert demo you provide with the transformer kernel: https://github.com/microsoft/DeepSpeedExamples/blob/bdf8e59aede8c8e0577e8d4d557298ca8515268f/bing_bert/ds_train_bert_bsz64k_seq128.sh
Hi, I installed DeepSpeed using the following commands:
DS_BUILD_OPS=1 python setup.py build_ext -j8 bdist_wheel
pip install dist/deepspeed-0.3.13+69bca4a-cp38-cp38-linux_x86_64.whl
I ran a demo with the transformer kernel and met with the following error:
However, if I install DeepSpeed from source, the demo works well.
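Since the wheel install fails while the source install works, it may be worth confirming which copy of DeepSpeed each environment actually imports. This is a minimal sketch using only the standard library; `locate_package` is a hypothetical helper name, and nothing here is specific to DeepSpeed beyond the package name passed in.

```python
import importlib.util


def locate_package(name):
    """Return the file path Python would import `name` from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # A path inside a source checkout versus site-packages shows which install is active.
    print("deepspeed resolves to:", locate_package("deepspeed"))
```

If the printed path differs between the two setups, the environments are not loading the same build.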