Hi @anindya-saha! Indeed, we need to change the dtype here to `torch.uint8`. Would you like to open a PR to fix that?
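For context, a minimal, hypothetical sketch of the kind of dtype change being discussed; the exact location ("here") is a link in the comment above and is not reproduced in this thread, so the tensor name below is illustrative only.

```python
# Hypothetical illustration only: the exact line referred to above is a link
# that is not reproduced in this thread. This just shows what casting a mask
# tensor to torch.uint8 looks like.
import torch

causal_mask = torch.tril(torch.ones(8, 8))   # illustrative causal mask
causal_mask = causal_mask.to(torch.uint8)    # dtype change suggested above
```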
Thank you @regisss, I have added more observations under Expected behavior. Could you please review when you have a chance? The error confirms a DDP issue similar to the one discussed in https://github.com/huggingface/optimum-habana/pull/318
I did try to raise a PR from my account https://github.com/anindya-saha/ but I get this error. How can I get permission?
remote: Permission to huggingface/transformers.git denied to anindya-saha.
fatal: unable to access 'https://github.com/huggingface/transformers.git/': The requested URL returned error: 403
@anindya-saha For the PR, that's expected: only internal collaborators can push branches directly to the Hugging Face organization. You should work from a fork of the repo, and then you'll be able to open a PR :slightly_smiling_face:
Regarding your dataset issue, you are using the `--streaming` argument, right?
@regisss I raised a PR https://github.com/huggingface/transformers/pull/25665 from my fork. Please review when you have a chance.
To test locally, I made a quick change in the `modeling_gpt_bigcode.py` file in my venv. When I run the fine-tuning after this change, I get the following warning. Is this expected?
/home/devcloud/habanalabs-venv/lib/python3.8/site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py:180: UserWarning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead. (Triggered internally at /npu-stack/pytorch-fork/aten/src/ATen/native/TensorCompare.cpp:493.)
attn_weights = upcast_masked_softmax(attn_weights, attention_mask, mask_value, unscale, softmax_dtype)
@anindya-saha Sorry if I was not clear. I meant forking this repo (optimum-habana) and opening a PR modifying this file: https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py
The warning shouldn't be there; you should cast to `bool` at some point, for example here for GPT-J: https://github.com/huggingface/optimum-habana/blob/47ff6cb1b37e8cf1550a3cc10f707da7bb3ecda8/optimum/habana/transformers/models/gptj/modeling_gptj.py#L87
But don't worry, I'll check the PR and will let you know if it is missing :slightly_smiling_face:
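For anyone following along, here is a rough sketch of the kind of bool cast being suggested, assuming the mask arrives as uint8 and is later used as the condition of `torch.where` (which is what the deprecation warning above complains about). The function and variable names are illustrative, not the actual optimum-habana code.

```python
# Illustrative sketch, not the actual optimum-habana code: cast the mask to
# torch.bool before it is used as a `torch.where` condition, which is what the
# uint8 deprecation warning above asks for.
import torch

def apply_attention_mask(attn_weights, attention_mask, mask_value):
    # Cast a 0/1 uint8 mask to bool once, up front.
    if attention_mask.dtype != torch.bool:
        attention_mask = attention_mask.to(torch.bool)
    # A boolean condition avoids the "uint8 condition tensor" warning.
    return torch.where(attention_mask, attn_weights, mask_value)

# Example usage with toy shapes:
attn_weights = torch.randn(1, 1, 4, 4)
mask = torch.ones(1, 1, 4, 4, dtype=torch.uint8)
mask_value = torch.tensor(torch.finfo(attn_weights.dtype).min)
out = apply_attention_mask(attn_weights, mask, mask_value)
```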
Thank you @regisss. If I make changes only in optimum-habana, the error below still persists. I reverted the changes to the `transformers` package and made the changes locally in my venv, in the optimum-habana folder. I opened PR https://github.com/huggingface/optimum-habana/pull/353 so you can see the equivalent changes I made locally.
Training...
Training...
terminate called after throwing an instance of 'c10::Error'
what(): Getting size for given data type is not supported: 0
Exception raised from getHCCLDataSize at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/ProcessGroupHCCL.cpp:128 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f3a279b953c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f3a2797f10c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x544ea (0x7f3a202d64ea in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/distributed/_hccl_C.so)
frame #3: habana_helpers::JobThread::threadFunction() + 0x128 (0x7f3a23dd4ae8 in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3a2b7dadf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3a2bab1609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3a2bbeb133 in /lib/x86_64-linux-gnu/libc.so.6)
Internal Error: Received signal - Aborted
terminate called after throwing an instance of 'c10::Error'
what(): Getting size for given data type is not supported: 0
Exception raised from getHCCLDataSize at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/ProcessGroupHCCL.cpp:128 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fde6509553c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
The issue seems to be fixed with a newer version of Torch.
System Info
Hello Team, we are trying to fine-tune the `bigcode/starcoderbase-7b` model on a multi-HPU (8 HPU) node and have been following the guidance at https://github.com/huggingface/optimum-habana/tree/main/examples/language-modeling. However, we are encountering an issue similar to the one mentioned in https://github.com/huggingface/optimum-habana/pull/318.
We are also using a custom `class ConstantLengthDataset(IterableDataset)`. Essentially, we are trying to port https://github.com/bigcode-project/starcoder/blob/main/finetune/finetune.py to Habana, and we use `from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments` at the appropriate places.
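For reference, a rough sketch of what the Trainer-to-GaudiTrainer swap can look like when porting that script; the Gaudi config name, argument values, and dataset are placeholders, not our actual configuration.

```python
# Rough sketch of the Trainer -> GaudiTrainer swap described above. The Gaudi
# config name, argument values, and dataset are placeholders, not the actual
# configuration used in this issue.
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

model_name = "bigcode/starcoderbase-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assumed: a Gaudi configuration from the Hub (pick one appropriate for the model).
gaudi_config = GaudiConfig.from_pretrained("Habana/gpt2")

training_args = GaudiTrainingArguments(
    output_dir="./starcoder-finetuned",
    use_habana=True,
    use_lazy_mode=True,
    per_device_train_batch_size=1,
)

train_dataset = ...  # assumed: the custom ConstantLengthDataset instance mentioned above

trainer = GaudiTrainer(
    model=model,
    gaudi_config=gaudi_config,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```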
Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
We should be able to complete the training loop without issues. We did try to add a fake `__len__` method inside the `class ConstantLengthDataset(IterableDataset)` class (see the simplified sketch at the end of this section), but it still failed. At the same time, I see the following observations:
So, the issue arises whenever we shift to multiple HPUs, i.e. distributed training on more than 1 HPU.
A single Gaudi2 HPU has 96 GB of device memory.
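For completeness, here is a simplified, hypothetical version of what adding a fake `__len__` to an `IterableDataset` looks like; it is not the actual `ConstantLengthDataset` from the StarCoder finetune script.

```python
# Simplified, hypothetical illustration of a "fake" __len__ on an
# IterableDataset, as attempted above. Not the actual ConstantLengthDataset
# from the StarCoder finetune script.
import torch
from torch.utils.data import IterableDataset

class ConstantLengthDataset(IterableDataset):
    def __init__(self, token_ids, seq_length=1024, num_sequences=1000):
        self.token_ids = token_ids
        self.seq_length = seq_length
        self.num_sequences = num_sequences

    def __len__(self):
        # "Fake" length: gives the Trainer/DataLoader a nominal size even though
        # an iterable dataset has no inherent length.
        return self.num_sequences

    def __iter__(self):
        # Yield fixed-length blocks of token ids (greatly simplified).
        for start in range(0, self.seq_length * self.num_sequences, self.seq_length):
            chunk = self.token_ids[start : start + self.seq_length]
            if len(chunk) < self.seq_length:
                break
            ids = torch.tensor(chunk, dtype=torch.long)
            yield {"input_ids": ids, "labels": ids}
```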