huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

Error: Getting size for given data type is not supported while fine tuning starcoder model on optimum-habana #350

Closed anindya-saha closed 9 months ago

anindya-saha commented 1 year ago

System Info

Hello Team, we are trying to fine-tune the bigcode/starcoderbase-7b model on a multi-HPU node (8 HPUs) and have been following the guidance at https://github.com/huggingface/optimum-habana/tree/main/examples/language-modeling .
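
For reference, we launch the 8-HPU run roughly as follows. gaudi_spawn.py is the launcher shipped with the optimum-habana examples; the script name and its arguments below are placeholders for our ported finetune.py, not the exact command we use:

python gaudi_spawn.py --world_size 8 --use_mpi \
    finetune.py --model_path bigcode/starcoderbase-7b --output_dir ./checkpoints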

However, we are encountering an issue similar to the one mentioned in https://github.com/huggingface/optimum-habana/pull/318.

We are also using a custom class ConstantLengthDataset(IterableDataset). Essentially we are trying to port https://github.com/bigcode-project/starcoder/blob/main/finetune/finetune.py to Habana, and we are using from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments in the appropriate places, as sketched below.
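
Concretely, the port swaps the stock Trainer classes for their Gaudi counterparts along these lines (a minimal sketch; the Gaudi config name and argument values are illustrative, and train_dataset stands for our ConstantLengthDataset instance):

from transformers import AutoModelForCausalLM
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

# Gaudi-specific settings (mixed precision, fused optimizers, ...) from the Hub
gaudi_config = GaudiConfig.from_pretrained("Habana/gpt2")

model = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase-7b")

# GaudiTrainingArguments replaces TrainingArguments and enables HPU execution
training_args = GaudiTrainingArguments(
    output_dir="./checkpoints",
    use_habana=True,
    use_lazy_mode=True,
    per_device_train_batch_size=1,
)

train_dataset = ...  # our ConstantLengthDataset instance (placeholder)

# GaudiTrainer replaces Trainer and takes the Gaudi config explicitly
trainer = GaudiTrainer(
    model=model,
    gaudi_config=gaudi_config,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()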

Information

Tasks

Reproduction

Training...
Training...
Training...
terminate called after throwing an instance of 'c10::Error'
  what():  Getting size for given data type is not supported: 0
Exception raised from getHCCLDataSize at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/ProcessGroupHCCL.cpp:128 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ff0b09bd53c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7ff0b098310c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x544ea (0x7ff0b02f84ea in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/distributed/_hccl_C.so)
frame #3: habana_helpers::JobThread::threadFunction() + 0x128 (0x7ff020da6ae8 in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #4: <unknown function> + 0xd6df4 (0x7ff0b47dedf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7ff0b4ab5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff0b4bef133 in /lib/x86_64-linux-gnu/libc.so.6)

Internal Error: Received signal - Aborted
terminate called after throwing an instance of 'c10::Error'
  what():  Getting size for given data type is not supported: 0
Exception raised from getHCCLDataSize at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/ProcessGroupHCCL.cpp:128 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f881daf453c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f881daba10c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x544ea (0x7f88161634ea in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/distributed/_hccl_C.so)
frame #3: habana_helpers::JobThread::threadFunction() + 0x128 (0x7f8816ee3ae8 in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #4: <unknown function> + 0xd6df4 (0x7f8822904df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f8822bdb609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f8822d15133 in /lib/x86_64-linux-gnu/libc.so.6)

Internal Error: Received signal - Aborted
terminate called after throwing an instance of 'c10::Error'
  what():  Getting size for given data type is not supported: 0
Exception raised from getHCCLDataSize at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/ProcessGroupHCCL.cpp:128 (most recent call first):
...

Internal Error: Received signal - Aborted
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node idc382 exited on signal 6 (Aborted).

Expected behavior

We should be able to complete the training loop without issues. We did try to add a fake __len__ method inside the ConstantLengthDataset(IterableDataset) class, but it still failed.

def __len__(self):
    return 10
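
For context, the dummy method sits directly on the dataset class; everything else follows the starcoder finetune.py implementation. A sketch of the class shape only (the packing logic is omitted):

from torch.utils.data import IterableDataset

class ConstantLengthDataset(IterableDataset):
    def __iter__(self):
        # the real code tokenizes and packs examples into fixed-length
        # sequences of input_ids/labels here
        yield from ()

    def __len__(self):
        # dummy length added only so that callers expecting a sized dataset
        # do not fail; it did not resolve the HCCL error
        return 10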

At the same time, I have the following observations:

The issue arises whenever we shift to multiple HPUs, i.e. distributed training on more than 1 HPU.

A single Gaudi2 HPU has 96 GB of device memory.

regisss commented 1 year ago

Hi @anindya-saha! Indeed, we need to change the dtype here to torch.uint8. Would you like to open a PR to fix that?
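
For illustration, it is the same kind of dtype switch as in #318: register the causal-mask buffer as torch.uint8 instead of torch.bool. A standalone sketch, not the exact modeling code:

import torch
import torch.nn as nn

class MaskBufferSketch(nn.Module):
    # illustrative only: the mask buffer is stored as uint8 rather than bool,
    # mirroring the dtype change applied in PR #318
    def __init__(self, max_positions: int = 8):
        super().__init__()
        self.register_buffer(
            "bias",
            torch.tril(torch.ones((max_positions, max_positions), dtype=torch.uint8)),
            persistent=False,
        )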

anindya-saha commented 1 year ago

Thank you @regisss. I have added more observations under Expected behavior; could you please review them when you have a chance? The error confirms the same DDP issue as discussed in https://github.com/huggingface/optimum-habana/pull/318

I did try to raise the PR from my account https://github.com/anindya-saha/ but I get the error below. How can I get permission?

remote: Permission to huggingface/transformers.git denied to anindya-saha.
fatal: unable to access 'https://github.com/huggingface/transformers.git/': The requested URL returned error: 403
regisss commented 1 year ago

@anindya-saha For the PR it's expected, only internal collaborators can push branches directly into the Hugging Face organization. You should work from a fork of the repo and then you'll be able to push a PR :slightly_smiling_face:

Regarding your dataset issue, you are using the --streaming argument, right?

anindya-saha commented 1 year ago

@regisss I raised PR https://github.com/huggingface/transformers/pull/25665 from my fork. Please review it when you have a chance.

To test locally, I made a quick change in the modeling_gpt_bigcode.py file in my venv. When I run the fine-tuning after this change, I get the following warning. Is this expected?

/home/devcloud/habanalabs-venv/lib/python3.8/site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py:180: UserWarning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead. (Triggered internally at /npu-stack/pytorch-fork/aten/src/ATen/native/TensorCompare.cpp:493.)
  attn_weights = upcast_masked_softmax(attn_weights, attention_mask, mask_value, unscale, softmax_dtype)
regisss commented 1 year ago

@anindya-saha Sorry if I was not clear, I meant forking this repo (optimum-habana) and opening a PR modifying this file: https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py

The warning shouldn't be there, you should cast to bool at some point, for example here for GPT-J: https://github.com/huggingface/optimum-habana/blob/47ff6cb1b37e8cf1550a3cc10f707da7bb3ecda8/optimum/habana/transformers/models/gptj/modeling_gptj.py#L87
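
Roughly, the idea looks like this (a standalone illustration, not the exact GPT-J or GPT-BigCode code):

import torch

# keep the mask buffer as uint8, but cast it to bool right where it is used
# as a condition, so torch.where no longer emits the deprecation warning
query_length, key_length = 4, 4
bias = torch.tril(torch.ones((key_length, key_length), dtype=torch.uint8))
causal_mask = bias[key_length - query_length : key_length, :key_length].to(torch.bool)

attn_weights = torch.randn(query_length, key_length)
mask_value = torch.full((), torch.finfo(attn_weights.dtype).min)
attn_weights = torch.where(causal_mask, attn_weights, mask_value)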

But don't worry, I'll check the PR and will let you know if it is missing :slightly_smiling_face:

anindya-saha commented 1 year ago

Thank you @regisss. If I make changes only in optimum-habana, the error below still persists. I reverted the changes to the transformers package and made the following changes locally in my venv inside the optimum-habana folder.

I made PR https://github.com/huggingface/optimum-habana/pull/353 to show the equivalent changes I made locally.

Training...
Training...
terminate called after throwing an instance of 'c10::Error'
  what():  Getting size for given data type is not supported: 0
Exception raised from getHCCLDataSize at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/ProcessGroupHCCL.cpp:128 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f3a279b953c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f3a2797f10c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x544ea (0x7f3a202d64ea in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/distributed/_hccl_C.so)
frame #3: habana_helpers::JobThread::threadFunction() + 0x128 (0x7f3a23dd4ae8 in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3a2b7dadf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3a2bab1609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3a2bbeb133 in /lib/x86_64-linux-gnu/libc.so.6)

Internal Error: Received signal - Aborted
terminate called after throwing an instance of 'c10::Error'
  what():  Getting size for given data type is not supported: 0
Exception raised from getHCCLDataSize at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/ProcessGroupHCCL.cpp:128 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fde6509553c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
regisss commented 9 months ago

The issue seems to be fixed with newer versions of Torch.