kaistAI / LangBridge

[ACL 2024] LangBridge: Multilingual Reasoning Without Multilingual Supervision
https://arxiv.org/abs/2401.10695

Index out of bounds when `output_exists=False` #1

Closed: rahular closed this issue 5 months ago

rahular commented 5 months ago

Hi, thank you for the great work! I really like the idea and am trying to replicate it. However, while training the model with no outputs (`output_exists=False`), I am running into an index out-of-bounds error (both when `use_dynamic_enc_length` is `True` and when it is `False`).

The stack trace is as follows:

```
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/rahul/anaconda3/envs/langbridge/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/rahul/anaconda3/envs/langbridge/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/rahul/LangBridge/python_scripts/train_langbridge.py", line 181, in collate_fn
    def collate_fn(batch): return self.collate_fn(
  File "/home/rahul/LangBridge/python_scripts/train_langbridge.py", line 116, in collate_fn
    offsets = enc_tokens['offset_mapping'][:, split_index][:, 0]
IndexError: index 1023 is out of bounds for dimension 1 with size 860
```

Any pointers would be helpful. Thanks!

MattYoon commented 5 months ago

Hi @rahular, thank you for reporting the issue. Unfortunately, I wasn't able to reproduce it using the example scripts.

I suspect your data contains entries whose token length is shorter than 1024 (860 in this case). The example data used by the example scripts is preprocessed to filter out any entries shorter than 1024 tokens.
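
To illustrate the failure mode, here is a minimal sketch (the shapes and variable names are illustrative, not the exact code in `train_langbridge.py`): indexing the offset mapping at a split index derived from `max_length_enc` goes out of range whenever an entry tokenizes to fewer tokens than that.

```python
# Minimal sketch of the failure mode (shapes/names are illustrative, not the repo's code).
import torch

max_length_enc = 1024
split_index = max_length_enc - 1   # a split point derived from max_length_enc, e.g. 1023
seq_len = 860                      # an entry that tokenizes to fewer than 1024 tokens

# offset_mapping from a fast tokenizer has shape (batch, seq_len, 2)
offset_mapping = torch.zeros(1, seq_len, 2, dtype=torch.long)

offsets = offset_mapping[:, split_index][:, 0]
# IndexError: index 1023 is out of bounds for dimension 1 with size 860
```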

Can you try either of the following?

  1. Setting `max_length_enc` (default 1024) to something less than the shortest sequence length in your data, or
  2. Filtering out any entries shorter than 1024 tokens from your data (a rough sketch follows below).
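
For reference, here is a rough sketch of option 2 using Hugging Face `datasets`; the dataset path, the `text` column, and the tokenizer checkpoint are placeholders you'd swap for your own setup:

```python
# Sketch of option 2: drop entries that tokenize to fewer than 1024 tokens.
# Dataset path, "text" column, and tokenizer checkpoint are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-xl")
dataset = load_dataset("json", data_files="train.json", split="train")

def long_enough(example, min_tokens=1024):
    # Keep only entries whose tokenized length reaches max_length_enc.
    return len(tokenizer(example["text"])["input_ids"]) >= min_tokens

filtered = dataset.filter(long_enough)
print(f"kept {len(filtered)} of {len(dataset)} entries")
```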

Please let me know if the issue persists!

rahular commented 5 months ago

Thank you for the response @MattYoon. I will try to use only data points that are longer than `max_length_enc`. Also, could you make `DKYoon/slimpajama-200k` public, so that I can replicate your results?

MattYoon commented 5 months ago

I think the example data are all public. Please correct me if I'm wrong.

https://huggingface.co/DKYoon
https://huggingface.co/datasets/DKYoon/slimpajama-200k

rahular commented 5 months ago

Ah yes, I was looking in the wrong place. Thanks! I will run training with your data and close this issue if I don't hit any problems.

rahular commented 5 months ago

Yes, the length of the inputs was the issue. Closing this now, thanks again!