Closed: rahular closed this issue 5 months ago
Hi @rahular, thank you for reporting the issue. Unfortunately, I wasn't able to replicate the issue using the example scripts.
I think your data contains entries with a token length shorter than 1024 (860 in this case). The example data used in the example scripts are preprocessed to filter out any entries with length less than 1024.
Can you try either:
- reducing `max_length_enc` (default 1024) to something less than the shortest sequence length in your data, or
- filtering your data to keep only entries longer than `max_length_enc`.

Please let me know if the issue still remains!
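The filtering option above can be sketched as follows. This is a minimal, hypothetical helper, not code from the repo: `filter_short_entries` and the sample lengths are illustrative, and the per-entry token lengths are assumed to have been computed already (e.g. with the project's tokenizer). With a Hugging Face `datasets.Dataset`, the same idea would typically be expressed with `.filter(...)`.

```python
def filter_short_entries(token_lengths, max_length_enc=1024):
    """Return indices of entries whose token length is at least max_length_enc.

    Entries shorter than the encoder length would trigger the
    out-of-bounds indexing described in this issue, so they are dropped.
    """
    return [i for i, n in enumerate(token_lengths) if n >= max_length_enc]


# Example: the 860-token entry from the report is filtered out.
lengths = [860, 1024, 2048, 500]
print(filter_short_entries(lengths))  # -> [1, 2]
```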
Thank you for the response @MattYoon. I will try to use only data points that are longer than `max_length_enc`. Also, could you make DKYoon/slimpajama-200k public, so that I can replicate your results?
I think the example data are all public. Please correct me if I'm wrong. https://huggingface.co/DKYoon https://huggingface.co/datasets/DKYoon/slimpajama-200k
Ah yes, I was looking at the wrong place. Thanks! I will run a training with your data and close this issue if I don't face problems.
Yes, the length of the inputs was the issue. Closing this now, thanks again!
Hi, thank you for the great work! I really like the idea and am trying to replicate it. However, while training the model with no outputs (`output_exists=False`), I am running into an index out-of-bounds error (both when `use_dynamic_enc_length` is `True` and when it is `False`). The stack trace is as follows:
Any pointers would be helpful. Thanks!