jquesnelle / yarn

YaRN: Efficient Context Window Extension of Large Language Models
MIT License

dataset preprocessing script #28

Open mces89 opened 9 months ago

mces89 commented 9 months ago

Hi, can you also share the preprocessing script that converts the dataset to the standard format? Also, why is the attention_mask in the dataset required?

shossain commented 9 months ago

I am looking forward to the tokenization script, too.