princeton-nlp / LESS

[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
MIT License

a small mistake in the code #15

Closed ZZZZZccccc123 closed 2 months ago

ZZZZZccccc123 commented 3 months ago

Thanks for sharing your code. In https://github.com/princeton-nlp/LESS/blob/main/less/data_selection/write_selected_data.py#L76, a small mistake in this version of the code causes sorted.csv to be written incorrectly. To fix it, line 76 and line 77 should swap positions.

ZZZZZccccc123 commented 3 months ago

After I use the selected data to train the model, the embedding layer sometimes goes wrong: copying a param with shape torch.Size([32001, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]). Many thanks for your reply!

vanduc103 commented 3 months ago

Thanks for sharing your code. In https://github.com/princeton-nlp/LESS/blob/main/less/data_selection/write_selected_data.py#L76, a small mistake in this version of the code causes sorted.csv to be written incorrectly. To fix it, line 76 and line 77 should swap positions.

I agree. With the original code, indices get duplicated in the data_from list.
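To illustrate the kind of ordering bug being described (this is an illustrative sketch, not the repository's actual code): if the loop records the source file for each selected example but appends to `data_from` *before* refreshing which file the current example belongs to, stale entries get written and indices end up duplicated or shifted. The names `tag_sources`, `file_of`, and `data_from` here are hypothetical.

```python
def tag_sources(sorted_scores, file_of, buggy=False):
    """Walk scores in descending order and record each example's source file.

    sorted_scores: list of (score, global_index), already sorted descending.
    file_of: dict mapping global_index -> source file name.
    With buggy=True, the append happens before the lookup is refreshed,
    mimicking two statements written in the wrong order.
    """
    data_from = []
    current = None
    for _, gidx in sorted_scores:
        if buggy:
            data_from.append(current)   # records the stale file from last step
            current = file_of[gidx]     # refreshed one statement too late
        else:
            current = file_of[gidx]     # refresh first...
            data_from.append(current)   # ...then record
    return data_from

scores = [(0.9, 0), (0.8, 1)]
sources = {0: "cot", 1: "dolly"}
print(tag_sources(scores, sources))              # ['cot', 'dolly']
print(tag_sources(scores, sources, buggy=True))  # [None, 'cot'] — shifted/stale
```

Swapping the two statements, as suggested above, restores the correct pairing between each selected example and its source.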

xiamengzhou commented 3 months ago

@ZZZZZccccc123 @vanduc103 Thanks for spotting the issue! Just fixed it. It should not affect the selected data though.

After I use the selected data to train the model, the embedding layer sometimes goes wrong: copying a param with shape torch.Size([32001, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).

It looks like you're trying to continue training a model that includes an extra token (likely padding) with a model that doesn't have this extra token. To resolve this, you either need to add a padding token to the current model or avoid continuing the training with this setup. Hope this helps!
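With Hugging Face transformers, the usual fix is to add the pad token via `tokenizer.add_special_tokens({"pad_token": ...})` and then call `model.resize_token_embeddings(len(tokenizer))` before loading the fine-tuned checkpoint. Below is a minimal plain-PyTorch sketch of what that resize does (the `grow_embedding` helper and the small sizes are illustrative; in the issue the real shapes are 32000 → 32001 rows of width 4096):

```python
import torch
import torch.nn as nn

def grow_embedding(emb: nn.Embedding, new_num_rows: int) -> nn.Embedding:
    """Return a larger embedding with the old rows copied over,
    mimicking what transformers' resize_token_embeddings() does."""
    grown = nn.Embedding(new_num_rows, emb.embedding_dim)
    with torch.no_grad():
        grown.weight[: emb.num_embeddings] = emb.weight  # keep trained rows
        grown.weight[emb.num_embeddings :].zero_()       # init new pad row(s)
    return grown

# Tiny stand-in for the real [32000, 4096] -> [32001, 4096] resize:
base = nn.Embedding(5, 4)
resized = grow_embedding(base, 6)
print(resized.weight.shape)  # torch.Size([6, 4])
```

After resizing, the checkpoint's [32001, 4096] embedding weights load without a shape mismatch.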