huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0
4.53k stars, 393 forks

Fix dataloading for cpt #137

Closed: BramVanroy closed this 6 months ago

BramVanroy commented 6 months ago

This PR fixes an issue introduced in https://github.com/huggingface/alignment-handbook/pull/135, where "unused" columns were removed. However, what counts as an "unused" column is not user-defined, so the removal is prone to unexpected side effects. Moreover, that change only considered instruction or DPO-like datasets and did not account for pretraining datasets.

This PR makes sure that the user-defined text_column (the column that holds the pretraining text) is not removed when run_cpt.py is used.
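
To illustrate the idea, here is a minimal sketch of column filtering that keeps a user-defined text column. It is not the code from this PR: the function name `columns_to_remove`, the `DEFAULT_COLUMNS_TO_KEEP` list, and the example column names are all assumptions made for illustration.

```python
from typing import Iterable, List

# Hypothetical default list, chosen for illustration only.
# The handbook's actual list of kept columns may differ.
DEFAULT_COLUMNS_TO_KEEP = ["messages", "chosen", "rejected", "prompt", "completion", "label"]


def columns_to_remove(column_names: Iterable[str], columns_to_keep: Iterable[str]) -> List[str]:
    """Return the columns that are not in the keep list, preserving their order."""
    keep = set(columns_to_keep)
    return [name for name in column_names if name not in keep]


# A pretraining-style dataset whose text lives in a user-defined column.
pretraining_columns = ["id", "url", "text"]
text_column = "text"  # stands in for the user-defined text_column

# With only the instruction/DPO-oriented defaults, the text column would be dropped.
without_text = columns_to_remove(pretraining_columns, DEFAULT_COLUMNS_TO_KEEP)

# Adding the user-defined column to the keep list preserves it.
with_text = columns_to_remove(pretraining_columns, DEFAULT_COLUMNS_TO_KEEP + [text_column])

print(without_text)  # ['id', 'url', 'text']
print(with_text)     # ['id', 'url']
```

With a Hugging Face `datasets.Dataset` object `ds`, the result could be applied as `ds.remove_columns(columns_to_remove(ds.column_names, keep_list))`, where `keep_list` is whatever list of columns the caller wants to preserve.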

HuggingFaceDocBuilderDev commented 6 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.