This PR fixes an issue introduced in https://github.com/huggingface/alignment-handbook/pull/135, which removes "unused" columns. However, what counts as an "unused" column is not user-defined, so the removal is prone to unexpected side effects. Moreover, that change only considered instruction and DPO-like datasets and did not account for pretraining datasets.
This PR makes sure that the user-defined `text_column` (which is used to select the pretraining column) is not removed when `run_cpt.py` is used.
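
The idea can be illustrated with a minimal sketch. This is not the handbook's actual implementation: the helper name `columns_to_remove`, the `DEFAULT_COLUMNS_TO_KEEP` tuple, and the example column names are all assumptions made for illustration. It only shows how adding the user-defined `text_column` to the set of kept columns protects it from removal.

```python
from typing import Iterable, List, Optional

# Hypothetical default: columns a chat/DPO-oriented pipeline might keep.
# The real list in the handbook may differ.
DEFAULT_COLUMNS_TO_KEEP = ("messages", "chosen", "rejected", "prompt", "completion", "label")


def columns_to_remove(
    column_names: Iterable[str],
    columns_to_keep: Iterable[str] = DEFAULT_COLUMNS_TO_KEEP,
    text_column: Optional[str] = None,
) -> List[str]:
    """Return the columns that would be dropped, preserving input order.

    If `text_column` is given, it is always kept, even when it is not
    part of `columns_to_keep`.
    """
    keep = set(columns_to_keep)
    if text_column is not None:
        keep.add(text_column)
    return [name for name in column_names if name not in keep]


# Example with a pretraining-style dataset whose text lives in "content".
cols = ["id", "url", "content", "messages"]
without_fix = columns_to_remove(cols)                      # "content" is dropped
with_fix = columns_to_remove(cols, text_column="content")  # "content" survives
print(without_fix)
print(with_fix)
```

With a Hugging Face `datasets` object, the resulting list could then be passed to `Dataset.remove_columns`.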