Closed kfertakis closed 10 months ago
Hi, kfertakis.
We updated our algorithm recently, but the dataset scripts have not been updated yet. The new dataset should contain the prompt_ids, pretrain_ids, and loss_mask columns. I will update that part of the scripts this week. Thanks for your issue and commit.
Please refer to this PR: https://github.com/mindspore-lab/mindrlhf/pull/34
Hi,
The latest version of mindrlhf expects a different schema for the training dataset than the one downloaded with the getTLDRMR.py script. Previously, the following columns of the dataset were projected:
columns_to_project = ["prompt_ids", "prompt_mask", "original_sample_ids", "original_sample_mask"]
However, now the code tries to extract the following:
columns_to_project = ["prompt_ids", "pretrain_ids", "loss_mask"]
but the pretrain_ids and loss_mask columns are not present in the dataset.
Can you advise on how to resolve this and how to run the code end to end? Thanks
Commit: 5fb1273
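For anyone hitting the same error, the mismatch described above can be confirmed before training with a small check like the one below. This is only a sketch in plain Python: the helper name missing_columns is hypothetical and not part of mindrlhf, and the two column lists are copied from this issue. How the list of available columns is obtained from the actual dataset file is left out and would depend on your setup.

```python
# Column lists copied from the issue text above.
OLD_COLUMNS = ["prompt_ids", "prompt_mask", "original_sample_ids", "original_sample_mask"]
NEW_COLUMNS = ["prompt_ids", "pretrain_ids", "loss_mask"]


def missing_columns(available, required):
    """Return the required column names that are absent from `available`.

    Hypothetical helper for illustration; it is not part of mindrlhf.
    The order of `required` is preserved in the result.
    """
    present = set(available)
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # Compare the schema of the old dataset with what the new code projects.
    missing = missing_columns(OLD_COLUMNS, NEW_COLUMNS)
    if missing:
        print("Dataset is missing columns:", ", ".join(missing))
    else:
        print("Dataset schema matches the expected columns.")
```

With the old schema this reports pretrain_ids and loss_mask as missing, which matches the error described in the issue. It does not regenerate the dataset; that still requires the updated scripts from the PR linked above.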