mindspore-lab / mindrlhf

Apache License 2.0

Training dataset schema issue #30

Closed kfertakis closed 10 months ago

kfertakis commented 11 months ago

Hi,

The latest version of mindrlhf expects a different schema for the training dataset than the one produced by the getTLDRMR.py script. Previously, the following columns of the dataset were projected:

columns_to_project = ["prompt_ids", "prompt_mask", "original_sample_ids", "original_sample_mask"]

Now the code tries to extract the following instead:

columns_to_project = ["prompt_ids", "pretrain_ids", "loss_mask"]

However, the pretrain_ids and loss_mask columns are not present in the dataset.
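To make the mismatch concrete, here is a minimal sketch in plain Python that compares the two column lists quoted above. The helper `missing_columns` is a hypothetical name introduced for illustration and is not part of mindrlhf. The sketch also assumes the downloaded dataset exposes exactly the old column names.

```python
# Column lists quoted in this issue.
OLD_COLUMNS = ["prompt_ids", "prompt_mask", "original_sample_ids", "original_sample_mask"]
NEW_COLUMNS = ["prompt_ids", "pretrain_ids", "loss_mask"]


def missing_columns(available, required):
    """Return the required column names absent from `available`, in order.

    Hypothetical helper for illustration; not a mindrlhf function.
    """
    present = set(available)
    return [name for name in required if name not in present]


# Assumption: the dataset on disk still has the old schema.
dataset_columns = OLD_COLUMNS
print(missing_columns(dataset_columns, NEW_COLUMNS))  # ['pretrain_ids', 'loss_mask']
```

Running a check like this before projecting columns would surface the schema mismatch as a clear list of missing names.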

Can you advise on how to resolve this and how to run the code end to end? Thanks

Commit: 5fb1273

KerryKou commented 11 months ago

Hi, kfertakis.

We updated our algorithm recently, but the dataset scripts have not been updated yet. The new dataset should contain the prompt_ids, pretrain_ids, and loss_mask columns. I will update that part of the scripts this week. Thanks for your issue and commit.

ChessQian commented 11 months ago

Please refer to this PR: https://github.com/mindspore-lab/mindrlhf/pull/34