rail-berkeley / serl

SERL: A Software Suite for Sample-Efficient Robotic Reinforcement Learning
https://serl-robot.github.io/
MIT License
323 stars 29 forks

what is the format of the dataset for training the binary reward classifier? #73

Closed lbsswu closed 1 month ago

lbsswu commented 1 month ago

what is the format of the dataset for training the binary reward classifier? can you provide some example pickle files?

charlesxu0124 commented 1 month ago

For simplicity, the classifier training code uses the same replay buffer and data format as the RL training code. So the pickle file should contain a list of transitions, where each transition is a dictionary with keys: observations, next_observations, actions, rewards, dones, and masks. Note that the observations and actions should match your environment's observation and action space, and only observations is used to train the classifier. You can refer to the record_demo.py script here for the correct format.
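The format described above can be sketched as follows. This is a minimal, hypothetical example, not SERL code: the shapes are placeholders (real ones must match your environment's observation and action spaces), and the `masks = 1 - done` convention is the common replay-buffer convention, assumed here.

```python
import pickle
import numpy as np

# Hypothetical dimensions for illustration only; replace with shapes
# matching your environment's observation and action spaces.
OBS_DIM, ACT_DIM = 7, 4

def make_transition(obs, next_obs, action, reward, done):
    """One transition dictionary in the replay-buffer format."""
    return {
        "observations": obs,
        "next_observations": next_obs,
        "actions": action,
        "rewards": reward,
        "dones": done,
        # Common convention: mask is 0 when the episode terminates, else 1.
        "masks": 1.0 - float(done),
    }

# Build a short 5-step trajectory and save it as a list of transitions.
transitions = []
obs = np.zeros(OBS_DIM, dtype=np.float32)
for t in range(5):
    action = np.zeros(ACT_DIM, dtype=np.float32)
    next_obs = obs + 0.01  # placeholder dynamics
    done = t == 4
    transitions.append(make_transition(obs, next_obs, action, float(done), done))
    obs = next_obs

with open("demo.pkl", "wb") as f:
    pickle.dump(transitions, f)
```

Only the `observations` entries are consumed by the classifier, but keeping the full transition format means the same pickle files load into the RL replay buffer unchanged.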

lbsswu commented 1 month ago

> For simplicity, the classifier training code uses the same replay buffer and data format as the RL training code. So the pickle file should contain a list of transitions, where each transition is a dictionary with keys: observations, next_observations, actions, rewards, dones, and masks. Note that the observations and actions should match your environment's observation and action space, and only observations is used to train the classifier. You can refer to the record_demo.py script here for the correct format.


Thanks for your reply!

I'm confused about the positive and negative training data.

For the positive demonstration data here (https://github.com/rail-berkeley/serl/blob/21ff8a018d77ac8ee8505cfda11c567702ee70b0/examples/async_cable_route_drq/train_reward_classifier.py#L36), do you mean it includes only the last state of each successful trajectory, and that the negative demonstration data consists of the remaining states?

For instance, suppose the task is picking up a cube and there are 10 successful trajectories (episodes), each of length 200, so we have 2000 states (observations). Furthermore, assume the task is done only at the last state (observation) of each trajectory, so we have 10 positive examples and 1990 negative examples. Is this right?
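The split described above can be sketched as a small helper. This is a hypothetical illustration, not code from the SERL repository; `split_classifier_data` and its `n_success_states` parameter are made up for this example.

```python
def split_classifier_data(trajectories, n_success_states=1):
    """Split trajectories into classifier training data.

    Labels the final `n_success_states` observations of each successful
    trajectory as positives and all earlier observations as negatives.
    `trajectories` is a list of episodes, each a list of transition dicts
    with an "observations" key (the replay-buffer format).
    """
    positives, negatives = [], []
    for traj in trajectories:
        for i, transition in enumerate(traj):
            if i >= len(traj) - n_success_states:
                positives.append(transition["observations"])
            else:
                negatives.append(transition["observations"])
    return positives, negatives
```

With 10 successful episodes of length 200 and only the final state labeled positive, this yields 10 positives and 1990 negatives, matching the arithmetic above.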

jianlanluo commented 1 month ago

If you run the data collection script, it will let you set how many positives you want to collect. Usually you want to cover a diverse set of states, both successful and failed ones; e.g., maybe 200 successes and 1000 failures.

lbsswu commented 1 month ago

> If you run the data collection script, it will let you set how many positives you want to collect. Usually you want to cover a diverse set of states, both successful and failed ones; e.g., maybe 200 successes and 1000 failures.

okay, thanks!