voidful / TextRL

Implementation of ChatGPT RLHF (Reinforcement Learning from Human Feedback) on any generation model in Hugging Face's transformers (bloomz-176B/bloom/gpt/bart/T5/MetaICL)

MIT License

539 stars, 60 forks

About the compare_sample #11

Closed: jkwang93 closed this issue 1 year ago

jkwang93 commented 1 year ago

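Thank you very much for providing this code. However, when I set compare_sample=3, I get the following error:

in _compute_explained_variance return float(1 - np.var(t - y) / vart)

ValueError: operands could not be broadcast together with shapes (2,3) (2,)

How should I handle this? I need multiple compare_sample values to better estimate the current state, so I would like to increase this setting. I would also like to know what the update_interval and minibatch_size parameters do. Thank you very much.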
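For readers puzzled by the shapes in the error, here is a minimal pure-Python sketch of the NumPy broadcasting rule: shapes are aligned from the last axis, and each pair of dimensions must be equal or contain a 1. The helper name broadcast_shape is hypothetical and is not part of TextRL or NumPy. With (2,3) and (2,), the trailing dimensions 3 and 2 clash, which is why t - y fails.

```python
def broadcast_shape(a, b):
    """Return the broadcast result shape of shapes a and b, or raise ValueError.

    Hypothetical helper that mirrors NumPy's broadcasting rule for two shapes.
    """
    result = []
    # Walk the axes from last to first; a missing leading axis counts as size 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"shapes {a} and {b} cannot be broadcast together")
        # A size-1 axis stretches to match the other operand.
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


# The shapes from the error: trailing axes are 3 and 2, so this fails.
try:
    broadcast_shape((2, 3), (2,))
    mismatch = False
except ValueError:
    mismatch = True

# Giving the second operand a trailing axis of size 1 makes the shapes compatible.
compatible = broadcast_shape((2, 3), (2, 1))
```

In NumPy terms, the second case corresponds to something like t - y[:, None]. Whether that is the right fix depends on what t and y are meant to represent, so treat it as an illustration of the rule rather than the change made in the repository.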

voidful commented 1 year ago

Fixed _compute_explained_variance in commit 4004a584c8a5d38974837c191c5905b64c2b72ba. update_interval is the number of steps between updates of the dataset (replay buffer). minibatch_size is how much data is used in each update of the policy and value function.
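To make the two parameters more concrete, here is a toy sketch of how such settings typically interact in a PPO-style training loop. It assumes that the collected steps are turned into a dataset every update_interval steps and then consumed in chunks of minibatch_size. The function run_toy_loop and the placeholder transitions are hypothetical, and this is not the actual TextRL training code.

```python
import random


def run_toy_loop(total_steps, update_interval, minibatch_size, seed=0):
    """Return the size of every minibatch that would be used for an update."""
    rng = random.Random(seed)
    buffer = []            # stand-in for the collected transitions
    minibatch_sizes = []
    for step in range(1, total_steps + 1):
        buffer.append(step)  # placeholder for one environment step
        if step % update_interval == 0:
            # Every update_interval steps, the buffer becomes the training dataset.
            rng.shuffle(buffer)
            for start in range(0, len(buffer), minibatch_size):
                batch = buffer[start:start + minibatch_size]
                # A real agent would update the policy and value function here.
                minibatch_sizes.append(len(batch))
            buffer.clear()
    return minibatch_sizes


sizes = run_toy_loop(total_steps=10, update_interval=5, minibatch_size=2)
```

Under these assumptions, a larger update_interval means each update sees more freshly collected data, while minibatch_size only controls how that data is split into gradient steps.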