polixir / NeoRL

Python interface for accessing the near real-world offline reinforcement learning (NeoRL) benchmark datasets
http://polixir.ai/research/neorl
Apache License 2.0

Question regarding the reward of sales promotion training dataset #10

Open britisony opened 8 months ago

britisony commented 8 months ago

Hi,

In the sales promotion environment the reward is computed as rew = (d_total_gmv - d_total_cost)/self.num_users, which means the operator observes a single reward signal aggregated over all users. However, in the offline training dataset the reward differs for each user across the 50 days. For example, see the user orders and reward plot below (attached as an image).

As per my understanding, the reward should be the same each day for the three users and should gradually increase over the 50 days as sales grow. Could you kindly let me know how the reward in the training dataset was calculated?
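
For reference, here is a minimal numeric sketch of the quoted formula (made-up numbers, not values from the dataset), showing that it produces a single scalar per day rather than a per-user value:

```python
# Hedged illustration of rew = (d_total_gmv - d_total_cost) / num_users,
# with made-up numbers; not taken from the dataset or the env code.
num_users = 3
d_total_gmv = 300.0   # total GMV over all users on a given day
d_total_cost = 60.0   # total coupon cost over all users on that day

rew = (d_total_gmv - d_total_cost) / num_users
print(rew)  # 80.0 -- one scalar, shared by every user on that day
```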

mzktbyjc2016 commented 8 months ago

Hi, the calculation of each user's reward is the same: it is based on $gmv - cost$. However, since the platform policy should consider the overall/average income, the reward for each user is set to the average reward for simplicity.
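
Restating this in symbols (our notation, not from the code): with $N$ users and $gmv_i$, $cost_i$ the day's GMV and coupon cost of user $i$,

$$r_i = gmv_i - cost_i, \qquad \bar r = \frac{1}{N}\sum_{i=1}^{N}\left(gmv_i - cost_i\right),$$

where $\bar r$ is what the environment returns as (d_total_gmv - d_total_cost)/num_users, and $r_i$ is the raw per-user signal.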

Also, the dataset's actions were made by a human operator (after data anonymization), and we retain the original reward in the dataset for researchers with specific needs.

britisony commented 8 months ago

Thank you for your reply. Could you kindly let me know how the original reward was calculated? I need it to recalculate the reward based on variations in user orders. Also, in the provided environment I noticed that val_initial_states = np.load(os.path.join(dir, f'test_initial_states_10000_people.npy')) is not restored on reset, which causes self.states to take different values every time the environment is reset after initialization. As a result, even with a deterministic action, the episode reward grows every time the environment is reset. For example, see the code snippet and episode-reward plot below (attached as images).

Could you let me know if this is a bug, or if there is a reason behind this environment design choice?
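
A minimal, self-contained toy illustration of the suspected aliasing (not the actual Marketing env; the class and shapes below are made up): if reset() re-binds self.states to the very array that stores the initial states, then in-place updates during a rollout also mutate those stored initial states, so the rewards drift across resets:

```python
import numpy as np

# Toy class, NOT the real env: it only demonstrates the aliasing pattern.
class ToyEnv:
    def __init__(self):
        self.val_initial_states = np.ones((3, 2))  # stand-in for the loaded .npy
        self.states = self.val_initial_states      # no copy -> same buffer

    def reset(self):
        self.states = self.val_initial_states      # still the same buffer
        return self.states

    def step(self):
        self.states += 1.0                         # in-place update leaks into val_initial_states
        return float(self.states.sum())

env = ToyEnv()
print(env.reset().sum())  # 6.0
env.step()
print(env.reset().sum())  # 12.0 -- the "initial" states have drifted
```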

mzktbyjc2016 commented 8 months ago

Thanks for reporting this issue. This environment was originally designed for online evaluation, so some of the code is tailored to evaluation rather than training. We have fixed this reset issue locally for training, but that branch has not been committed yet. The fix will arrive soon together with the newer sales promotion environment with a budget constraint.

Currently, you can revise this line with deepcopy() as a quick fix, i.e., "self.states = deepcopy(self.val_initial_states)" in https://github.com/polixir/NeoRL/blob/a4b6c578662a0566ac39bc6ed84236b853142e8f/neorl/neorl_envs/SalesPromotion/sales_promo/env/marketing.py#L439.
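
Applied to the same toy setup as above, the suggested change would look like this (again a toy class, not the real env; the only line taken from the comment above is the deepcopy assignment):

```python
from copy import deepcopy
import numpy as np

# Toy class with the suggested fix: reset() copies the stored initial states,
# so in-place updates during an episode can no longer leak across resets.
class ToyEnvFixed:
    def __init__(self):
        self.val_initial_states = np.ones((3, 2))
        self.states = deepcopy(self.val_initial_states)

    def reset(self):
        self.states = deepcopy(self.val_initial_states)  # fresh copy each reset
        return self.states

    def step(self):
        self.states += 1.0
        return float(self.states.sum())

env = ToyEnvFixed()
env.reset()
env.step()
print(env.reset().sum())  # 6.0 on every reset now
```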

For "I require it to recalculate the reward based on variations in user orders", as mentioned above, the current sp_env does not support this calculation. You may need to use the raw order_number (the user network output) and gmv&cost data in https://github.com/polixir/NeoRL/blob/a4b6c578662a0566ac39bc6ed84236b853142e8f/neorl/neorl_envs/SalesPromotion/sales_promo/env/marketing.py#L161 and https://github.com/polixir/NeoRL/blob/a4b6c578662a0566ac39bc6ed84236b853142e8f/neorl/neorl_envs/SalesPromotion/sales_promo/env/marketing.py#L426 respectively.