takuseno / d3rlpy

An offline deep reinforcement learning library
https://takuseno.github.io/d3rlpy
MIT License

idealistic environment simulation doesn't work #187

Closed: onozuka777 closed this issue 2 years ago

onozuka777 commented 2 years ago

I am trying to run a simulation under very idealistic conditions using d3rlpy.

My conditions:

- We have 1000 customers. For each customer there are 20 actions (0-19) and a customer stage (0-5).
- We use these 20 actions and the customer stage as observation parameters, so the observation has 21 parameters.
- An action is performed twice a month over a one-year record, so the data consists of 2 × 12 × 1000 = 24000 records.

Stage transitions:

- customer stage = 0 and act 1: customer stage goes up to 1
- customer stage = 1 and act 2: customer stage goes up to 2
- customer stage = 2 and act 3: customer stage goes up to 3
- customer stage = 3 and act 4: customer stage goes up to 4
- customer stage = 4 and act 5: customer stage goes up to 5

For each customer, the stage never goes down and stays the same unless the matching action is performed.

Reward: for each customer's last record, the reward is +1 if the customer stage is 5, and -1 otherwise.

Action: as mentioned, actions range from 0 to 19 (20 actions).

Terminal: each customer's last record gets terminal = 1; all other records get 0.

I made very idealistic data and random data; the random data's action, customer-stage, and terminal values still follow the rules above. I mixed the ideal and random records at fixed rates: 5% random + 95% ideal, 10% random + 90% ideal, 15% random + 85% ideal, ..., 85% random + 15% ideal, 90% random + 10% ideal, 95% random + 5% ideal.

Default values are used for the hyperparameters. I ran prediction on each of the 24000 records, but even with the 5% random data I cannot get the ideal predictions. A sketch of how I construct this data follows the table below.

Ideal data example:

| Stage | act1 | act2 | act3 | act4 | act5 | act6 | act7 | act8 | act9 | act10 | act11 | act12 | act13 | act14 | act15 | act16 | act17 | act18 | act19 | act20 | ideal prediction |
|-------|------|------|------|------|------|------|------|------|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|------------------|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 (act1) |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 (act2) |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 (act3) |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 (act4) |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 (act5) |
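For reference, here is a minimal sketch of the data construction as I understand it, assuming the d3rlpy v1.x `MDPDataset`/`DiscreteCQL` API. The per-step random rate, the action taken once stage 5 is reached, and the exact meaning of the one-hot act columns are my simplifications, not a verified repro:

```python
import numpy as np
from d3rlpy.dataset import MDPDataset
from d3rlpy.algos import DiscreteCQL

N_CUSTOMERS = 1000
STEPS = 24          # 2 actions/month x 12 months
N_ACTIONS = 20      # action indices 0-19 (act1-act20)
MAX_STAGE = 5

def make_episode(rng, random_rate):
    """One customer's 24-step record following the rules above."""
    obs_list, act_list, rew_list, term_list = [], [], [], []
    stage = 0
    for t in range(STEPS):
        if rng.random() < random_rate:
            action = int(rng.integers(N_ACTIONS))   # random record
        else:
            # ideal policy: at stage s, take action index s (= act{s+1});
            # the ideal action at stage 5 is unspecified, so 0 is an assumption
            action = stage if stage < MAX_STAGE else 0
        obs = np.zeros(1 + N_ACTIONS, dtype=np.float32)
        obs[0] = stage
        obs[1 + action] = 1.0   # assumption: act flags one-hot the taken action
        obs_list.append(obs)
        act_list.append(action)
        # transition rule: stage s + action index s -> stage s+1 (never goes down)
        if stage < MAX_STAGE and action == stage:
            stage += 1
        done = t == STEPS - 1
        # reward only on the last record: +1 if the customer reached stage 5
        rew_list.append((1.0 if stage == MAX_STAGE else -1.0) if done else 0.0)
        term_list.append(1.0 if done else 0.0)
    return obs_list, act_list, rew_list, term_list

rng = np.random.default_rng(0)
obs, acts, rews, terms = [], [], [], []
for _ in range(N_CUSTOMERS):
    o, a, r, t = make_episode(rng, random_rate=0.05)  # 5% random + 95% ideal
    obs += o; acts += a; rews += r; terms += t

dataset = MDPDataset(
    observations=np.asarray(obs, dtype=np.float32),
    actions=np.asarray(acts),
    rewards=np.asarray(rews, dtype=np.float32),
    terminals=np.asarray(terms, dtype=np.float32),
)

cql = DiscreteCQL()            # default hyperparameters, as described above
cql.fit(dataset, n_epochs=10)  # n_epochs is illustrative
predicted = cql.predict(np.asarray(obs, dtype=np.float32))  # greedy actions
```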

What could be causing this? Thank you in advance.

pathway commented 2 years ago

I think this is not related to d3rlpy, but to your env sim.

I suggest you try your env sim in another RL library. Does it work there?

(Besides that, your description is long and detailed. In general, for bugs it's best to distill the issue down to the smallest possible case.) Good luck!

onozuka777 commented 2 years ago

Thank you for your suggestion, and sorry for the long write-up.

I have actually been engaged in this work for about a year, trying to apply reinforcement learning to a real-world business task such as sales. At first, as you suggest, I used a Fitted Q Network in a SAS program with real sales data, but the predicted actions included many actions that rarely appeared in the training dataset. In the course of my research I learned that this kind of task is characterized as offline learning, so I moved to d3rlpy, which has the CQL algorithm. I also learned that in offline learning we need to take care of distributional shift, where actions that rarely appear in the training data can be over-estimated at prediction time.

I applied d3rlpy to the real sales data. This time, in the predicted data, each action appeared with a frequency proportional to its frequency in the training data. So I decided to run a simulation. At first I suspected something was wrong with the Python program I had written, but I fixed those bugs. Even so, the simulation still predicts a number of actions that almost never appear in the training set.

My suspicion now is that some condition incompatible with reinforcement learning may have crept into this idealized simulation setup, but I honestly have no idea what it is. Again, sorry for the long and unclear writing. Something is wrong; if you can think of anything, I would appreciate your wisdom.
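One concrete way to check the distributional-shift suspicion is to compare each action's frequency in the training records with its frequency among the greedy predictions. A minimal sketch, reusing the `acts` and `predicted` arrays from the sketch above (the 1%/5% thresholds are arbitrary heuristics):

```python
import numpy as np

def action_frequencies(actions, n_actions=20):
    """Relative frequency of each discrete action index."""
    counts = np.bincount(np.asarray(actions), minlength=n_actions)
    return counts / counts.sum()

train_freq = action_frequencies(acts)       # actions in the training records
pred_freq = action_frequencies(predicted)   # greedy actions from cql.predict(...)

for a in range(20):
    # heuristic flag: rarely seen in training but often predicted
    flagged = train_freq[a] < 0.01 and pred_freq[a] > 0.05
    note = "  <-- possible over-estimation" if flagged else ""
    print(f"act{a + 1}: train {train_freq[a]:.3f}  predicted {pred_freq[a]:.3f}{note}")
```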

pathway commented 2 years ago

I suggest you find someone to consult with to solve your specific problem. My contact details are in my GitHub profile if you want to discuss this with me.

As this is not related to d3rlpy, I suggest this issue be closed.

onozuka777 commented 2 years ago

Thank you. I have confirmed your e-mail address (robin@pathwayi.com). May I contact you by e-mail? If so, I will close this issue.

pathway commented 2 years ago

yes, thank you