ryanjulian / embed2learn

Embedding to Learn

Try: Sample tasks in inverse proportion to their completion rates #51

Closed: ryanjulian closed this issue 5 years ago

ghost commented 6 years ago

We could use the `choice` function from the numpy library. In our case, the number of tasks would be passed as parameter `a`, and the probability of each task as parameter `p`.

Consider a set of n tasks T = [t_0, t_1, ..., t_{n-1}] and the corresponding completion rates J = [j_0, j_1, ..., j_{n-1}]. To obtain the probability of each task, I was thinking we could first subtract each completion rate from one, giving the complementary completion rates F = 1 - J, and then divide F by the sum of its elements to obtain the probabilities P, that is, P = F / ∑F (see the first sketch below).

For example, consider four tasks with the completion rates j_0 = 0.3, j_1 = 0.5, j_2 = 0.1, j_3 = 0.8. The complementary values would be f_0 = 0.7, f_1 = 0.5, f_2 = 0.9, f_3 = 0.2, their sum would be 2.3, and the resulting probabilities would be p_0 = 0.304, p_1 = 0.217, p_2 = 0.391, p_3 = 0.087. In that way we assign a higher probability to the tasks that have been completed least often.

For the implementation, I see that the current task selection strategies are found here, so I was wondering if I can add this new strategy there as well.

About the completion rates, my understanding is that every time an episode is successful, the completion rate increases. For example, if a task has been selected 30 times during training and was successful in only 23 of those episodes, its completion rate would be 23/30 ≈ 0.767. Now, whether an episode counts as completed would depend on the accumulated reward: if the episode's accumulated reward exceeds a certain threshold, we could consider it completed, right? (See the second sketch below.)

In the code, I see that the task is obtained in the rollout function and then used to obtain the latent variable. If my definition of the completion rate is correct, we could pass the reward threshold to the environment along with the task selection strategy, and keep a variable with the accumulated reward to update a data structure holding the completion rate of each task in the script for the MultiTaskEnv.
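Here is a minimal sketch of the sampling step I have in mind, assuming the completion rates arrive as a 1-D array in [0, 1]. The `sample_task` name and the uniform fallback for the all-tasks-solved case are my own hypothetical additions, not anything currently in the repo:

```python
import numpy as np

def sample_task(completion_rates, rng=np.random):
    """Sample a task index, favoring tasks with low completion rates."""
    J = np.asarray(completion_rates, dtype=np.float64)
    F = 1.0 - J                # complementary completion rates, F = 1 - J
    total = F.sum()
    if total == 0.0:           # every task fully solved: fall back to uniform
        P = np.full(len(J), 1.0 / len(J))
    else:
        P = F / total          # normalize so the probabilities sum to 1
    return rng.choice(len(J), p=P)

# The four-task example from above
J = [0.3, 0.5, 0.1, 0.8]
F = 1.0 - np.asarray(J)
print(F / F.sum())             # ≈ [0.304 0.217 0.391 0.087]
print(sample_task(J))          # returns 2 (the least-completed task) most often
```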
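And a rough sketch of the bookkeeping I am describing for the completion rates. `CompletionTracker`, `num_tasks`, and `reward_threshold` are hypothetical names of mine; how this would actually hook into the MultiTaskEnv script is exactly what I am asking about:

```python
class CompletionTracker:
    """Hypothetical helper that tracks per-task completion rates.

    An episode counts as completed when its accumulated reward
    exceeds `reward_threshold` (an assumed scalar threshold).
    """

    def __init__(self, num_tasks, reward_threshold):
        self.reward_threshold = reward_threshold
        self.selected = [0] * num_tasks   # times each task was sampled
        self.completed = [0] * num_tasks  # episodes above the threshold

    def update(self, task, episode_return):
        self.selected[task] += 1
        if episode_return > self.reward_threshold:
            self.completed[task] += 1

    def rates(self):
        # Unseen tasks get rate 0, so they are sampled most often at the start.
        return [c / s if s > 0 else 0.0
                for c, s in zip(self.completed, self.selected)]

# The example from above: one task selected 30 times, 23 successful episodes
tracker = CompletionTracker(num_tasks=1, reward_threshold=100.0)
for r in [150.0] * 23 + [10.0] * 7:
    tracker.update(0, r)
print(tracker.rates())  # ≈ [0.767]
```

The output of `rates()` could then be fed straight into `sample_task` above after each batch of rollouts.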