rll-research / url_benchmark


Task identification mechanism in APS #2

Closed: junsu-kim97 closed this issue 2 years ago

junsu-kim97 commented 2 years ago

Dear Misha Laskin,

I am very grateful that you open-sourced this well-written code.

It is really helpful for my research!

However, I have one question about the fine-tuning implementation of APS.

https://github.com/rll-research/url_benchmark/blob/710c3eb04e60ef559525bc90136ee4e1acae4c97/finetune.py#L196-L197

As shown in the code block in finetune.py, the task vector (the variable named meta) is updated periodically "after" the initial seed frames.

However, the original APS paper says that the task vector is searched using the initial seed frames and is then "fixed" during the fine-tuning phase.

Therefore, I believe the code should be revised as follows (with the inequality sign reversed): if self.global_step < (init_step // repeat) and self.global_step % every == 0:
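To make the two behaviors concrete, here is a small self-contained sketch of the gating condition (init_step, repeat, and every are the names from the linked snippet; the helper itself is only my own illustration, not code from the repo):

```python
def should_update_task(global_step: int, init_step: int, repeat: int,
                       every: int, fixed_after_seed: bool) -> bool:
    """Decide whether the task vector `meta` should be re-inferred at this step.

    fixed_after_seed=False mimics the current finetune.py condition
    (periodic updates after the seed frames); fixed_after_seed=True is the
    behavior I expected from the paper (search only during the seed frames,
    then keep the task vector fixed).
    """
    boundary = init_step // repeat
    in_window = (global_step < boundary) if fixed_after_seed else (global_step > boundary)
    return in_window and global_step % every == 0
```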

I wonder whether I am missing something, and I would appreciate it if you could provide some explanation.

Best,

Junsu Kim

forhaoliu commented 2 years ago

Hi Kim, happy to answer the question. APS was originally evaluated on Atari games, where the downstream environment budget is commonly 1e5 steps. In URLB, however, the downstream budget is 2e6 steps, because many of the included algorithms are fine-tuning based rather than the "identify the task and then adapt to it" style. To better utilize the downstream budget, we let the task vector adapt periodically.

junsu-kim97 commented 2 years ago

Thanks for the reply, Hao Liu.

Your answer addresses my question, and now I understand why the design choice for task-vector adaptation differs from that in the original paper.

Best, Junsu Kim

junsu-kim97 commented 2 years ago

@lhao499

By the way, I have one more question about the implementation of APS.

For task identification, the task vector is inferred as follows:

https://github.com/rll-research/url_benchmark/blob/710c3eb04e60ef559525bc90136ee4e1acae4c97/agent/aps.py#L246

From the docs for torch.linalg.lstsq (https://pytorch.org/docs/stable/generated/torch.linalg.lstsq.html),

I understand that the implementation currently computes

min_task || reward * task - phi(s) ||_F

whereas I believe it should compute

min_task || phi(s) * task - reward ||_F

So, I think that the code should be modified as:

task = torch.linalg.lstsq(rep, reward)[0][: rep.size(1), :][0]
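For reference, here is a minimal standalone sketch of the regression I have in mind (the shapes and variable setup are my own assumptions for illustration, not taken from aps.py):

```python
import torch

# Hypothetical batch of successor features phi(s) and extrinsic rewards.
batch_size, feat_dim = 128, 10
rep = torch.randn(batch_size, feat_dim)   # phi(s), shape (B, D)
reward = torch.randn(batch_size, 1)       # extrinsic rewards, shape (B, 1)

# torch.linalg.lstsq(A, B) minimizes ||A @ X - B||_F over X, so passing
# (rep, reward) solves min_task || phi(s) @ task - reward ||_F,
# whereas passing (reward, rep) would solve min_task || reward @ task - phi(s) ||_F.
task = torch.linalg.lstsq(rep, reward).solution.squeeze(-1)  # shape (D,)
```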

I would greatly appreciate it if you could let me know whether I am missing something.

Thank you!

forhaoliu commented 2 years ago

Hi @junsu-kim97, sorry about the delayed response. The two equations you wrote are both correct, in that the task is an L2-normalized vector in APS. The task is normalized because APS uses the von Mises-Fisher (vMF) distribution as the variational approximation. Hope this clarification helps.
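Concretely, the normalization I am referring to is just the following (a toy sketch; the shape is arbitrary):

```python
import torch
import torch.nn.functional as F

# Whichever regression is used to recover it, APS treats the task as a
# direction: the vector is L2-normalized, consistent with the vMF
# variational approximation.
task = torch.randn(10)            # hypothetical unnormalized solution, shape (D,)
task = F.normalize(task, dim=0)   # unit-norm task vector used during fine-tuning
```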

junsu-kim97 commented 2 years ago

Thanks for the reply! Your answer really helps me a lot.