vimalabs / VIMA

Official Algorithm Implementation of ICML'23 Paper "VIMA: General Robot Manipulation with Multimodal Prompts"

Questions about the training and evaluation pipelines #16

Closed Ji4chenLi closed 1 year ago

Ji4chenLi commented 1 year ago

Hi Yunfan,

Thank you so much for the great work! Since I'm trying to reproduce the results, I would like to ask some questions regarding the training and evaluation details.

  1. Can you provide the number of training epochs? (https://github.com/vimalabs/VIMA/issues/9)
  2. Let's look at Table 7 and denote the number of gradient steps as $N_{gs}$. Since you are using learning rate warm-up and cosine annealing, I assume the learning rate first increases linearly from 0 to 1e-4 for $N_{gs} \in [0, 7\text{K}]$. Then, for $N_{gs} \in [7\text{K} + 2i \cdot 17\text{K},\, 7\text{K} + (2i + 1) \cdot 17\text{K}]$, the learning rate decreases from 1e-4 to 0, and for $N_{gs} \in [7\text{K} + (2i + 1) \cdot 17\text{K},\, 7\text{K} + (2i + 2) \cdot 17\text{K}]$, it increases from 0 to 1e-4. Am I right?
  3. I notice that you fine-tune the last two layers and freeze all other layers of T5. Does it correspond to the following code?

        # Freeze every parameter of the T5 prompt encoder, then unfreeze the
        # feed-forward sublayer of block 11 and the final layer norm.
        for n, p in self.policy.t5_prompt_encoder.named_parameters():
            p.requires_grad = False
            if "t5.encoder.block.11.layer.1." in n or "final_layer_norm" in n:
                p.requires_grad = True
  4. When calculating the success rate (SR) for each task distribution and level, how many task instances did you sample? I assume the equation you used is $$SR = \frac{\text{number of successes}}{\text{number of total task instances}}$$
  5. Can you share your vectorized implementation for the policy evaluation?
  6. When evaluating the performance of your method and the other baselines, how did you set the parameter hide_arm_rgb when making the env? Should we always set it to True?

Thanks and regards, Jiachen

yunfanjiang commented 1 year ago

Hi Jiachen,

Thanks for your interest in our project. To answer your questions:

  1. We trained for 10 epochs in total.

  2. The schedule you wrote is cyclical; we instead used a schedule that first linearly increases and then monotonically decreases. A similar implementation can be found here.

  3. We used "layer" to refer to a transformer layer (block), so the last two transformer blocks are fine-tuned; see the sketch after this list.

  4. Yes, we computed the success rate for each task averaged over 100 instances; see the evaluation sketch at the end of this reply.

  5. Our vectorized env implementation is based on this.

  6. Yes, we set hide_arm_rgb=True to emulate workspaces that are free of robot arm occlusions.
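
Given the clarification in answer 3, the last two transformer blocks (blocks 10 and 11 in a 12-block encoder such as t5-base) plus the final layer norm stay trainable. A minimal standalone sketch, assuming the Hugging Face T5EncoderModel; in VIMA's wrapped module the parameter names carry an extra t5. prefix, as in the snippet quoted in the question:

    from transformers import T5EncoderModel

    # Sketch, not the official implementation: freeze everything, then
    # unfreeze the last two transformer blocks and the final layer norm.
    t5_prompt_encoder = T5EncoderModel.from_pretrained("t5-base")  # 12 blocks
    for n, p in t5_prompt_encoder.named_parameters():
        p.requires_grad = False
        if (
            "encoder.block.10." in n
            or "encoder.block.11." in n
            or "final_layer_norm" in n
        ):
            p.requires_grad = True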

Feel free to let me know if you have further questions.
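
For reference, here is a hypothetical end-to-end sketch of the per-task evaluation described in answers 4 and 6, assuming a gym-style VIMA-Bench interface; make_env, policy.act, and the "success" key in info are placeholders rather than the actual API:

    NUM_INSTANCES = 100  # task instances per task and evaluation level

    def evaluate_task(make_env, policy, task_name):
        """SR = number of successes / number of total task instances."""
        successes = 0
        for seed in range(NUM_INSTANCES):
            # hide_arm_rgb=True emulates occlusion-free workspaces (answer 6).
            env = make_env(task=task_name, hide_arm_rgb=True, seed=seed)
            obs = env.reset()
            done, info = False, {}
            while not done:
                obs, _, done, info = env.step(policy.act(obs))
            successes += int(info.get("success", False))
        return successes / NUM_INSTANCES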

Ji4chenLi commented 1 year ago

Thank you so much for your response! Most of my questions are now addressed. My remaining questions are:

  1. What's the min_lr when performing the cosine annealing? Is it 1e-5 per here and Chinchilla?
  2. Table 7 gives Warmup Steps = 7K and LR Cosine Annealing Steps = 17K. Could you let me know at which step the learning rate decreases to min_lr? Is it step 17K or step 24K (7K + 17K)?
yunfanjiang commented 1 year ago

> 1. What's the min_lr when performing the cosine annealing? Is it 1e-5 per here and Chinchilla?
> 2. Table 7 gives Warmup Steps = 7K and LR Cosine Annealing Steps = 17K. Could you let me know at which step the learning rate decreases to min_lr? Is it step 17K or step 24K (7K + 17K)?
  1. We used min_lr = 1e-7.
  2. It decreased for 17K steps, i.e., the learning rate reaches min_lr at step 24K (7K + 17K). A minimal sketch combining these numbers is below.
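
Putting the numbers from this thread together (7K warmup steps to a peak LR of 1e-4, then a single 17K-step cosine decay to min_lr = 1e-7), here is a minimal sketch of such a schedule using PyTorch's LambdaLR; the placeholder model and optimizer are assumptions, not VIMA's actual training code:

    import math
    import torch

    PEAK_LR, MIN_LR = 1e-4, 1e-7
    WARMUP, ANNEAL = 7_000, 17_000  # Table 7

    def lr_factor(step):
        """Multiplicative factor on PEAK_LR: linear warmup, then one cosine decay."""
        if step < WARMUP:
            return step / WARMUP
        progress = min((step - WARMUP) / ANNEAL, 1.0)  # reaches 1.0 at step 24K
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return (MIN_LR / PEAK_LR) + (1.0 - MIN_LR / PEAK_LR) * cosine

    model = torch.nn.Linear(4, 4)  # placeholder module
    optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
    # Call scheduler.step() after every gradient step.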