google-research / l2p

Learning to Prompt (L2P) for Continual Learning @ CVPR22 and DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning @ ECCV22
https://arxiv.org/pdf/2112.08654.pdf
Apache License 2.0

Inference #28

Closed YUZ128pitt closed 1 year ago

YUZ128pitt commented 1 year ago

Hi, thanks for the interesting work.

I have one question regarding the choice of prompt during the testing.

It seems that both DualPrompt and L2P run in batch mode during testing, where each batch selects a single set of prompts by majority voting. However, the assumption that every test sample in a batch comes from the same task is questionable. Do you have any thoughts about this? Looking forward to hearing from you!
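
For concreteness, here is a minimal sketch of the two selection modes as I understand them; all names (select_prompts, prompt_keys, top_k) are illustrative and not the actual identifiers in this repo:

```python
import torch
import torch.nn.functional as F

def select_prompts(query, prompt_keys, top_k=5, batchwise=True):
    """Pick top-k prompt indices per sample; optionally share one
    majority-voted index set across the whole batch (batch mode).

    query:       [B, D] frozen-feature query, one per test sample
    prompt_keys: [P, D] learnable keys of the prompt pool
    """
    # cosine similarity between every query and every prompt key
    sim = F.normalize(query, dim=-1) @ F.normalize(prompt_keys, dim=-1).T  # [B, P]
    _, idx = sim.topk(top_k, dim=-1)                                       # [B, top_k]

    if batchwise:
        # majority vote: count how often each prompt id was picked in the
        # batch, then give every sample the same most-frequent ids
        ids, counts = idx.unique(return_counts=True)
        k = min(top_k, ids.numel())
        major = ids[counts.topk(k).indices]
        idx = major.expand(query.size(0), -1)  # every sample gets the same prompts
    return idx
```

With batchwise=True, every sample in the batch ends up with the same prompts, which is exactly the assumption I am questioning when a test batch can mix tasks.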

Bests, Yuansheng

gqk commented 1 year ago

@YUZ128pitt Hi, I notice this issue too, have you tested the code with batchwise_prompt=False?

YUZ128pitt commented 1 year ago

@YUZ128pitt Hi, I notice this issue too, have you tested the code with batchwise_prompt=False?

Hi, thanks for sharing this information.

I have not tested batchwise_prompt=False because I cannot run the code on my machine. However, I did implement this trick in my current work, and the results show that CIL accuracy can get close to TIL accuracy when equipped with it, i.e., about 0.95 accuracy, which is better than the upper bound (joint training). The trick contributes more than a 20% accuracy gain in my work. I think the reason is that it reduces the risk of misidentifying the task id.

While the results are encouraging, I am a little concerned that assuming all data in a batch come from the same task is not realistic in a continual learning scenario. I wonder what you think about this issue; looking forward to your response! Thanks!

gqk commented 1 year ago

Hi, thanks for the reply!

I'm afraid the assumption is not reasonable in continual learning, but training with batch-wise prompts and evaluating with instance-wise prompts is acceptable.

In my PyTorch reproduction, the L2P result on the CIFAR100 benchmark drops from 82.05% to 81.29% with batchwise_prompt=False in evaluation, which makes sense.

KingSpencer commented 1 year ago

Hi,

Thanks for the great catch!

Yes, I believe I should make this trick clearer in the implementation. As @gqk mentioned, the assumption is only reasonable if we believe the examples within a batch are similar enough to select the same prompt ids at test time. However, it really depends on the test setting, and in principle we cannot make this assumption if there are no task boundaries at test time. A more reasonable use of this trick, I think, is as a hyperparameter to search on the validation set in practice (when there is no prior knowledge about your dataset).

Best, Zifeng

YUZ128pitt commented 1 year ago

Hi, thanks for the reply!

I'm afraid the assumption is not reasonable in continual learning, but training with batch-wise prompts and evaluating with instance-wise prompts is acceptable.

In my PyTorch reproduction, the L2P result on the CIFAR100 benchmark drops from 82.05% to 81.29% with batchwise_prompt=False in evaluation, which makes sense.

Hi, thank you so much for sharing your results.

First, I totally agree with the point you made: train with batch-wise prompts and evaluate with instance-wise prompts.

Second, regarding your results, to be honest, I am a little surprised that the accuracy drops only a little. I wonder if it is because setting batchwise_prompt=False still does not perform instance-wise prompt selection. At least, having looked carefully into the code, the official implementation does not achieve that.

There is a simple way to test this assumption: shuffle the test data across tasks, so that each batch includes data from different tasks. If the prompt selection really is instance-wise, the test results should not be affected; a concrete sketch is given below. Many thanks if you could let me know the results!
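
To make the check concrete, here is a minimal sketch of the mixed-task evaluation loader I have in mind (the per-task dataset list per_task_test_sets is an assumed way of storing the test data, not the repo's actual interface):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader

def mixed_task_loader(per_task_test_sets, batch_size=128, seed=0):
    """Merge all tasks' test sets and shuffle, so each batch almost
    surely contains samples from several different tasks."""
    merged = ConcatDataset(per_task_test_sets)
    generator = torch.Generator().manual_seed(seed)
    return DataLoader(merged, batch_size=batch_size, shuffle=True, generator=generator)
```

If prompt selection is truly instance-wise, accuracy under this loader should match the per-task evaluation; if it drops, the batch-wise vote was doing the heavy lifting.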

Bests, Yuansheng

YUZ128pitt commented 1 year ago

Hi Zifeng,

Thanks for your reply. P.S. I was trying to reach out to you at the ECCV conference (^-^).

In principle, I agree that we cannot make this assumption if there are no task boundaries at test time. Moreover, I think that is the most realistic class-incremental testing scenario. The trick significantly reduces the error in identifying the task id of a test sample, even though I do not use any hyperparameters.

Both L2P and DualPrompt use a nearest-class-mean classifier for task-id identification in the embedding space (which is not trained on the downstream tasks, and thus avoids the forgetting issue). I think the nearest-class-mean classifier suffers in two respects: (1) each task, which includes multiple classes, is represented by a single prototype, and (2) the embedding function is not trained on the downstream data. I found an ICLR 2023 attempt to address the first issue: https://openreview.net/forum?id=BSww-NrOzJ.

While I have some concerns about the inference, I enjoyed this work, appreciate the idea, and learned much from it.

Bests, Yuansheng

prachigarg23 commented 1 year ago

Hi @YUZ128pitt, @KingSpencer, in my current work it seems that not using batchwise prompts somehow yields better results. Thanks for your discussion on task boundaries, etc.

I have one question. You mention:

Both L2P and DualPrompt use a nearest-class-mean classifier for task-id identification in the embedding space (which is not trained on the downstream tasks, and thus avoids the forgetting issue).

Where is this mentioned in the papers or included in the code? My understanding is that during training the loss is weighted so that it is 1 only for current-task classes and 0 otherwise, but I don't see where a nearest-class-mean classifier is used for classification.
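
(For concreteness, the kind of per-task restriction I am asking about could be implemented roughly as in the sketch below, by masking the logits of non-current classes; the helper name and arguments are illustrative, and whether the repo does exactly this is part of my question.)

```python
import torch

def restrict_to_task_classes(logits, current_task_classes):
    """Illustrative sketch: push logits of classes outside the current
    task to -inf, so the softmax (and hence the cross-entropy) only
    ranges over the current task's classes during training."""
    masked = torch.full_like(logits, float('-inf'))
    masked[:, current_task_classes] = logits[:, current_task_classes]
    return masked
```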

Can anyone shed light on how the previous and new classes are demarcated during training and evaluation in L2P? This detail is crucial for my experiments; any help will be appreciated.

YUZ128pitt commented 1 year ago

Hi,

The batch-wise prompts yield a performance boost when the local stationarity assumption holds, i.e., when each test batch comes from the same task. When the assumption does not hold, which is the more realistic scenario, the batch-wise prompts hurt performance.

As for the nearest-mean classifier, I don't think the paper mentions it explicitly. However, I feel the key-value prompt search is essentially equivalent to a nearest-mean classifier.
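
To illustrate what I mean by that equivalence, here is a tiny sketch (all names are illustrative, not the repo's actual code): each prompt key plays the role of a class mean or prototype, and a query is assigned to whichever key is closest in cosine similarity.

```python
import torch
import torch.nn.functional as F

def nearest_key(query, prompt_keys):
    """Assign each query to its most similar prompt key, exactly like a
    nearest-mean classifier with the keys acting as the 'means'."""
    q = F.normalize(query, dim=-1)        # [B, D]
    k = F.normalize(prompt_keys, dim=-1)  # [P, D]
    return (q @ k.T).argmax(dim=-1)       # [B] index of the closest key
```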

Thanks

lvv0 commented 1 year ago

@YUZ128pitt Hi, I notice this issue too, have you tested the code with batchwise_prompt=False?

Hi, thanks for sharing this information.

I have not tested batchwise_prompt=False because I cannot run the code on my machine. However, I did implement this trick in my current work, and the results show that CIL accuracy can get close to TIL accuracy when equipped with it, i.e., about 0.95 accuracy, which is better than the upper bound (joint training). The trick contributes more than a 20% accuracy gain in my work. I think the reason is that it reduces the risk of misidentifying the task id.

While the results are encouraging, I am a little concerned that assuming all data in a batch come from the same task is not realistic in a continual learning scenario. I wonder what you think about this issue; looking forward to your response! Thanks!

@YUZ128pitt @KingSpencer @gqk I noticed this issue too; it seems unreasonable, but it can achieve high accuracy. So I want to know whether I can use this trick in my work as well? hahaha~

libo-huang commented 3 weeks ago

Hi, thanks for the reply!

I'm afraid the assumption is not reasonable in continual learning, but training with batch-wise prompts and evaluating with instance-wise prompts is acceptable.

In my PyTorch reproduction, the L2P result on the CIFAR100 benchmark drops from 82.05% to 81.29% with batchwise_prompt=False in evaluation, which makes sense.

Great, I agree with you!