mmasana / FACIL

Framework for Analysis of Class-Incremental Learning with 12 state-of-the-art methods and 3 baselines.
https://arxiv.org/pdf/2010.15277.pdf
MIT License

Is task-agnostic accuracy correct? #8

Closed ashok-arjun closed 2 years ago

ashok-arjun commented 2 years ago

Hi,

I'm not sure if the task-agnostic accuracy is correctly calculated.

I see that in network.py line 44, you are creating a new Linear layer with nn.Linear(self.out_size, num_outputs) for each task.

This indicates that each task gets its own separate Linear layer, each with the same capacity.

For task-agnostic accuracy, I see that the outputs of all heads are combined, then softmax is taken.
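
To check my understanding, here is my own rough sketch of what I think the task-agnostic evaluation amounts to (the names and class counts are mine, not the ones used in FACIL):

import torch

# Hypothetical multi-head setup: one Linear head per task (names/sizes are illustrative)
feat_dim = 64
heads = torch.nn.ModuleList([torch.nn.Linear(feat_dim, 10),   # task 1 head
                             torch.nn.Linear(feat_dim, 10)])  # task 2 head

features = torch.rand(8, feat_dim)                            # features from the shared backbone
all_logits = torch.cat([h(features) for h in heads], dim=1)   # concatenate all heads: shape (8, 20)
probs = torch.softmax(all_logits, dim=1)                      # softmax over all classes seen so far
pred = probs.argmax(dim=1)                                    # task-agnostic prediction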

But many papers use only one head for all tasks when reporting task-agnostic accuracy.

I am aware that, when training, they only backpropagate through the logits of the classes belonging to the current task, and what is done here is equivalent to that in a way.
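
For reference, what I mean by that is roughly the following (again, my own sketch with made-up names, not the repo's code):

import torch

t = 1                                          # current task index (illustrative)
feat_dim = 64
heads = torch.nn.ModuleList([torch.nn.Linear(feat_dim, 10),
                             torch.nn.Linear(feat_dim, 10)])
features = torch.rand(8, feat_dim)             # backbone features for a batch of task t
targets = torch.randint(0, 10, (8,))           # labels re-indexed within task t

logits_t = heads[t](features)                  # only the current task's head is used
loss = torch.nn.functional.cross_entropy(logits_t, targets)
loss.backward()                                # so only heads[t] (and the backbone) receive gradients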

May I please know why that is not followed here?

Thank you.

mkmenta commented 2 years ago

Hi, thanks for your question!

Having a single nn.Linear with all the classes of all tasks and having multiple nn.Linear layers whose outputs are then concatenated are equivalent implementations:

import torch

fcbig = torch.nn.Linear(64, 50)
fcsmall1 = torch.nn.Linear(64, 30)
fcsmall2 = torch.nn.Linear(64, 20)

# Make weight and bias values equal for both cases
with torch.no_grad():
    fcsmall1.weight.copy_(fcbig.weight[:30, :])
    fcsmall1.bias.copy_(fcbig.bias[:30])
    fcsmall2.weight.copy_(fcbig.weight[30:, :])
    fcsmall2.bias.copy_(fcbig.bias[30:])

# Forward
x = torch.rand(64)
y1 = fcbig(x)
y2 = torch.cat((fcsmall1(x), fcsmall2(x)))

print((y1 == y2).all())

You should see tensor(True) as the output when running this code.

As you say, many papers use the single nn.Linear implementation for all tasks from the beginning. However, in order to do that, you would need to know the number of classes of each task beforehand, when starting task 1. In other words, you would be accessing information from the "future", which breaks part of the idea of Continual Learning (although in practice both give the same results).

For this reason, we preferred the second implementation because it mimics a more realistic case: whenever a new task arrives, the final layer is extended to allow learning the new classes, without needing to know how many tasks or classes the model will see in the future.
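
Roughly, the idea looks something like this (a simplified sketch for illustration, not the exact FACIL code; the class name and sizes are made up):

import torch

class GrowingHeadNet(torch.nn.Module):
    """Shared backbone plus one Linear head per task."""
    def __init__(self, backbone, out_size):
        super().__init__()
        self.backbone = backbone
        self.out_size = out_size
        self.heads = torch.nn.ModuleList()

    def add_head(self, num_outputs):
        # Called whenever a new task arrives: extend the classifier
        # without knowing anything about future tasks or classes
        self.heads.append(torch.nn.Linear(self.out_size, num_outputs))

    def forward(self, x):
        feats = self.backbone(x)
        # Concatenating the heads is equivalent to one big Linear (see the snippet above)
        return torch.cat([head(feats) for head in self.heads], dim=1)

net = GrowingHeadNet(torch.nn.Identity(), out_size=64)
net.add_head(10)                      # task 1: 10 classes
net.add_head(5)                       # task 2: 5 classes
print(net(torch.rand(2, 64)).shape)   # torch.Size([2, 15])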

I hope I have made myself clear!

ashok-arjun commented 2 years ago

Thanks a lot for the detailed reply @mkmenta.

That definitely makes sense, and I understand why this implementation is better and more scalable.