mmasana / FACIL

Framework for Analysis of Class-Incremental Learning with 12 state-of-the-art methods and 3 baselines.
https://arxiv.org/pdf/2010.15277.pdf
MIT License

How to understand the results? #11

Closed · maawais closed this 2 years ago

maawais commented 2 years ago

I have added 2 classes at each task for a total of 5 tasks, i.e. 5*2=10 classes in total. Below are the results.

[image: the 4 result tables]

I have some confusion in understanding the results:

  1. What is the difference between task-agnostic and task-aware accuracies?
  2. The last row of each of the 4 tables presents the average accuracy after the full training. So, is this what should be plotted for comparison with other methods, e.g. the plot below from LwM? For example, in the case of 10 tasks over 100 classes, should the last row of the TAw Acc table be plotted?
     [image: accuracy plot from the LwM paper]
mmasana commented 2 years ago

Hi @maawais,

Let me see if I can help clear some confusion :)

> What is the difference between task-agnostic and task-aware accuracies?

Task-aware accuracy is when you know the task at test time. This is the same setting as task-IL (you can check this survey), and in your scenario with only 2 classes per task it means that random accuracy would be 50%, since you know the task and only have to guess the class within it. Task-agnostic accuracy is the more challenging setting, where you don't know the task at test time and have to guess among all classes learned so far (you can check the survey from this code, see Fig. 1 and Sec. 2). Those two metrics are implemented here, in the calculate_metrics method; however, those are per-batch metrics. The global metrics are here, in the main script, where we store them after each task is learned/trained.
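In case a quick illustration helps, here is a minimal sketch of the two settings. It is not the repository's exact `calculate_metrics` code; it assumes the model returns one logits tensor per task head, and the function and argument names (`taw_tag_hits`, `task_offsets`) are made up for this example:

```python
import torch

def taw_tag_hits(outputs, targets, task_offsets):
    """Per-batch task-aware vs task-agnostic hits.

    outputs      : list with one logits tensor per task head, each of shape (B, classes_in_task)
    targets      : (B,) global class labels
    task_offsets : first global class index of each task, e.g. [0, 2, 4, 6, 8]
    """
    # Task-aware: the task of each sample is known, so we only take the
    # argmax inside that task's head and shift it back to a global label.
    pred_taw = torch.zeros_like(targets)
    for i, t in enumerate(targets):
        task = max(k for k, off in enumerate(task_offsets) if t >= off)
        pred_taw[i] = outputs[task][i].argmax() + task_offsets[task]
    hits_taw = (pred_taw == targets).float()

    # Task-agnostic: the task is unknown, so the argmax is taken over all
    # classes learned so far (all heads concatenated).
    pred_tag = torch.cat(outputs, dim=1).argmax(dim=1)
    hits_tag = (pred_tag == targets).float()
    return hits_taw, hits_tag
```

Averaging these hits over the test samples of each task is what fills the corresponding cell of the accuracy tables.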

> The last row of each of the 4 tables presents the average accuracy after the full training. So, is this what should be plotted for comparison with other methods, e.g. the plot below from LwM? For example, in the case of 10 tasks over 100 classes, should the last row of the TAw Acc table be plotted?

Four per-task metrics are provided (2 accuracies, 2 forgettings), one table each. In each table, each row shows the metric for every task learned so far (that's why the upper triangle is filled with zeros: those tasks do not exist yet at that point in time). To compare with other methods, you first have to decide whether you want to compare on task-aware or task-agnostic accuracy, and then you can read the results in two classic ways. The first is to report the accuracy per task or, more commonly, its average (if the tasks have different numbers of classes, make sure you are using the weighted average). The second, more similar to the graphs you show, is the average accuracy over time, which corresponds to the last column: you plot the average accuracy over all classes learned so far after each incremental step. The example from LwM that you provide is task-aware accuracy (as far as I remember) plotted with the second option.
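To make the second option concrete, here is a small sketch of how the accuracy-over-time curve can be extracted from such an accuracy matrix. The matrix values below are placeholders (not real results) and the variable names are only illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

# acc[t, u] = accuracy on task u after training on task t, as in the TAw/TAg Acc
# tables above (lower-triangular; the upper triangle is zero). Placeholder values:
num_tasks = 5
acc = np.tril(np.random.rand(num_tasks, num_tasks))
classes_per_task = np.array([2] * num_tasks)  # 2 classes per task, as in this issue

# Average accuracy after each task, weighted by the number of classes per task
# (a plain mean is equivalent when all tasks have the same size).
avg_acc = [np.average(acc[t, :t + 1], weights=classes_per_task[:t + 1])
           for t in range(num_tasks)]

# This curve is usually plotted against the number of classes (or tasks) learned
# so far, like the LwM-style figure in the question.
plt.plot(np.cumsum(classes_per_task), avg_acc, marker='o')
plt.xlabel('classes learned')
plt.ylabel('average accuracy')
plt.show()
```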

Hope this helps!

maawais commented 2 years ago

Thank you for the detailed explanation. I knew that LwM shows the average accuracy in the plots; however, I was confused about how to get this accuracy from the code. Now I have the answer. Thanks, highly appreciated.

maawais commented 2 years ago

> The example from LwM that you provide is task-aware accuracy (as far as I remember) plotted with the second option.

I guess that in LwM, the top-1 accuracy is task-agnostic (in the results figure provided in the question above), because the test classes are not known at test time and have to be guessed from all the classes learned so far.

mmasana commented 2 years ago

> I guess that in LwM, the top-1 accuracy is task-agnostic (in the results figure provided in the question above), because the test classes are not known at test time and have to be guessed from all the classes learned so far.

I just checked and it is task-aware (the results would be "too good" for task-agnostic). The test classes are guessed only from the ones within the task to which the class belongs, not from all learned classes. In the literature this is sometimes described as multi-headed, although I personally don't like that term due to the confusion it can generate. This is already introduced on page 2 here:

> Task incremental (TI) methods: In this problem, a model trained to perform object classification on a specific dataset is incrementally trained to classify objects in a new dataset. A key characteristic of these experiments is that during evaluation, the final model is tested on different datasets (base and incrementally learned) separately. This is known as multi-headed evaluation [4]. In such an evaluation, the classes belonging to two different tasks have no chance to confuse with one another.