gsig / actor-observer

ActorObserverNet code in PyTorch from "Actor and Observer: Joint Modeling of First and Third-Person Videos", CVPR 2018
GNU General Public License v3.0

What's the version of dataset? #7

Open · vana77 opened this issue 5 years ago

vana77 commented 5 years ago

Hi, I noticed that you use version 0 under the folder datasets/labels, but when I download the CharadesEgo dataset I get the version 1 labels. Which version did you use to get the results in the paper? Thanks.

gsig commented 5 years ago

Good question.

I looked into it, and there does seem to be a mistake in which version of the dataset was used where. The mistake stems from the egocentric test data being a separate parameter in the code: when the new dataset (v1) was ready and we reran all the methods for the camera-ready, evaluation was likely still running on the v0 version. This affects the "transfer learning" results (Table 3 in the ActorObserver paper) and the "egocentric baselines" results (Table 2 in the CharadesEgo paper).
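For illustration, here is a minimal sketch of how that can happen (not the actual training code; the flag names and file paths are hypothetical):

```python
import argparse

parser = argparse.ArgumentParser()
# The training labels were updated when the v1 dataset was released...
parser.add_argument('--train-data',
                    default='datasets/labels/CharadesEgo_v1_train.csv')
# ...but the egocentric test data is a separate parameter, and its
# default was never bumped, so evaluation silently stays on v0.
parser.add_argument('--egocentric-test-data',
                    default='datasets/labels/CharadesEgo_v0_test.csv')
args = parser.parse_args()

print('training on:  ', args.train_data)
print('evaluating on:', args.egocentric_test_data)
```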

I'll try to outline below what I've discovered and how it will be clarified. However, Charades_v1 (the one on the website) should be used everywhere from now on, and any discrepancy with prior work should be noted where applicable.

Analysis:

Actor and Observer: Joint Modeling of First and Third-Person Videos

It looks like the numbers in Table 3 were run on Charades_v0, so I reran these experiments on every combination of training and evaluation versions (v0/v1 train crossed with v0/v1 test).
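As a sketch of what that rerun grid looks like (train_model and evaluate_model are placeholders for the repo's actual training and evaluation entry points, and the file names are assumptions):

```python
import itertools

def train_model(train_labels):
    # Placeholder for the repo's actual training entry point.
    return {'train_labels': train_labels}

def evaluate_model(model, test_labels):
    # Placeholder for the repo's actual evaluation entry point;
    # the real code would return a video-level mAP.
    return 0.0

# Train and evaluate on every (train_version, eval_version) pair.
for train_v, eval_v in itertools.product(['v0', 'v1'], repeat=2):
    model = train_model(f'datasets/labels/CharadesEgo_{train_v}_train.csv')
    score = evaluate_model(model, f'datasets/labels/CharadesEgo_{eval_v}_test.csv')
    print(f'train={train_v} eval={eval_v} mAP={score:.1f}')
```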

What this means: if you are comparing with the ActorObserver paper on CharadesEgo_v1, the 25.9% number in Table 3 is invalid, because it was evaluated on Charades_v0.

How it will be fixed: I'll recalculate the columns of Table 3 for Charades_v1 and release an Errata on the project webpage https://github.com/gsig/actor-observer

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

In Table 2, some of the baselines use Charades_v0.

How it will be fixed: I'll release a new version of the arXiv paper with Table 2 fixed.

Difference between Charades_v1 and Charades_v0

The most puzzling/unexpected thing here is the performance difference between training and testing on different versions of the dataset. I did some preliminary analysis of the two datasets to try to explain the gap, but nothing I checked accounts for it.
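For reference, the kind of check I mean looks roughly like this (a sketch assuming the labels ship as CSVs with 'id' and 'actions' columns, as in the Charades annotation format; the file names are assumptions):

```python
import pandas as pd

v0 = pd.read_csv('datasets/labels/CharadesEgo_v0_test.csv')
v1 = pd.read_csv('datasets/labels/CharadesEgo_v1_test.csv')

# Which test videos appear in only one version?
ids0, ids1 = set(v0['id']), set(v1['id'])
print('only in v0:', len(ids0 - ids1), '| only in v1:', len(ids1 - ids0))

# Rough per-video annotation counts (the 'actions' field is ';'-separated).
for name, df in [('v0', v0), ('v1', v1)]:
    counts = df['actions'].fillna('').apply(
        lambda s: len(s.split(';')) if s else 0)
    print(name, 'mean actions per video: %.2f' % counts.mean())
```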

So in conclusion, Charades_v0 seems to have had a particularly "easy" train/test split, and it is not clear to me why, other than random chance.

Moving forward, it should be sufficient to explain any discrepancy between your work and prior work by referring to the Errata in this repository or the updated arXiv paper.

I'll keep posting updates about the process of clarifying this. Let me know if there is anything I can do to help, or if you have any questions. Also, if you (or anyone else) have any observations or insight into this, definitely let us know.

Best, Gunnar

vana77 commented 5 years ago

Thank you very much for your detailed reply.

lyttonhao commented 4 years ago

I am wondering if you already have the updated results for the corresponding tables somewhere? Thanks!