Kin-Zhang opened 1 year ago
The reason I raised this question is that I noticed the dataset size for MILE is really large compared with TransFuser and LAV (about 10x larger). As far as I know, the dataset sizes in [CVPR'21 TransFuser], [TPAMI'22 TransFuser], [CVPR'22 LAV], and [arXiv'22 TCP] are all around 200-400K frames. [1M = 1,000K]
This is the same question I asked on InterFuser here: https://github.com/opendilab/InterFuser/issues/3. The large dataset size makes it unclear whether it is the model or the large amount of data that brings the performance boost.
What versions of the LAV and TransFuser (etc.) weights did you use in the paper?
I can answer your first question.
It's neither: the numbers for TF, LBC, and CILRS are copied from the CVPR TransFuser paper (Town05 Long benchmark).
And the LAV number is copied from the LAV paper (LAV benchmark).
They do mention that in their paper:
> For MILE, Roach, and the Expert, we report all the metrics defined in Section 4. For all the other methods we report the results from the original papers.
Since the numbers are from 3 different benchmarks (MILE is evaluated on their new benchmark), they are not really comparable. I emailed them about it and they said they will release an arXiv v2 that fixes the problem.
Thanks! @Kait0 I see.
Then only one question remains: how can we prove/analyse whether it is the method (MILE) or the large amount of data that brings the performance boost?
Thanks @Kait0 for answering the first question! The updated version of the paper will be available tomorrow on arxiv.
As for the discussion around data:
If we compare the number of frames in each dataset, ours does look larger than the other methods'. When we look at the number of hours of driving data, however, all the methods have roughly the same amount: LAV (28 hours, or 400k frames at 4Hz), TransFuser (31 hours, or 220k at 2Hz), TCP (60 hours, or 400k at 2Hz), and ours (32 hours, or 2.9M at 25Hz). We actually ran an ablation on dataset size and found that performance did not change between 8 and 32 hours of data, and started degrading below 8 hours. So it's important to have a reasonable amount of data for the model to generalise well, but beyond that threshold it doesn't matter so much.
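The frame counts above follow directly from hours × seconds-per-hour × collection frequency. A quick sanity check of the numbers quoted in this thread (a sketch; the tuples are just the figures stated above):

```python
# Sanity check: frames = hours * 3600 s/h * collection frequency (Hz).
# All numbers are taken from the discussion above.
datasets = {
    "LAV":        (28, 4),   # (hours, Hz) -> ~400k frames claimed
    "TransFuser": (31, 2),   # ~220k frames claimed
    "TCP":        (60, 2),   # ~400k frames claimed
    "MILE":       (32, 25),  # ~2.9M frames claimed
}

for name, (hours, hz) in datasets.items():
    frames = hours * 3600 * hz
    print(f"{name:10s} {hours:3d} h @ {hz:2d} Hz -> {frames:,} frames")
```

The computed values (403,200 / 223,200 / 432,000 / 2,880,000) match the rounded figures in the post, so the 10x frame-count gap really is explained by recording frequency alone.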
I see. Thanks for the reply.
How about controlling the total number of frames rather than the hours? Since training usually shuffles the dataset, would you also try controlling the total frame count in the ablation study? I think randomly selecting a fixed total number of frames would be enough to probe the reason.
Or the question could be: why not also collect at 2Hz, which would keep the frame count from being so large and cover more diverse scenarios? More near-duplicate data may undermine your point about generalisation.
And there is one more question: what is the online leaderboard result for MILE?
It is probably possible to train the same model with fewer frames by reducing the video frequency (25Hz -> 5Hz for example).
The online leaderboard is in preparation.
I see. Please leave the issue open (thanks), and we can wait to see if someone experiments with controlling the total number of frames.
Besides, here is the question mentioned above: why not also collect at 2Hz? A dataset of more similar frames (25Hz collection) may undermine your point about generalisation.
Is there any reason MILE uses the high 25Hz frequency, unlike the other methods? And how do you make sure such a high collection frequency does not overfit on a large dataset of similar frames (the generalisation problem you mentioned)?
There is no particular reason for 25Hz, and the frequency can probably be set to something lower.
Their model uses 12 temporal frames during training (I think it has a recurrent component inside). With such a model, sub-sampling the training data (as all the single-frame models do) might not be a good strategy, since you would also need to increase the spacing between the frames the model sees at inference.
I also thought 25 Hz was an odd number, as most works set the CARLA simulator frequency to 20 Hz (the default in the leaderboard client). That might make comparisons with other work harder.
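The stride point above can be sketched concretely. `sequence_indices` is a hypothetical helper (not part of the MILE codebase) showing that to emulate a lower-frequency clip from a high-frequency recording, the frame stride must grow so the wall-clock window the model sees stays the same:

```python
# Hypothetical sketch: sub-sampling a high-frequency recording for a model
# that consumes a fixed number of temporal frames. Keeping the wall-clock
# window unchanged means the stride must scale with the frequency ratio.
def sequence_indices(start, n_frames, src_hz, target_hz):
    """Indices into a src_hz recording that emulate a target_hz clip."""
    assert src_hz % target_hz == 0, "frequency ratio must be an integer"
    stride = src_hz // target_hz
    return [start + i * stride for i in range(n_frames)]

# 12 frames at an effective 5 Hz, drawn from a 25 Hz recording,
# span (12 - 1) / 5 = 2.2 s of driving:
print(sequence_indices(0, 12, src_hz=25, target_hz=5))
# -> [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
```

This is why naive frame dropping is not equivalent for a 12-frame recurrent model: without adjusting the stride at inference time, the temporal context the model sees would shrink.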
> LAV (28 hours, or 400k frames at 4Hz), TransFuser (31 hours, or 220k at 2Hz), TCP (60 hours, or 400k at 2Hz), and ours (32 hours, or 2.9M at 25Hz). [We] found that performance did not change between 8 and 32 hours of data, and started degrading below 8 hours. So it's important to have a reasonable amount of data for the model to generalise well, but beyond that threshold it doesn't matter so much.
Yes, it still seems strange to me, and the hours-based argument about the dataset doesn't convince me, especially after the author's response on hours, frames, and frequency. A dataset of more similar frames (25Hz collection) may undermine your point about generalisation. It leaves me even more confused about how to make sure such a high collection frequency does not overfit on a large dataset of similar frames (the generalisation problem you mentioned).
Maybe we can leave this question (issue) to future experiments; it can also serve as a reminder. Let's see...
Anyway, thanks for contributing your code to the community. I believe it makes it easy for us to experiment with the things we are confused about.
Maybe one interesting detail to point out: in the paper they train for 50,000 iterations with batch size 64 (64 × 50,000 ≈ 3.2M samples), which would imply that the method was trained for only one epoch (maybe the authors can confirm).
If true, this would be similar to training for roughly 10 epochs on data stored at 2 FPS, except that instead of training on the same images multiple times, you train on slightly augmented versions of them (in the sense that the vehicle has moved a little). So training on more densely sampled data can perhaps be understood as a form of data augmentation.
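The one-epoch arithmetic is easy to verify; the 2 Hz comparison below is my own extrapolation from the numbers in this thread, not a figure from the paper:

```python
# Training budget quoted in the paper: 50,000 iterations at batch size 64.
batch_size = 64
iterations = 50_000
samples_seen = batch_size * iterations       # 3,200,000 samples

dataset_frames = 2_900_000                   # MILE dataset at 25 Hz
epochs_25hz = samples_seen / dataset_frames
print(f"epochs at 25 Hz: {epochs_25hz:.2f}")  # ~1.10: roughly one epoch

# The same 32 hours stored at 2 Hz would contain far fewer unique frames,
# so the same budget would correspond to many more passes over the data
# (on the order of the ~10-epoch estimate in the comment above):
frames_2hz = dataset_frames * 2 // 25        # 232,000 frames
epochs_2hz = samples_seen / frames_2hz
print(f"epochs at 2 Hz:  {epochs_2hz:.1f}")   # ~13.8
```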
That's correct, the model was trained for a single epoch. And I agree with the subsequent analysis.
I see. That makes sense. Thanks @Kait0 and @anthonyhu
I will (or maybe someone else who is interested) attach more ablation/comparison result tables once I have time. (A flag I may never fulfil, hahaha.)
Thanks for your work. I'm wondering whether you have submitted the agent to the online leaderboard. What score does it achieve?
For Table I in the paper, screenshot here:![image](https://user-images.githubusercontent.com/35365764/199226670-dbaed6fe-af49-4c85-ae39-6d63d4bf8172.png)
What versions of the LAV and TransFuser (etc.) weights did you use in the paper? Did you retrain them on your own dataset, or use the pre-trained weights the authors provided directly? If the latter, which versions of the weight and config files? I ask because, as the paper says, you use a 2.9M-frame dataset for MILE.
Update (2022/11/1): adding the discussion link here: https://github.com/Kin-Zhang/carla-expert/discussions/4