wayveai / mile

PyTorch code for the paper "Model-Based Imitation Learning for Urban Driving".

Online leaderboard result and question on comparison settings for other methods #4

Open Kin-Zhang opened 1 year ago

Kin-Zhang commented 1 year ago

Thanks for your work. I'm wondering whether you submitted the agent to the online leaderboard. What score does it achieve?

For Table I in the paper, screenshot attached: [image]

Which versions of the LAV and TransFuser (etc.) weights did you use in the paper? Did you retrain them on your own dataset, or did you directly use the pre-trained weights provided by the authors? In either case, which versions of the weight and config files did you use?

I ask because, as the paper states, MILE uses a 2.9M-frame dataset.

Update: 2022/11/1, adding the discussion link here: https://github.com/Kin-Zhang/carla-expert/discussions/4

Kin-Zhang commented 1 year ago

The reason I raised this question is that the dataset size for MILE is really large compared with TransFuser and LAV (roughly 10x larger). As far as I know, the dataset sizes in [CVPR'21 TransFuser], [TPAMI'22 TransFuser], [CVPR'22 LAV], and [arXiv'22 TCP] are all around 200-400K frames. [1M = 1,000K]

It's the same question I asked on InterFuser here: https://github.com/opendilab/InterFuser/issues/3. The large dataset size makes it unclear whether the performance boost comes from the model or from the large amount of data.

Kait0 commented 1 year ago

Which versions of the LAV and TransFuser (etc.) weights did you use in the paper?

I can answer your first question. It's neither: the numbers for TF, LBC, and CILRS are copied from the CVPR TransFuser paper (Town05 Long benchmark), and the LAV number is copied from the LAV paper (LAV benchmark). They do mention this in the paper: "For MILE, Roach, and the Expert, we report all the metrics defined in Section 4. For all the other methods we report the results from the original papers."

Since the numbers come from 3 different benchmarks (MILE is evaluated on their own new benchmark), they are not really comparable. I emailed the authors about it and they said they will release an arXiv v2 that fixes the problem.

Kin-Zhang commented 1 year ago

Thanks, @Kait0! I see.

Then only one question remains: how can we prove/analyse whether it is the method (MILE) or the large dataset that brings the boost in performance?

anthonyhu commented 1 year ago

Thanks @Kait0 for answering the first question! The updated version of the paper will be available on arXiv tomorrow.

As for the discussion around data:

If we compare the number of frames, our dataset does seem larger than those of the other methods. When we look at the number of hours of driving data, however, we realise that all the methods have roughly the same amount: LAV (28 hours, or 400k frames at 4Hz), TransFuser (31 hours, or 220k at 2Hz), TCP (60 hours, or 400k at 2Hz), and us (32 hours, or 2.9M at 25Hz). We actually ran an ablation on dataset size and found that performance did not change between 8 and 32 hours of data, and started degrading below 8 hours. Therefore it's important to have a reasonable amount of data for the model to generalise well, but beyond that point more data doesn't matter so much.
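As a quick sanity check of those figures, here is a minimal sketch (the numbers are simply the ones quoted above) converting frame counts and sampling frequencies into hours of driving:

```python
# Hours of driving = frames / frequency / 3600, using the figures quoted above.
datasets = {
    "LAV":        (400_000, 4),     # 400k frames at 4 Hz
    "TransFuser": (220_000, 2),     # 220k frames at 2 Hz
    "TCP":        (400_000, 2),     # 400k frames at 2 Hz
    "MILE":       (2_900_000, 25),  # 2.9M frames at 25 Hz
}

for name, (frames, hz) in datasets.items():
    hours = frames / hz / 3600
    print(f"{name:>10}: {frames:>9,} frames at {hz:>2} Hz ≈ {hours:.0f} hours")
```

This reproduces the hours above almost exactly (TCP comes out closer to ~56 hours with these frame counts).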

Kin-Zhang commented 1 year ago

I see. Thanks for the reply.

How about controlling the total number of frames rather than the hours? Since training typically shuffles the dataset anyway, would you also try controlling the total frame count in the ablation study? I think randomly selecting a fixed total number of frames would be enough to reveal the cause.

Or the question could be: why not also collect at 2Hz? That would keep the number of frames small while covering more diverse scenarios; a dataset of very similar frames may work against your point about generalisation.

And one more question: what about the online leaderboard result for MILE?

anthonyhu commented 1 year ago

It is probably possible to train the same model with fewer frames by reducing the video frequency (25Hz -> 5Hz for example).
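For illustration, a minimal sketch (hypothetical frame list, not code from this repo) of what that frequency reduction could look like: keeping every k-th frame turns a 25Hz recording into a 25/k Hz one.

```python
# Reduce the effective frequency by keeping every k-th frame of a 25 Hz recording.
SOURCE_HZ = 25
TARGET_HZ = 5
stride = SOURCE_HZ // TARGET_HZ  # keep 1 frame out of every 5

# Hypothetical, time-ordered list of recorded frame files.
frame_paths = [f"frame_{i:06d}.png" for i in range(2_900_000)]
subsampled = frame_paths[::stride]  # 2.9M frames -> 580k frames at 5 Hz

print(len(frame_paths), "->", len(subsampled))
```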

The online leaderboard submission is in preparation.

Kin-Zhang commented 1 year ago

I see. Please leave the issue open. (Thanks, and we can wait to see whether someone experiments with controlling the total number of frames.)

Besides, here is the question mentioned above again: why not also collect at 2Hz? A dataset of very similar frames (collected at 25Hz) may work against your point about generalisation.

Is there any particular reason MILE uses a high 25Hz frequency, unlike the other methods? And how do you make sure such a high collection frequency does not lead to overfitting on a large dataset of near-identical frames (the generalisation problem you mentioned)?

anthonyhu commented 1 year ago

There is no particular reason for 25Hz, and the frequency can probably be set to something lower.

Kait0 commented 1 year ago

Their model uses 12 temporal frames during training (I think it has a recurrent component inside). With such a model, sub-sampling the training data (like all the single-frame models do) might not be a good strategy, as you would also need to increase the distance between the frames the model sees at inference.

I also thought 25 Hz was a strange choice, as most works set the CARLA simulator frequency to 20 Hz (the default in the leaderboard client). It might make it harder to compare with other work.
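To make the spacing point above concrete, here is a rough sketch (a hypothetical sampler, not the repo's actual dataset code) of drawing 12-frame clips from a 25Hz sequence with a configurable stride; whatever stride is used in training also fixes the frame spacing the model expects at inference:

```python
import torch
from torch.utils.data import Dataset


class ClipDataset(Dataset):
    """Hypothetical sampler of 12-frame clips from a 25 Hz frame sequence.

    With stride=1 a clip covers roughly 0.5 s of driving; with stride=5 it
    covers roughly 2.4 s, and consecutive clip frames are 0.2 s apart, so
    the model would also need frames 0.2 s apart at inference time.
    """

    def __init__(self, frames, clip_len=12, stride=1):
        self.frames = frames        # e.g. a tensor of shape (N, C, H, W)
        self.clip_len = clip_len
        self.stride = stride

    def __len__(self):
        return len(self.frames) - self.clip_len * self.stride + 1

    def __getitem__(self, idx):
        indices = range(idx, idx + self.clip_len * self.stride, self.stride)
        return torch.stack([self.frames[i] for i in indices])  # (clip_len, C, H, W)
```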

Kin-Zhang commented 1 year ago

LAV (28 hours, or 400k frames at 4Hz), TransFuser (31 hours, or 220k at 2Hz), TCP (60 hours, or 400k at 2Hz), and us (32 hours, or 2.9M at 25Hz). [...] found that performance did not change between 8 and 32 hours of data, and started degrading below 8 hours.

Therefore it's important to have a reasonable amount of data for the model to generalise well, but beyond that point more data doesn't matter so much.

Yes, it still seems strange to me, and the hours-of-data argument doesn't fully convince me, especially after the author's response about hours, frames, and frequency. A dataset of very similar frames (collected at 25Hz) may work against your point about generalisation. It just makes me more confused about how to ensure such a high collection frequency does not lead to overfitting on a large dataset of near-identical frames (the generalisation problem you mentioned).

Maybe we can leave this question (and this issue) to future experiments; it can also serve as a reminder. Let's see...

Anyway, thanks for releasing the code to the community. I believe it will make it easy for us to experiment with the things we are still unsure about.

Kait0 commented 1 year ago

Maybe one interesting detail to point out is that, in the paper, they train for 50,000 iterations with batch size 64 (64 × 50,000 -> ~3.2M samples), which would imply that the method was only trained for about one epoch (maybe the authors can confirm).

If true, this would be similar to training for 10 epochs on data stored at 2 FPS, except that instead of training on the same images multiple times, you train on slightly augmented versions of them (in the sense that the vehicle has moved a little bit). So training on more densely sampled data could perhaps be understood as a form of data augmentation.
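A back-of-the-envelope check of that calculation, using the numbers quoted in this thread:

```python
# Samples seen during training vs. dataset size (figures from this thread).
iterations = 50_000
batch_size = 64
samples_seen = iterations * batch_size        # 3,200,000 samples

frames_at_25hz = 2_900_000                    # dataset as stored at 25 Hz
frames_at_2hz = frames_at_25hz * 2 // 25      # ~232k frames if stored at 2 Hz

print(samples_seen / frames_at_25hz)  # ~1.1 epochs over the 25 Hz dataset
print(samples_seen / frames_at_2hz)   # ~13.8 passes over an equivalent 2 Hz dataset
```

So the single-epoch reading checks out; the 2Hz equivalent comes out slightly above ten passes with these exact figures.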

anthonyhu commented 1 year ago

That's correct, the model was trained for a single epoch. And I agree with the subsequent analysis.

Kin-Zhang commented 1 year ago

That's correct, the model was trained for a single epoch. And I agree with the subsequent analysis.

I see. That makes sense. Thanks @Kait0 and @anthonyhu

I will (or maybe someone else who is interested will) attach more ablation/comparison result tables once I have time. (A flag I may never achieve, hahaha.)