motional / nuplan-devkit

The devkit of the nuPlan dataset.
https://www.nuplan.org

Implementation details of the baseline in the leaderboard. #256

Closed jchengai closed 1 year ago

jchengai commented 1 year ago

Hi Motional team, I wonder if you could reveal the implementation details of the baseline (UrbanDriver) in the leaderboard, e.g., model architecture, training config, and dataset split/augmentation.

patk-motional commented 1 year ago

Hi @jchengai,

We will share that in our documentation in the next release. In the meantime, you can find the implementation here.

bhyang commented 1 year ago

Hi @patk-motional,

Are the provided model code and configuration the same as those used for the reported UrbanDriver baseline, or were there other changes? I tried training UrbanDriver with ~250K samples and the performance was lower than the IDM policy for closed-loop reactive planning, but I'm not sure if the performance disparity is solely due to the dataset size.

If the details aren't available until the next release, is there an ETA for when that might be ready?

Thanks!

patk-motional commented 1 year ago

Hi @bhyang,

Let me connect you with @christopher-motional, who implemented and trained the baseline model. He is on leave at the moment; I'll get him to reply as soon as he is back next week.

christopher-motional commented 1 year ago

Hi @bhyang, sorry for the delayed response. Yes, the reported baseline was trained using the available model code with close to the same configuration you will find in the available config files. I believe the only deviations were using the AdamW optimizer with a slightly different learning rate from the default (I believe 1.25e-5 vs 5e-5), along with the OneCycleLR learning rate scheduler. Data augmentation was an important part of this, but that should be the same as what you see in the training config.
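For reference, a minimal PyTorch sketch of the optimizer/scheduler combination described above. Only the AdamW + OneCycleLR choice and the 1.25e-5 learning rate come from this thread; the model, total_steps, and whether 1.25e-5 is the base or peak rate are placeholder assumptions.

```python
import torch

# Hypothetical sketch only: AdamW with lr=1.25e-5 plus a OneCycleLR schedule.
# Whether 1.25e-5 is the optimizer lr or the peak of the one-cycle schedule is
# not stated above, and total_steps is a placeholder.
model = torch.nn.Linear(16, 3)  # stand-in for the actual UrbanDriver model

optimizer = torch.optim.AdamW(model.parameters(), lr=1.25e-5)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1.25e-5,       # assumed peak learning rate
    total_steps=100_000,  # placeholder: num_epochs * steps_per_epoch
)

# In the training loop, scheduler.step() is called after each optimizer.step().
```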

The baseline was trained on the full trainval dataset, subsampled at a rate of 0.1 (around 300K samples, I believe). For this baseline, the IDM policy did generally slightly outperform the ML model when evaluated in closed loop with reactive agents -- depending on how much of a disparity you're seeing, that is somewhat expected.
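To illustrate what a 0.1 subsampling rate means here (a generic sketch of the idea, not the devkit's actual scenario-filter code):

```python
from typing import List, TypeVar

T = TypeVar("T")

def subsample(samples: List[T], rate: float = 0.1) -> List[T]:
    """Keep roughly `rate` of the samples by taking every (1/rate)-th element."""
    step = max(1, round(1.0 / rate))
    return samples[::step]

# e.g. the full trainval pool subsampled at rate 0.1 leaves ~300K training samples.
```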

bhyang commented 1 year ago

Hi @christopher-motional, thanks for the clarification! I have a few follow-up questions:

Appreciate the help, thanks!

christopher-motional commented 1 year ago

Just as a quick follow-up: as I was saying, the values you see reported for the warm-up phase reflect the fact that our evaluation for this phase was done on a smaller subset of data with a reduced number of scenario types. Evaluation for the test phase will be on a larger amount of data and will not be skewed in this manner.

bhyang commented 1 year ago

@christopher-motional What was the effective batch size used? Also how long did training take approximately (both number of epochs and wall clock time)? Thanks!

christopher-motional commented 1 year ago

The effective batch size was 256 and we trained for around 50 epochs, which took around 2 days from what I remember. For what it's worth, the baseline really is more of a reference point to get people started and serves as a base for comparison. If you look at how feature extraction/data augmentation is done for this model in the devkit, you should see a number of things that could be done more efficiently, which we encourage competitors to improve on so they can train their models effectively.
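For anyone reproducing this, the "effective batch size" is the total batch size seen per optimizer step. How it was split across GPUs and gradient accumulation was not stated, so the split below is only an assumed example; only the product (256) comes from this thread.

```python
# The 32 x 8 x 1 split is an assumption for illustration; only the product (256)
# is reported above.
per_gpu_batch_size = 32
num_gpus = 8
grad_accumulation_steps = 1

effective_batch_size = per_gpu_batch_size * num_gpus * grad_accumulation_steps
assert effective_batch_size == 256
```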

rossgreer commented 1 year ago

I see that in the tutorial, 'scenario_filter.limit_total_scenarios=500'.

Thanks in advance, reading through the Issues discussion has been very helpful!

patk-motional commented 1 year ago

Hi @rossgreer,

Answering your questions in the same order: