motional / nuplan-devkit

The devkit of the nuPlan dataset.
https://www.nuplan.org

Scenarios number for training in the full training dataset #269

Closed bulbcult closed 1 year ago

bulbcult commented 1 year ago

As indicated in the "Fact Sheet" of the nuPlan dataset:

> Data samples
> - Total data points (lidarpcs): ~95 million
> - Total scenarios tagged: ~3 million

Regarding this, I have two questions:

  1. Does it mean that there are 3 million scenarios in the full training dataset?

  2. For the training configuration, we need to choose "the number of scenarios to train with" via `scenario_filter.limit_total_scenarios`. How should I choose this number? I previously attempted to use 1 million scenarios, but it caused my CPU to freeze. Currently I am using 10,000 scenarios, but I am concerned this may be too small a sample if the answer to my first question is positive.

Thank you for any helpful insights!

patk-motional commented 1 year ago

Hi @bulbcult,

  1. The keyword here is "tagged". Of the ~95 million data points, 3 million are tagged with scenario types, e.g. `near_multiple_vehicles` and `following_lane_with_lead`; the rest are tagged `UNKNOWN`. Therefore, theoretically speaking, you have access to 95 million data points to train on. The data is split into train/test/val sets. You should only train on the ~~test~~ train set.
  2. I went into a bit of detail on this in a previous issue. The takeaway is that you should use `scenario_filter.timestamp_threshold_s` to make sure you are not getting samples that are too temporally close to each other. See our FAQ on this.
  3. (More like 2a.) 10,000 training samples is relatively small. In our experience, you should use at least 300,000 data points to see decent performance. To do this, try caching the data first and training on the cached data. Check out the advanced training tutorial for more details.
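For reference, the cache-then-train workflow described above might look roughly like the following with the devkit's Hydra-based training entry point. This is a sketch, not an official recipe: the script path, cache path, and exact override names (`py_func`, `cache.cache_path`, `cache.use_cache_without_dataset`, `scenario_filter.*`) should be verified against your devkit version and its tutorials.

```shell
# Step 1 (assumed workflow): pre-compute feature caches for a filtered set of scenarios,
# spacing samples apart in time to avoid near-duplicate frames.
python nuplan_devkit/nuplan/planning/script/run_training.py \
    py_func=cache \
    cache.cache_path=/path/to/feature_cache \
    scenario_filter.limit_total_scenarios=300000 \
    scenario_filter.timestamp_threshold_s=5.0

# Step 2 (assumed workflow): train directly from the pre-built cache,
# skipping expensive on-the-fly feature computation.
python nuplan_devkit/nuplan/planning/script/run_training.py \
    py_func=train \
    cache.cache_path=/path/to/feature_cache \
    cache.use_cache_without_dataset=true
```

Splitting the work this way means the CPU-heavy feature extraction happens once up front, which can help avoid the kind of freeze reported above when requesting very large scenario counts.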
bulbcult commented 1 year ago

Thank uuuuu! This is extremely helpful!

bAmpT commented 1 year ago

Hi @bulbcult,

  1. The keyword here is "tagged". Of the 95 million data points, 3 million are tagged with scenario types e.g. near_multiple_vehicles and following_lane_with_lead. The rest are tagged with UNKNOWN. Therefore theoretically speaking you have access to 95 million data points to train on. The data is split into train/test/val sets. You should only train on the test set.

@patk-motional can you clarify why training should be done only on the 'test' set?

patk-motional commented 1 year ago

Hi @bAmpT,

We are simply following good ML practice by separating the data into three sets: a training set, a validation set, and a test set. We provide the dataset splits so that you won't have to split the data manually. By training on just the ~~test~~ train set, you can avoid overfitting your model to the data. Quoting from this link:

> - **Training Dataset:** The sample of data used to fit the model.
> - **Validation Dataset:** The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
> - **Test Dataset:** The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
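As a generic illustration of those three roles (not nuPlan-specific; the devkit already ships with predefined splits, so you would not do this yourself), a deterministic three-way split might look like:

```python
import random


def three_way_split(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle samples deterministically and split them into
    train / validation / test subsets by fraction."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]                  # used to fit the model
    val = shuffled[n_train:n_train + n_val]     # used to tune hyperparameters
    test = shuffled[n_train + n_val:]           # held out for final evaluation
    return train, val, test


train, val, test = three_way_split(range(100))
```

The fractions here (70/15/15) are an arbitrary example; the point is only that the three subsets are disjoint and that the model is fit on the train subset alone.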

bAmpT commented 1 year ago

Do you mean by evaluating the fitted model on the test set you can avoid overfitting the model to the data? As far as I know the test set is not used to train/fit the model to the data, but rather used to evaluate the performance of the model, thus ensuring the model is not overfitting to the train split.

patk-motional commented 1 year ago

Sorry, that's a typo. I meant train set.