Closed bulbcult closed 1 year ago
Hi @bulbcult,

The keyword here is "tagged". Of the 95 million data points, 3 million are tagged with scenario types, e.g. near_multiple_vehicles and following_lane_with_lead. The rest are tagged with UNKNOWN. Therefore, theoretically speaking, you have access to 95 million data points to train on. The data is split into train/test/val sets. You should only train on the test set. Use scenario_filter.timestamp_threshold_s to make sure you are not getting samples that are too temporally close to each other. Here is our FAQ on this.

Thank uuuuu! This is extremely helpful!
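The temporal-spacing idea behind scenario_filter.timestamp_threshold_s mentioned above can be sketched as follows. This is a hypothetical illustration of the filtering intent, not the actual nuplan-devkit implementation:

```python
def filter_by_timestamp(timestamps_s, threshold_s):
    """Keep a sample only if it is at least threshold_s seconds
    after the most recently kept sample (greedy temporal thinning)."""
    kept = []
    last_kept = None
    for t in sorted(timestamps_s):
        if last_kept is None or t - last_kept >= threshold_s:
            kept.append(t)
            last_kept = t
    return kept

# Samples logged every 0.5 s; keep at most one every 2 s.
print(filter_by_timestamp([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0], 2.0))
# → [0.0, 2.0, 4.0]
```

This is why a large raw sample count shrinks considerably once a temporal threshold is applied: nearby frames from the same log are highly correlated and add little new information.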
Hi @bulbcult,

- The keyword here is "tagged". Of the 95 million data points, 3 million are tagged with scenario types, e.g. near_multiple_vehicles and following_lane_with_lead. The rest are tagged with UNKNOWN. Therefore, theoretically speaking, you have access to 95 million data points to train on. The data is split into train/test/val sets. You should only train on the test set.
@patk-motional can you clarify why training should be done only on the 'test' set?
Hi @bAmpT,
We are simply following good ML practice by separating the data into three sets: a training set, a validation set, and a test set. We provide the dataset splits so that you won't have to split the data manually. By training on only the test set, you can avoid overfitting your model to the data. Quoting from this link:
Training Dataset: The sample of data used to fit the model.
Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
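The three roles described above can be illustrated with a minimal split sketch. The proportions and helper below are hypothetical; nuPlan ships predefined train/val/test splits, so in practice you would use those rather than splitting by hand:

```python
import random

def train_val_test_split(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle items and partition them into disjoint train/val/test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # → 80 10 10
```

The key property is disjointness: a sample used to fit the model must never appear in the set used to evaluate it, otherwise the evaluation is biased.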
Do you mean by evaluating the fitted model on the test set you can avoid overfitting the model to the data? As far as I know the test set is not used to train/fit the model to the data, but rather used to evaluate the performance of the model, thus ensuring the model is not overfitting to the train split.
Sorry, that's a typo. I meant train set.
As indicated in the "Fact Sheet" of the nuPlan dataset:

Data samples:
● Total data points (lidarpcs): ~95 million
● Total scenarios tagged: ~3 million
Regarding this, I have two questions:
1. Does it mean that there are 3 million scenarios in the full training dataset?
2. For the training configuration, we need to choose "the number of scenarios to train with", namely scenario_filter.limit_total_scenario. How should I choose this number? I previously attempted to use 1 million scenarios, but it caused my CPU to freeze. Currently, I am using 10,000 scenarios, but I am concerned this may be too small a sample if the answer to my first question is positive.

Thank you for any helpful insights!
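For readers hitting the same question: the effect of a scenario cap like the one named above can be sketched as a random subsample. This is a hypothetical illustration of capping behavior, not the devkit's actual filter code, and the function name is made up for the example:

```python
import random

def cap_scenarios(scenarios, limit, seed=0):
    """Return at most `limit` scenarios, chosen uniformly at random.
    Sketch of what a limit-total-scenarios style filter does."""
    scenarios = list(scenarios)
    if limit >= len(scenarios):
        return scenarios
    return random.Random(seed).sample(scenarios, limit)

subset = cap_scenarios(range(1_000_000), 10_000)
print(len(subset))  # → 10000
```

A practical approach is to start with a cap small enough for your hardware, confirm the training pipeline runs end to end, and then raise the cap until memory or throughput becomes the bottleneck.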