How does the preprocessing split scenarios in each db file?

CrisCloseTheDoor commented 1 year ago

Hi developers, I would like to understand how the preprocessing program splits each raw .db data file into scenarios. I have two questions:

1. In the cache dir, each scenario type in one log has several to a dozen scenario tokens, but the unknown type has thousands of scenario tokens. Why are they far more numerous than any specified type? Are they actually duplicates of the scenarios in the types above?
2. Each .db log can yield thousands of scenarios if limit_total_scenarios=Null. Do these scenarios overlap each other, or do they follow one after another in time? (This is similar to question 1.) If they overlap, how is the overlap determined? Can we change a setting so that they follow one after another, to make the data more sparse while keeping scenario variation?

Thank you.

HiokHianOng commented 1 year ago

Hi,

Here are the clarifications regarding your 2 questions:

1. The unknown scenario types are more numerous than the labelled scenario types due to the labelling frequency imposed. Scenarios that are unknown can be any of the scenario types; they are simply unlabelled. They are not duplicates of the labelled scenarios, since each individual scenario has a different initial lidar timestamp. Additionally, a single scenario can be of multiple scenario types, and we are currently working on returning the full set of scenario types for each scenario in the form of metadata to allow for more flexibility in training.
2. Yes, scenarios are recorded at 20Hz, so if limit_total_scenarios is not set, then scenarios will overlap each other temporally. In order to make the data more sparse while keeping scenario variation, we can set, for example, scenario_filter.timestamp_threshold_s=5.0, which will ensure that the initial timestamps of all scenarios within a scenario type are at least 5 seconds apart from each other.
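
To illustrate what a timestamp threshold does to densely sampled scenarios, here is a minimal sketch. It is not the devkit's implementation: the function name `thin_by_timestamp`, the `(scenario_type, timestamp)` tuple layout, and the use of microsecond timestamps are all assumptions made for the example. It only shows the idea of keeping initial timestamps at least a given number of seconds apart within each scenario type.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical representation: (scenario_type, initial_timestamp_us).
Scenario = Tuple[str, int]


def thin_by_timestamp(scenarios: Iterable[Scenario], threshold_s: float) -> List[Scenario]:
    """Greedily keep a scenario only if its initial timestamp is at least
    threshold_s later than the last kept scenario of the same type."""
    threshold_us = int(threshold_s * 1_000_000)  # assumed unit: microseconds
    last_kept: Dict[str, int] = {}
    kept: List[Scenario] = []
    # Walk through the scenarios in temporal order.
    for scenario_type, t_us in sorted(scenarios, key=lambda s: s[1]):
        previous = last_kept.get(scenario_type)
        if previous is None or t_us - previous >= threshold_us:
            kept.append((scenario_type, t_us))
            last_kept[scenario_type] = t_us
    return kept


# 60 seconds of a log sampled at 20Hz: initial timestamps 50,000 us apart.
dense = [("unknown", i * 50_000) for i in range(1200)]
sparse = thin_by_timestamp(dense, threshold_s=5.0)
print(len(dense), len(sparse))
```

At 20Hz, consecutive initial timestamps are 0.05 seconds apart, so a 5 second threshold keeps at most one in every 100 scenarios of a given type in this sketch.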

CrisCloseTheDoor commented 1 year ago

Thanks

motional / nuplan-devkit

How does the preprocessing split scenarios in each db file? #163