motional / nuplan-devkit

The devkit of the nuPlan dataset.
https://www.nuplan.org
673 stars · 129 forks

Relationship between scene and scenarios #271

Open Fan-Yixuan opened 1 year ago

Fan-Yixuan commented 1 year ago

Dear motional developers,

My personal understanding is: each db file contains one log, which can be several minutes long, and each log is sliced into several scenes, each of which is 20 seconds long and has a goal_ego_pose and roadblock_ids.
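The arithmetic behind this understanding can be sanity-checked quickly (the log duration below is purely illustrative, not taken from the dataset):

```python
# Rough check of the understanding above: a several-minute log
# sliced into 20-second scenes. The 10-minute duration is an assumption.
LOG_DURATION_S = 10 * 60   # hypothetical 10-minute log
SCENE_LENGTH_S = 20        # scene length as described above

num_scenes = LOG_DURATION_S // SCENE_LENGTH_S
print(num_scenes)  # 30 scenes for a 10-minute log
```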

My question is: what is the relationship among 'scenario', 'log', and 'scene'? I see that each log corresponds to at most roughly 20 or 30 scenes, but after I finish "cache", each log yields many, many scenarios; and when I open some db files with a sqlite browser, the number of entries in scenario_tag even exceeds the number of entries in lidar_pc. How should I understand this?

Another small question: why do we need to scale parameters like the learning rate during DDP training? As I understand it, when the effective batch size is the same (GPUs * bs_per_gpu for DDP vs. bs for single-GPU training), we should use the same learning rate.

Thanks again for the great work and your patience.

patk-motional commented 1 year ago

Hi @Fan-Yixuan,

I've answered different flavours of this question before. Please refer to this previous answer https://github.com/motional/nuplan-devkit/issues/269#issuecomment-1512693196. Let me know if you have more questions.

As for your second question, I'll ask my colleague @christopher-motional to answer it.

Fan-Yixuan commented 1 year ago

Hi @patk-motional, thanks for the reference on my first question, but I have a few more questions: (1) why does the number of entries in scenario_tag exceed the number of entries in lidar_pc?

(2) Also, when a scenario starts near the end of a scene, the goal point (given by the scene, of course) may be close to the scenario's start point (even closer than the farthest position we want to predict), which seems to affect training.

(3) In my preliminary experiments I used the Boston data as the training set (1647 logs, i.e. db files) and the val split as the validation set (1381 logs). After caching (with timestamp_threshold_s=5) I found only 1020 folders (corresponding to logs) in the cache path, 510 each from the Boston and val sets. Why is that?

(4) Most critically, here is the profiling of a small but still somewhat performant model during training:

[profiling screenshots attached to the original issue]

(4.1) Why is fetching data for the val set so slow? It takes twenty times as long as network inference, although I'm sure I'm using cached data with use_cache_without_dataset: true. (4.2) Why is logging for training so slow? (In nuplan/planning/training/modeling/lightning_module_wrapper.py _log_step(), I added sync_dist=True to the logging calls to ensure correct logging for DDP.)
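For context on what `sync_dist=True` costs: it roughly asks Lightning to all-reduce (average) each logged metric across DDP ranks before recording it, which adds communication on every logging call. A toy single-process sketch of that reduction (the list of per-rank values stands in for a real distributed backend, purely for illustration):

```python
# Toy stand-in for the cross-rank averaging that sync_dist=True implies.
# There is no real distributed backend here; "ranks" are just list entries.

def sync_dist_mean(per_rank_values):
    """All-reduce (mean) stand-in: every rank ends up with the same average."""
    mean = sum(per_rank_values) / len(per_rank_values)
    return [mean] * len(per_rank_values)

# Each of 4 GPUs computed a slightly different training loss:
losses = [0.5, 0.5, 0.25, 0.75]
print(sync_dist_mean(losses))  # every rank logs 0.5
```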

Your kind help will mean a lot to me, THANKS!

christopher-motional commented 1 year ago

Hi @Fan-Yixuan,

Sorry for the delay. Regarding your question about parameter scaling: the effective batch size actually increases when using DDP, since the specified batch size is used on each GPU, so we scale the learning rate accordingly.
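In other words, something like the linear scaling rule (the exact scaling scheme is an assumption here; the devkit's configuration may differ):

```python
# Linear LR scaling sketch: effective batch = num_gpus * per-gpu batch,
# so the learning rate grows by the same factor. This is the common
# "linear scaling rule", assumed here for illustration.

def scale_lr(base_lr: float, num_gpus: int) -> float:
    """Scale a single-GPU learning rate for DDP with num_gpus workers."""
    return base_lr * num_gpus

print(scale_lr(1e-4, 8))  # 0.0008
```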

christopher-motional commented 1 year ago

Regarding your follow-up questions,

  1. You can see the discussion @patk-motional linked for a little more information, but a given lidar_pc can be tagged with multiple scenario types, which explains the discrepancy you're seeing.
  2. Yes, we leave it up to you as a competitor how to handle this. Note that we don't intend to do anything drastically different in the test set vs. what is available for training.
  3. This could potentially be an effect of certain scenarios being filtered out depending on what configuration you are using in the scenario filter. Could you share what parameters you were using for the scenario filter?
  4. 4.1. How are you doing the caching (s3 vs. local cache)? We are aware of a current issue where pulling data from a cache in s3 doesn't work as expected, causing a lot of extra downloading; we are working on a fix. 4.2. We use a pretty simple wrapper of the base Lightning module. You can check the Lightning documentation for more details on how logging is handled; the nuPlan wrapper shouldn't really be doing anything different. Per that documentation, though, using sync_dist=True does significantly increase communication overhead.
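Point 1 can be made concrete with a toy in-memory database; the table and column names below are simplified stand-ins loosely mirroring the nuPlan schema, not the real one:

```python
import sqlite3

# Toy illustration: one lidar_pc row can carry several scenario tags,
# so COUNT(*) on scenario_tag can exceed COUNT(*) on lidar_pc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lidar_pc (token TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE scenario_tag (lidar_pc_token TEXT, type TEXT)")

conn.execute("INSERT INTO lidar_pc VALUES ('pc_0')")
# The same frame is tagged with two scenario types at once:
conn.executemany(
    "INSERT INTO scenario_tag VALUES (?, ?)",
    [("pc_0", "on_intersection"), ("pc_0", "following_lane_with_lead")],
)

n_pc = conn.execute("SELECT COUNT(*) FROM lidar_pc").fetchone()[0]
n_tags = conn.execute("SELECT COUNT(*) FROM scenario_tag").fetchone()[0]
print(n_pc, n_tags)  # 1 2
```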
Fan-Yixuan commented 1 year ago

@christopher-motional thanks for your answers, my main question now is about question 3, my corresponding config is:

scenario_filter:
  scenario_types: null # List of scenario types to include
  scenario_tokens: null # List of scenario tokens to include

  log_names: null # Filter scenarios by log names
  map_names: null # Filter scenarios by map names

  num_scenarios_per_type: null # Number of scenarios per type
  limit_total_scenarios: null # Limit total scenarios (float = fraction, int = num) - this filter can be applied on top of num_scenarios_per_type
  timestamp_threshold_s: 5.0 # Filter scenarios to ensure scenarios have more than `timestamp_threshold_s` seconds between their initial lidar timestamps
  ego_displacement_minimum_m: null # Whether to remove scenarios where the ego moves less than a certain amount
  ego_start_speed_threshold: null # Limit to scenarios where the ego reaches a certain speed from below
  ego_stop_speed_threshold: null # Limit to scenarios where the ego reaches a certain speed from above
  speed_noise_tolerance: null # Value at or below which a speed change between two timepoints should be ignored as noise.

  expand_scenarios: false # Whether to expand multi-sample scenarios to multiple single-sample scenarios
  remove_invalid_goals: true # Whether to remove scenarios where the mission goal is invalid
  shuffle: true # Whether to shuffle the scenarios

Does this explain why my caching process discards most of the db files? Thanks a lot.
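For reference, here is a sketch of what `timestamp_threshold_s: 5.0` plausibly does, based only on the comment in the config above (not on the devkit's actual implementation): keep only scenarios whose initial lidar timestamps are at least 5 s apart.

```python
# Hypothetical reading of timestamp_threshold_s: drop any scenario whose
# initial timestamp is within threshold_s of the previously kept one.

def filter_by_timestamp(start_times_s, threshold_s=5.0):
    """Greedily keep scenario start times at least threshold_s apart."""
    kept = []
    for t in sorted(start_times_s):
        if not kept or t - kept[-1] >= threshold_s:
            kept.append(t)
    return kept

# Scenario start times within one log (seconds, illustrative values):
starts = [0.0, 1.0, 2.5, 6.0, 7.0, 12.0]
print(filter_by_timestamp(starts))  # [0.0, 6.0, 12.0]
```

This kind of thinning reduces the scenario count per log substantially, but on its own it wouldn't explain entire db files disappearing from the cache.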

christopher-motional commented 1 year ago

Is this with distributed caching? If so, could you compare the results for single-node vs. multi-node caching?

muety commented 3 months ago

Thanks for the above discussion, very interesting. To be honest, though, I still didn't entirely get the difference between scenes and scenarios.

I understood that a scenario is always of a single type, while a scene can contain multiple different types. Thus the exact same snapshot in time / same frame will probably be part of many scenarios on average, but always belong to only one scene, correct? So each time a certain tag "pops up", that is the starting point of a new scenario, until the tag "disappears" again / no longer describes what's currently happening on the road?
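Under that reading, each maximal run of frames sharing a tag would become one scenario of that type. A minimal sketch of that interpretation (my own simplification: real frames can carry several tags at once, and this is not devkit code):

```python
from itertools import groupby

# Hypothetical scenario extraction: collapse a per-frame tag sequence
# into (tag, start_index, length) runs, one run per scenario.

def tag_runs(frame_tags):
    """Return maximal runs of identical consecutive tags."""
    runs, idx = [], 0
    for tag, group in groupby(frame_tags):
        n = len(list(group))
        runs.append((tag, idx, n))
        idx += n
    return runs

frames = ["lane_follow", "lane_follow", "turn_left", "turn_left", "lane_follow"]
print(tag_runs(frames))
# [('lane_follow', 0, 2), ('turn_left', 2, 2), ('lane_follow', 4, 1)]
```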

What I'd like to understand is how the scenes were sliced. Are they just randomly cut out of a whole log of driving, or is there more to what a scene is actually supposed to represent?