jmwang0117 / HE-Drive

HE-Drive: Human-Like End-to-End Driving with Vision Language Models

Questions about the VLMs-guided trajectory scorer #6


BritaryZhou commented 6 days ago

@jmwang0117 Thanks for such interesting and great work! I have some questions about the VLMs-guided trajectory scorer.

  1. According to the paper, there are two stages of VLM-guided trajectory scorer. Could you please explain the inputs and outputs of these two stages? For stage 1, what's the Q and A of the "iterative dialogues with Llama 3.2V"? For stage 2, what's the "driving context" assessed by the model?
  2. Is Llama 3.2V fine-tuned?
  3. If Llama 3.2V is not fine-tuned, how are model hallucinations mitigated?

Thanks again for your reply~

jmwang0117 commented 5 days ago
  1. Inputs and outputs of the two stages:

    - Stage 1: The input is a curated dataset containing annotated surround images, descriptions of the current driving scene, motion states of surrounding agents, and the current driving style along with weight adjustment values. Through iterative dialogues with Llama 3.2V, the model assimilates contextual information to mitigate hallucinations. The output of this stage is the current dialogue environment, which is used as input for the next stage when new data is introduced to generate decisions.

    - Stage 2: The input is the driving context assessed by the model, which includes the surround images at a given time and the dialogue environment from Stage 1. The model performs visual question answering (VQA) using prompt templates generated by GPT-4o. The output is the set of adjustment values for the 7 weight parameters of the rule-based scorer, ranging from 1 to 3, to achieve personalized driving styles (a rough sketch of how these adjustments feed the scorer follows this list). For example, as shown in Figure 15 of our manuscript, when Llama 3.2V determines that the current driving style should be Level II aggressive, which favors faster trajectories and penalizes slower ones, it concludes that the speed weight should be increased by 1.0.
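
For concreteness, here is a minimal sketch of how such Stage 2 adjustments could feed a rule-based scorer. The weight names, sub-score functions, and additive update below are assumptions for illustration, not the actual HE-Drive implementation:

```python
# Hypothetical sketch: applying VLM-predicted weight adjustments (Stage 2 output)
# to a rule-based trajectory scorer. Weight names and sub-scores are illustrative.
BASE_WEIGHTS = {
    "speed": 1.0, "comfort": 1.0, "safety": 1.0, "progress": 1.0,
    "drivable_area": 1.0, "time_to_collision": 1.0, "heading": 1.0,
}

def score_trajectory(traj, sub_scores, weights):
    """Weighted sum of rule-based sub-scores for one candidate trajectory."""
    return sum(weights[name] * sub_scores[name](traj) for name in weights)

def select_trajectory(candidates, sub_scores, vlm_adjustments):
    """Pick the best candidate after adding the VLM's adjustments
    (e.g. {'speed': +1.0}) to the base weights."""
    weights = {k: v + vlm_adjustments.get(k, 0.0) for k, v in BASE_WEIGHTS.items()}
    return max(candidates, key=lambda traj: score_trajectory(traj, sub_scores, weights))
```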

jmwang0117 commented 5 days ago
  2. We do not fine-tune Llama 3.2V. Instead, we create a dialogue history to prompt the model.

jmwang0117 commented 5 days ago
  3. We mitigate model hallucinations in our approach for the following reasons:

In the first stage, we create a dialogue history through the manually annotated dataset and guide Llama 3.2V to focus on generating driving style adjustments and answers within the decision-making domain, following the specified answer templates. This approach not only reduces the computational overhead associated with fine-tuning but also ensures that the model produces responses relevant to the driving context.
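
A rough sketch of how such a dialogue history could be assembled as a few-shot prompt is shown below; the message structure and field names are assumptions for illustration, not the released code:

```python
# Hypothetical sketch: building a few-shot dialogue history from the manually
# annotated dataset, so Llama 3.2V can be prompted without fine-tuning.
def build_dialogue_history(annotated_examples, system_instruction):
    """Each annotated example pairs a scene prompt (surround images, scene
    description, agent motion states) with the desired answer in the fixed
    answer template (driving style + weight adjustments)."""
    messages = [{"role": "system", "content": system_instruction}]
    for example in annotated_examples:
        messages.append({"role": "user", "content": example["scene_prompt"]})
        messages.append({"role": "assistant", "content": example["annotated_answer"]})
    return messages  # reused as the dialogue environment in Stage 2
```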

Instead of directly asking Llama 3.2V to provide weight adjustments, we first let the model analyze the scene, detect surrounding agents, and then provide adjustment values within the specified range based on the model's judgment. This guided thinking chain approach, which has been validated in works like [1][2][3][4], can effectively reduce model hallucinations.
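
As an illustration only, a guided prompt of this kind might look roughly like the following; the wording is assumed and is not the exact GPT-4o-generated template:

```python
# Hypothetical staged prompt illustrating the guided thinking chain described above.
GUIDED_PROMPT = """
Step 1 - Scene analysis: describe the current driving scene in the surround-view
images (weather, road type, traffic density).

Step 2 - Agent detection: list the surrounding agents and their motion states
(e.g. the vehicle ahead is decelerating, a pedestrian is near the crosswalk).

Step 3 - Decision: based on Steps 1-2, choose the driving style and output the
adjustment value for each of the 7 scorer weights, each within the allowed range,
using the answer template: {"speed": +1.0, "comfort": 0.0, ...}
"""
```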

By avoiding fine-tuning and instead leveraging dialogue history and guided thinking, we significantly reduce the computational resources required while maintaining the model's ability to generate accurate and contextually relevant driving style adjustments.

[1] Chain of Natural Language Inference for Reducing Large Language Model Ungrounded Hallucinations
[2] DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
[3] DriveLM: Driving with Graph Visual Question Answering
[4] LMDrive: Closed-Loop End-to-End Driving with Large Language Models

BritaryZhou commented 3 days ago

Thank you so much for your quick reply and detailed explanations!~ :) I have two more questions as follows:

  1. How much gain does the VLM-guided scorer bring for other non-diffusion-based planners, e.g. the planner from SparseDrive?
  2. Which version of Llama 3.2V is utilized, e.g. Llama-3.2-11B-Vision?

jmwang0117 commented 1 day ago

1. Efficiency Gain for Non-Diffusion-Based Planners like SparseDrive: When employing DDIM for inference, our VLM-guided scorer allows our planner to achieve higher frames per second (FPS) than VAD and SparseDrive. The planner has around 20M parameters, highlighting its capability for real-time inference and its potential for onboard vehicle deployment. In terms of performance, even with the scorer removed entirely, HE-Drive achieves L2 distances and collision rates similar to VAD and UniAD. An additional advantage is that our scorer is plug-and-play, so it can replace or be used alongside other learning-based scorers (a rough sketch of this plug-and-play usage is given at the end of this reply).

2. Llama 3.2V Version Utilized: We have used the Llama-3.2-11B-Vision model. We will update our manuscript to clearly articulate this detail.
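
To illustrate the plug-and-play point from item 1, a thin wrapper along these lines could re-rank candidate trajectories from any planner; `planner.propose()`, `vlm_scorer.query()`, and the weight/sub-score names are assumed interfaces, not the actual SparseDrive or HE-Drive APIs:

```python
# Hypothetical sketch: re-ranking another planner's candidates with the VLM-guided
# scorer. The planner/scorer interfaces here are assumptions for illustration.
def plan_with_vlm_scorer(planner, vlm_scorer, obs, base_weights, sub_scores):
    candidates = planner.propose(obs)            # candidates from any planner (e.g. SparseDrive)
    adjustments = vlm_scorer.query(obs.images)   # Stage 2 weight adjustments from Llama 3.2V
    weights = {k: v + adjustments.get(k, 0.0) for k, v in base_weights.items()}
    return max(
        candidates,
        key=lambda traj: sum(weights[k] * sub_scores[k](traj) for k in weights),
    )
```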