opendilab / LMDrive

[CVPR 2024] LMDrive: Closed-Loop End-to-End Driving with Large Language Models

Performance metrics in the paper #37

Open zacz08 opened 3 months ago

zacz08 commented 3 months ago

Hi, thanks for your nice work. I have a question about reproducing the driving score shown in the paper. I ran the evaluation with the following configuration:

    preception_model = 'memfuser_baseline_e1d3_return_feature'
    preception_model_ckpt = '/LMDrive/ckpt/vision-encoder-r50.pth.tar'
    llm_model = '/LMDrive/llm_model/llava-v1.5-7b'
    lmdrive_ckpt = '/LMDrive/ckpt/llava-v1.5-checkpoint.pth'
    agent_use_notice = False
    sample_rate = 2

When I compare the results I reproduced (taken from the result.json file), the "Avg. driving score" and "Avg. route completion" are lower than the metrics reported in the paper, while the "Avg. infraction penalty" is the same. Were the values in Table 2 of the paper also normalized by driving distance, as in Table 4? Or is it possible that my configuration differs from yours?

Thank you!
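
For reference, this is how I read the averages back out of result.json (a rough sketch, assuming the file follows the standard CARLA leaderboard checkpoint format, where each route record stores score_route, score_penalty, and score_composed = score_route x score_penalty; the field names are the leaderboard's, not LMDrive-specific):

    # Rough sketch (not from the LMDrive repo): recompute the averages from
    # result.json, assuming the standard CARLA leaderboard checkpoint format.
    import json

    with open("result.json") as f:  # path assumed; point this at your own result file
        records = json.load(f)["_checkpoint"]["records"]

    rc = [r["scores"]["score_route"] for r in records]     # route completion per route
    ip = [r["scores"]["score_penalty"] for r in records]   # infraction penalty per route
    ds = [r["scores"]["score_composed"] for r in records]  # driving score = RC * IS per route

    # Unweighted arithmetic means over routes
    print("Avg. route completion:  ", sum(rc) / len(rc))
    print("Avg. infraction penalty:", sum(ip) / len(ip))
    print("Avg. driving score:     ", sum(ds) / len(ds))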

dingli-dean commented 3 months ago

Same issue with the same settings (using the released model linked in the README and the default evaluation script). [image attached]

lion-zhang commented 3 months ago

@deepcs233 Thank you for the great work. I'm seeing a similar issue with the following settings:

lmdriver_config.py

    llm_model = '/mnt/data/HuggingFace/llava-v1.5-7b'
    preception_model = 'memfuser_baseline_e1d3_return_feature'
    preception_model_ckpt = '/mnt/data/HuggingFace/LMDrive-vision-encoder-r50-v1.0/vision-encoder-r50.pth.tar'
    lmdrive_ckpt = '/mnt/data/HuggingFace/LMDrive-llava-v1.5-7b-v1.0/llava-v1.5-checkpoint.pth'

    agent_use_notice = True
    sample_rate = 2

run_evaluation.sh

    export ROUTES=/path/to/LMDrive/langauto/benchmark_tiny.xml
    export SCENARIOS=${LEADERBOARD_ROOT}/data/official/all_towns_traffic_scenarios_public.json

Here, there is no LMDrive/leaderboard/data/LangAuto folder, so I used LMDrive/langauto/benchmark_tiny.xml instead.
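
As a sanity check that ROUTES points at the intended LangAuto variant, I inspect the benchmark XML with a small ad-hoc script (not part of the repo; it assumes the standard leaderboard routes format, where each route element carries id and town attributes):

    # Ad-hoc check: count the routes in the benchmark XML referenced by ROUTES,
    # assuming the standard leaderboard routes XML format.
    import xml.etree.ElementTree as ET

    routes_file = "/path/to/LMDrive/langauto/benchmark_tiny.xml"  # same path as ROUTES above
    routes = ET.parse(routes_file).getroot().findall("route")
    print(f"{routes_file}: {len(routes)} routes")
    for r in routes[:3]:
        # each <route> element has an id and a town attribute in this format
        print(" ", r.get("id"), r.get("town"))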

The output from the terminal is: output.txt

The results are attached: sample_result.json

The performance is lower than that reported in the paper. Please advise, since I am not sure whether I am using the correct settings. Thank you in advance.

deepcs233 commented 1 month ago

Hi! Sorry for the late reply, I've been very busy recently.

@lion-zhang Could you try setting agent_use_notice = False?

@zacz08 What are your reproduced results? If they are only slightly lower than those reported in the paper, this is normal due to the high randomness in the CARLA simulator and in end-to-end evaluations. To obtain more stable results, I recommend running the evaluations multiple times.

@dingli-dean The LangAuto-Tiny result looks good. Can you make sure that LangAuto-Short used the correct route file? The default route file is ROUTES=langauto/benchmark_long.xml, and LangAuto-Short should use benchmark_short.xml.
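
If you want to aggregate the repeated runs, something like this would work (a rough sketch with placeholder paths, assuming each run writes its own result.json in the standard leaderboard checkpoint format):

    # Rough sketch: average the global driving score over several evaluation
    # runs, assuming each run wrote its own result.json in the standard
    # leaderboard checkpoint format.
    import json
    import statistics

    run_files = ["run1/result.json", "run2/result.json", "run3/result.json"]  # placeholder paths

    def avg_driving_score(path):
        with open(path) as f:
            records = json.load(f)["_checkpoint"]["records"]
        return statistics.mean(r["scores"]["score_composed"] for r in records)

    scores = [avg_driving_score(p) for p in run_files]
    print("per-run DS:", [round(s, 2) for s in scores])
    print("mean DS:   ", round(statistics.mean(scores), 2))
    print("stdev DS:  ", round(statistics.stdev(scores), 2))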

@dingli-dean @lion-zhang @zacz08 What kind of graphics card did you use for the evaluation? I found that the GPU may influence the performance. Graphics cards that cannot be used for display output (such as the A100 80G, A100 40G, or V100) seem to give more stable (consistently high or low) results.

zacz08 commented 1 month ago

Hi, @deepcs233, thanks for your reply.

Since my test platform is old (Intel i7-9700K + RTX 2080 Ti), I only ran the LangAuto-Short and LangAuto-Tiny benchmarks once each, with the results below. I found that the infraction penalty scores were not affected.

LangAuto-Short: DS: 37.203, RC: 46.687, IS: 0.827
LangAuto-Tiny: DS: 56.292, RC: 68.402, IS: 0.844

I also run the CARLA server locally on the same machine. Could this affect the results as well? I will test it a few more times as you suggested (on a better machine) and reply once I have the results.