opendilab / LMDrive

[CVPR 2024] LMDrive: Closed-Loop End-to-End Driving with Large Language Models

Questions about evaluation performance #20

Open · sky77764 opened this issue 4 months ago

sky77764 commented 4 months ago

Thank you for sharing your valuable research.

I am deeply impressed by it and have tested a few things, but I have some questions.

  1. I am curious which configuration you used to obtain the reported DS values: https://github.com/opendilab/LMDrive?tab=readme-ov-file#lmdrive-weights

Here are my configurations: leaderboard/team_code/lmdriver_config.py

import os

class GlobalConfig:
    """base architecture configurations"""

    # Controller
    turn_KP = 1.25
    turn_KI = 0.75
    turn_KD = 0.3
    turn_n = 40  # buffer size

    speed_KP = 5.0
    speed_KI = 0.5
    speed_KD = 1.0
    speed_n = 40  # buffer size

    max_throttle = 0.75  # upper limit on throttle signal value in dataset
    brake_speed = 0.1  # desired speed below which brake is triggered
    brake_ratio = 1.1  # ratio of speed to desired speed at which brake is triggered
    clip_delta = 0.35  # maximum change in speed input to longitudinal controller

    llm_model = 'weights/llava-v1.5-7b'
    preception_model = 'memfuser_baseline_e1d3_return_feature'
    preception_model_ckpt = 'weights/LMDrive-vision-encoder-r50-v1.0/vision-encoder-r50.pth.tar'
    lmdrive_ckpt = 'weights/LMDrive-llava-v1.5-7b-v1.0/llava-v1.5-checkpoint.pth'

    agent_use_notice = False # True
    sample_rate = 2

    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            setattr(self, k, v)
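
(Side note on how these controller gains are consumed: as far as I can tell, they feed a windowed PID controller in the InterFuser/Transfuser style. Below is a minimal sketch for reference; the class name and details are my assumption, not the repo's verbatim code.)

from collections import deque
import numpy as np

class PIDController:
    """Windowed PID: the integral term averages the last n errors,
    the derivative uses the difference of the last two."""

    def __init__(self, K_P=1.0, K_I=0.0, K_D=0.0, n=20):
        self._K_P, self._K_I, self._K_D = K_P, K_I, K_D
        self._window = deque([0.0] * n, maxlen=n)  # n = turn_n / speed_n above

    def step(self, error):
        self._window.append(error)
        integral = np.mean(self._window)
        derivative = self._window[-1] - self._window[-2]
        return self._K_P * error + self._K_I * integral + self._K_D * derivative

# e.g. turn_controller = PIDController(turn_KP, turn_KI, turn_KD, turn_n)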

leaderboard/scripts/run_evaluation.sh

export ROUTES=langauto/benchmark_long.xml
export TEAM_AGENT=leaderboard/team_code/lmdriver_agent.py # agent
export TEAM_CONFIG=leaderboard/team_code/lmdriver_config.py # model checkpoint, not required for expert
export CHECKPOINT_ENDPOINT=results/sample_result.json # results file
#export SCENARIOS=leaderboard/data/scenarios/no_scenarios.json #town05_all_scenarios.json
export SCENARIOS=leaderboard/data/official/all_towns_traffic_scenarios_public.json

Are these configurations correct? If not, the following questions may be moot.

  2. The evaluation results obtained with the above settings are shown below (see the aggregation sketch after the two JSON snippets for how these fields relate). However, higher performance was achieved with agent_use_notice = False. What could be the reason for this? Also, the DS (LangAuto) you reported appears to correspond to the "score_route" value. Does DS (LangAuto) not take infraction penalties into account?

agent_use_notice = True

"scores": {
                "score_composed": 22.576320522257177,   
                "score_penalty": 0.7903388993546836,        
                "score_route": 27.799761292913487
}, 

agent_use_notice = False

"scores": {
                "score_composed": 29.6279197611852,
                "score_penalty": 0.8052244697640636,
                "score_route": 36.35275911099924
},
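
(For reference, here is my understanding of how the leaderboard computes these fields, which motivates the question above; the function below is an illustrative sketch of the assumed aggregation, not code from this repo.)

def aggregate_global_scores(route_records):
    """Assumed CARLA leaderboard aggregation: each route's driving score is
    its route completion times its infraction penalty, and the global values
    are means over routes."""
    n = len(route_records)
    composed = [r["score_route"] * r["score_penalty"] for r in route_records]
    return {
        "score_composed": sum(composed) / n,  # DS: penalties applied per route
        "score_route": sum(r["score_route"] for r in route_records) / n,
        "score_penalty": sum(r["score_penalty"] for r in route_records) / n,
    }

# Note mean(route * penalty) != mean(route) * mean(penalty): that is why
# 22.58 above is not exactly 27.80 * 0.79 (~21.97).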
deepcs233 commented 4 months ago

Hi!

I have attached the log file for your reference. long_vicuna2_with_notice_t1.json

Could you run it a few more times? The variance of the performance scores is large. Also, please check your CARLA version and the server start command. If you still encounter the problem, I will check the agent code. Thanks for your attention.
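
For reference, the start command assumed here for CARLA 0.9.10.1 (the port is only an example; match --world-port to the $PT used by the evaluation script):

./CarlaUE4.sh --world-port=2000 -opengl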

TonyXuQAQ commented 3 months ago

Same here; I cannot reproduce the reported results either. Could you please give me some suggestions? Thanks for your help!

lmdriver_config.py

import os

class GlobalConfig:
    """base architecture configurations"""

    # Controller
    turn_KP = 1.25
    turn_KI = 0.75
    turn_KD = 0.3
    turn_n = 40  # buffer size

    speed_KP = 5.0
    speed_KI = 0.5
    speed_KD = 1.0
    speed_n = 40  # buffer size

    max_throttle = 0.75  # upper limit on throttle signal value in dataset
    brake_speed = 0.1  # desired speed below which brake is triggered
    brake_ratio = 1.1  # ratio of speed to desired speed at which brake is triggered
    clip_delta = 0.35  # maximum change in speed input to longitudinal controller

    llm_model = './huggingface/models--liuhaotian--llava-v1.5-7b/snapshots/12e054b30e8e061f423c7264bc97d4248232e965/'
    preception_model = 'memfuser_baseline_e1d3_return_feature'
    preception_model_ckpt = 'models/vision-encoder-r50.pth.tar'
    lmdrive_ckpt = 'models/llava-v1.5-checkpoint.pth'

    agent_use_notice = True
    sample_rate = 2

    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            setattr(self, k, v)

run_evaluation.sh

export LEADERBOARD_ROOT=leaderboard
export CHALLENGE_TRACK_CODENAME=SENSORS
export PORT=$PT # same as the carla server port
export TM_PORT=$(($PT+500)) # port for traffic manager, required when spawning multiple servers/clients
export DEBUG_CHALLENGE=0
export REPETITIONS=1 # multiple evaluation runs
export ROUTES=langauto/benchmark_long.xml
export TEAM_AGENT=leaderboard/team_code/lmdriver_agent.py # agent
export TEAM_CONFIG=leaderboard/team_code/lmdriver_config.py # model checkpoint, not required for expert
export CHECKPOINT_ENDPOINT=results/sample_result.json # results file
#export SCENARIOS=leaderboard/data/scenarios/no_scenarios.json #town05_all_scenarios.json
export SCENARIOS=leaderboard/data/official/all_towns_traffic_scenarios_public.json
export SAVE_PATH=data/eval # path for saving episodes while evaluating
export RESUME=False
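
With these variables set, the script is launched with PT pointing at the running CARLA server's port, e.g. (port value assumed):

PT=2000 bash leaderboard/scripts/run_evaluation.sh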

Final results, agent_use_notice=True

"scores": {
                "score_composed": 25.870979763631585,
                "score_penalty": 0.7629596874513518,
                "score_route": 35.450098903630824
            },

agent_use_notice=False

"scores": {
                "score_composed": 26.58510262333422,
                "score_penalty": 0.7749396906256292,
                "score_route": 35.97223471729377
            },
TonyXuQAQ commented 3 months ago

By the way, the results look similar across multiple evaluation runs, so the variance is acceptable.
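
If it helps, here is the small sketch I use to aggregate DS across repeated runs. The glob pattern is hypothetical, and the JSON layout (_checkpoint/global_record/scores) is my assumption about the leaderboard's results format.

import glob
import json
import statistics

scores = []
for path in glob.glob("results/sample_result_run*.json"):  # hypothetical naming
    with open(path) as f:
        result = json.load(f)
    # Assumed layout: global scores live under _checkpoint/global_record.
    scores.append(result["_checkpoint"]["global_record"]["scores"]["score_composed"])

print(f"DS over {len(scores)} runs: "
      f"{statistics.mean(scores):.2f} +/- {statistics.stdev(scores):.2f}")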

MYZhangwww commented 1 month ago

I encountered the same issue. I tried to reproduce the experimental results using the official code and model weights on an RTX 4090 with CARLA 0.9.10.1, but the results I obtained were lower than those reported in the paper: DS = 25.044, RC = 36.526.