sotopia-lab / sotopia

Sotopia: an Open-ended Social Learning Environment (ICLR 2024 spotlight)
https://docs.sotopia.world
MIT License

[BUG]: Reward prompt log is wrong due to the use of shared instance member across different coroutines #21

Open ProKil opened 7 months ago

ProKil commented 7 months ago

Description of the bug

The line linked below stores the reward prompt from the instance member evaluator.prompt, which is updated on each __acall__. This is a dangerous operation because the prompt is overwritten and lost after several parallel calls to env.astep.

https://github.com/sotopia-lab/sotopia/blame/c4fdb166bab6f20ee541c48dd614981d38303b19/sotopia/envs/parallel.py#L564
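To illustrate the failure mode, here is a minimal self-contained sketch (not Sotopia code; the class and names below are made up) of how a shared instance attribute gets clobbered when the same evaluator is awaited concurrently:

# Minimal sketch (not Sotopia code): a shared instance attribute is
# clobbered when the same evaluator instance is awaited concurrently.
import asyncio

class SharedPromptEvaluator:
    def __init__(self) -> None:
        self.prompt: str | None = None

    async def __acall__(self, episode_id: int) -> None:
        self.prompt = f"reward prompt for episode {episode_id}"
        await asyncio.sleep(0)  # yield control, as an awaited LLM call would
        # Another coroutine may have replaced self.prompt by now, so the
        # value logged here can belong to a different episode.
        print(f"episode {episode_id} logs prompt: {self.prompt!r}")

async def main() -> None:
    evaluator = SharedPromptEvaluator()  # one instance shared by all calls
    await asyncio.gather(*(evaluator.__acall__(i) for i in range(3)))

asyncio.run(main())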

Steps To Reproduce

In the current codebase shared in #7, you can find that 66% of the reward prompts mention neither of the character names.

@sharonwx54 contributed this script to reproduce:

from sotopia.database.logs import EpisodeLog

SELECTED_TAG = ['gpt-4_gpt-3.5-turbo_v0.0.1_clean', 'gpt-4_gpt-4_v0.0.1_clean']
selected_episodes = {}
for tag in SELECTED_TAG:
    tag_epis = EpisodeLog.find(EpisodeLog.tag == tag).all()
    if len(tag_epis) > 0:
        selected_episodes[tag] = tag_epis
concat_epilist = sum(selected_episodes.values(), [])

def check_msg_prompt_align(episode):
    # The two participants' names from the first message of the episode
    party1 = episode.messages[0][0][1]
    party2 = episode.messages[0][1][1]
    # Flag episodes whose reward prompt mentions neither character name
    if party1 not in episode.rewards_prompt and party2 not in episode.rewards_prompt:
        print(episode.pk)

for episode in concat_epilist:
    check_msg_prompt_align(episode)

Additional Information

We can either

  1. drop the reward prompts
  2. find a coroutine-safe way to log them (a sketch follows below)
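
For option 2, one possible shape (just a sketch with placeholder names, not the actual Sotopia API) is to have __acall__ return the prompt it built along with its result, so env.astep logs a value local to its own coroutine instead of reading it back from the shared evaluator:

# Sketch of a coroutine-safe alternative: return the prompt instead of
# storing it on the shared evaluator. All names here are placeholders.
import asyncio

class ReturningEvaluator:
    async def __acall__(self, episode_id: int) -> tuple[float, str]:
        prompt = f"reward prompt for episode {episode_id}"
        await asyncio.sleep(0)  # stand-in for the awaited LLM call
        score = 0.0
        return score, prompt  # the prompt never touches shared state

async def astep(evaluator: ReturningEvaluator, episode_id: int) -> None:
    score, prompt = await evaluator.__acall__(episode_id)
    # Each coroutine logs the prompt it received, so parallel calls
    # cannot overwrite each other's values.
    print(f"episode {episode_id} logs prompt: {prompt!r}")

async def main() -> None:
    evaluator = ReturningEvaluator()
    await asyncio.gather(*(astep(evaluator, i) for i in range(3)))

asyncio.run(main())

A contextvars.ContextVar would be another option if changing the return type is too invasive, since each asyncio task runs in its own copy of the context.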
XuhuiZhou commented 7 months ago

Reward prompts are used to debug the process in sync mode (i.e., when batch size = 1), so maybe we can drop the reward prompts in async mode (batch size >= 2).
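
If we go that route, the logging site could simply gate on the batch size, roughly like this (the variable names here are hypothetical):

# Hypothetical guard at the logging site: only trust evaluator.prompt
# when a single episode is being evaluated at a time.
rewards_prompt = evaluator.prompt if batch_size == 1 else ""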

XuhuiZhou commented 1 month ago

Okay, I have now found that this bug could affect more of the functions currently supported in Sotopia.

Basically, any function that uses:

async def aevaluate_one_episode(
    episode: EpisodeLog,
    model: str = "gpt-4",
    tag: str | None = None,
    push_to_db: bool = False,
) -> None:

could be affected. We need to find a way to extract reward prompts safely, because this is crucial for establishing robust evaluators. This could be relevant to #164 as well.
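
The exposure presumably comes from re-evaluating many episodes concurrently, e.g. something like the fragment below (this is only a sketch; the wrapper name and the call pattern are my assumptions, and the import path for aevaluate_one_episode is omitted):

# Fragment, not a complete script: re-evaluating episodes concurrently
# shares one evaluator across coroutines, which is exactly the race above.
import asyncio
from sotopia.database.logs import EpisodeLog

async def aevaluate_many(episodes: list[EpisodeLog], tag: str | None = None) -> None:
    # aevaluate_one_episode as defined above; import path omitted here
    await asyncio.gather(
        *(aevaluate_one_episode(ep, model="gpt-4", tag=tag) for ep in episodes)
    )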

XuhuiZhou commented 1 month ago

@bugsz @ruiyiw, tagging you here as well. Could you let us know whether #sotopia-better-eval is affected by this?

bugsz commented 1 month ago

Upon checking, I can confirm it is not affected as long as the original Sotopia episodes are not re-evaluated. This only affects episodes that need re-evaluation.