fix: prevent Chinese examples from being converted to Unicode encoding

coolmian commented 1 week ago

Using ensure_ascii=False when serializing examples to JSON keeps Chinese characters as-is instead of escaping them as \uXXXX sequences.

before:

[[ ## json_output ## ]]
[{"type": "narration", "content": "\u5c0f\u660e\u8d70\u51fa\u5bb6\u95e8\uff0c\u8ddf\u90bb\u5c45\u6253\u62db\u547c"}, {"type": "dialogue", "name": "\u5c0f\u660e", "reaction": "\u9ad8\u5174", "content": "\u4f60\u597d\u5440"}, {"type": "narration", "content": "\u90bb\u5c45\u5fae\u7b11\u671d\u4ed6\u70b9\u5934"}, {"type": "voiceover", "name": "\u90bb\u5c45", "reaction": "\u5185\u5fc3\u5947\u602a", "content": "\u8fd9\u5c0f\u5b50\u4eca\u5929\u600e\u4e48\u5bf9\u6211\u8fd9\u4e48\u6709\u793c\u8c8c"}]

[[ ## completed ## ]]

after:

[[ ## json_output ## ]]
[{"type": "narration", "content": "小明走出家门，跟邻居打招呼"}, {"type": "dialogue", "name": "小明", "reaction": "高兴", "content": "你好呀"}, {"type": "narration", "content": "邻居微笑朝他点头"}, {"type": "voiceover", "name": "邻居", "reaction": "内心奇怪", "content": "这小子今天怎么对我这么有礼貌"}]

[[ ## completed ## ]]
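The difference between the two outputs above can be reproduced with the standard library alone. This is a minimal sketch that assumes the demos are serialized with `json.dumps`. The variable names are illustrative and not taken from the dspy code base.

```python
import json

# One item from the example above.
demo = {"type": "dialogue", "name": "小明", "reaction": "高兴", "content": "你好呀"}

# Default behaviour: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(demo)

# With ensure_ascii=False the characters are written out unchanged.
readable = json.dumps(demo, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode to the same data; only the text shown to the LLM differs.
assert json.loads(escaped) == json.loads(readable) == demo
```

Both forms are valid JSON, so parsing is unaffected. The change only alters what the model sees in the prompt.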

coolmian commented 1 week ago

My use case:

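from typing import Literal

import dspy
from pydantic import BaseModel, Field

class Narrative(BaseModel):
    type: Literal["dialogue", "narration", "voiceover"] = Field()
    # Optional, because the default is None.
    content: str | None = Field(default=None)
    name: str | None = Field(default=None)
    reaction: str | None = Field(default=None)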

class StoryToJSON(dspy.Signature):
    """
    Convert story text into structured JSON format with specific fields for narration, dialogue, and voiceover.
    Make the performance more like a script or animation script style, help the performer better understand the character's emotions and reactions, and make the content more expressive and situational.
    NOTE: Convert each paragraph based on the story_text without skipping or omitting any content.
    """

    story_text = dspy.InputField()
    json_output: list[Narrative] = dspy.OutputField(desc="list of narratives")

# Define the predictor (assumes an LM has already been configured, e.g. with dspy.configure(lm=...)).
predictor = dspy.Predict(StoryToJSON)
example = dspy.Example(
    story_text = "小明走出家门，跟邻居打招呼：“你好呀”。邻居微笑朝他点头，内心奇怪这小子今天怎么对他这么有礼貌？",
    json_output = [
        {"type": "narration", "content": "小明走出家门，跟邻居打招呼"},
        {"type": "dialogue", "name": "小明", "reaction": "高兴", "content": "你好呀"},
        {"type": "narration", "content": "邻居微笑朝他点头"},
        {"type": "voiceover", "name":"邻居", "reaction": "内心奇怪", "content": "这小子今天怎么对我这么有礼貌"}
    ]
)

predictor.demos = [example]
with open("dataset/1.txt", "r", encoding="utf-8") as f:
    story_text = f.read()

# Call the predictor on a particular input.
pred = predictor(story_text=story_text)
print(f"Question: {story_text}")
for item in pred.json_output:
    print(item.model_dump())

If examples containing Chinese strings are serialized as Unicode escape sequences (\uXXXX), the LLM tends to reply with escaped strings as well, which lowers reply quality and requires extra decoding work.
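To illustrate the extra decoding work: when a reply contains literal \uXXXX sequences outside of valid JSON, a JSON parser will not unescape them and the caller has to do it. The helper below, `unescape_unicode`, is hypothetical and not part of dspy. It is a rough sketch that handles only four-digit escapes and does not recombine surrogate pairs.

```python
import re

# Matches a literal backslash-u followed by exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def unescape_unicode(text: str) -> str:
    """Replace literal \\uXXXX sequences with the characters they encode.

    Hypothetical helper for illustration; surrogate pairs are not recombined.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# A raw reply fragment as it might come back from the LLM.
escaped_reply = r"\u4f60\u597d\u5440"
print(unescape_unicode(escaped_reply))
```

With ensure_ascii=False in the prompt examples, this post-processing step is much less likely to be needed.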

okhat commented 1 week ago

Thanks a lot @coolmian !

stanfordnlp / dspy

fix: prevent Chinese examples from being converted to Unicode encoding #1774