openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

GPT-4o has a basic bug: junk corpus text found in the latest tokens, and in testing GPT-4o talks nonsense and hallucinates #297

Closed alexhmyang closed 1 month ago

alexhmyang commented 1 month ago

GPT-4o has a basic bug: junk corpus text found in the latest tokens, and in testing GPT-4o talks nonsense and hallucinates

[Screenshots: 微信截图_20240517140555, 微信截图_20240517140604, 微信截图_20240517140621, 微信截图_20240517140405, 微信截图_20240517113026]

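For example, the vocabulary contains a junk token, "微信公众号天天中彩票" (roughly "WeChat official account, win the lottery every day"). If you enter "微信公众号天天中彩票 是什么意思" ("What does 微信公众号天天中彩票 mean?") on the official GPT-4o site, it starts talking nonsense. For example, it answered: "'微信娱乐代理' (WeChat entertainment agent) may be a WeChat activity or group involving adult content. '成人视频' (adult video) refers to services that may include adult video or live-streamed content." As you can see, the actual answer has nothing to do with the question.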
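A minimal sketch for checking the single-token claim locally. It assumes the `tiktoken` package is installed and that GPT-4o's vocabulary is the `o200k_base` encoding; neither is stated in this report. The helper names `is_single_token` and `check_with_tiktoken` are hypothetical; only `tiktoken.get_encoding` and `Encoding.encode` are library calls.

```python
from typing import Callable, Sequence


def is_single_token(encode: Callable[[str], Sequence[int]], text: str) -> bool:
    """Return True if `encode` maps `text` to exactly one token id."""
    return len(encode(text)) == 1


def check_with_tiktoken(text: str, encoding_name: str = "o200k_base") -> bool:
    """Check `text` against a tiktoken encoding.

    Assumption: "o200k_base" is the encoding behind GPT-4o.
    The import is local so the pure helper above works without tiktoken.
    """
    import tiktoken  # third-party; fetches the vocabulary file on first use

    enc = tiktoken.get_encoding(encoding_name)
    return is_single_token(enc.encode, text)
```

Calling `check_with_tiktoken("微信公众号天天中彩票")` should return `True` if the phrase really is one vocabulary entry. Whether a leading space is part of the stored token may change the result, so it is worth trying both variants.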

hauntsaninja commented 1 month ago

I believe the folks who chose the vocabulary for GPT-4o are now aware of this, maybe they'll ship a patch to the vocab or GPT-4o or both.

This is known as the solidgoldmagikarp problem, named after a similarly problematic token from the GPT-2 vocabulary.

echo-valor commented 1 month ago

May I ask how you obtained the Chinese part of the vocabulary?
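One possible way to pull Chinese entries out of the vocabulary, offered as a sketch and not necessarily the method used in the original report. It assumes `tiktoken` is installed and that `o200k_base` is the relevant encoding. `cjk_tokens` and `load_o200k_token_bytes` are hypothetical helper names; `Encoding.token_byte_values()` is the library call that returns the raw bytes of every token.

```python
import re
from typing import Iterable, List

# Basic CJK Unified Ideographs block only; extension blocks are ignored here.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def cjk_tokens(token_bytes: Iterable[bytes], min_chars: int = 2) -> List[str]:
    """Return tokens that decode as valid UTF-8 and contain at least
    `min_chars` CJK characters, longest first."""
    found = []
    for raw in token_bytes:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Many BPE tokens are fragments of multi-byte characters; skip them.
            continue
        if len(CJK_CHAR.findall(text)) >= min_chars:
            found.append(text)
    # sorted() is stable, so tokens of equal length keep their input order.
    return sorted(found, key=len, reverse=True)


def load_o200k_token_bytes() -> List[bytes]:
    """Load all token byte strings (assumption: GPT-4o uses "o200k_base")."""
    import tiktoken  # third-party; fetches the vocabulary file on first use

    return tiktoken.get_encoding("o200k_base").token_byte_values()
```

Running `cjk_tokens(load_o200k_token_bytes())[:50]` would list the 50 longest Chinese entries, which is where long spam-like phrases are most likely to show up.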

alexhmyang commented 1 month ago

I believe the folks who chose the vocabulary for GPT-4o are now aware of this, maybe they'll ship a patch to the vocab or GPT-4o or both.

This is known as the solidgoldmagikarp problem, named after a similarly problematic token from the GPT-2 vocabulary.

I hope they notice this issue and fix it. I think this is also relevant to OpenAI's safety team.