Closed peterjc123 closed 1 year ago
Another question: I noticed that the maximum output length is limited to 100 tokens by default. Is this really fair for models like LLAMA? They use more tokens to represent a Chinese character than Chinese models such as Baichuan and InternLM do.
Thanks for your bug report.
The difference comes from the inference framework. We first used the InternLM framework and reported its result, 50.80, on the OpenCompass site. However, the default internlm-chat-7b config we provide uses the huggingface framework, hence the result of 50.52. There are subtle differences between the two frameworks that we have not yet fully figured out. We are working on integrating the InternLM framework into OpenCompass; please watch https://github.com/InternLM/opencompass/pull/51 for the latest progress.
Here are some more detailed results:
model | inference framework | inference method | mmlu |
---|---|---|---|
internlm-chat-7b-hf | huggingface | ppl_ac766d | 50.52 |
internlm-chat-7b-hf | huggingface | gen_a484b3 | 51.14 |
internlm-chat-7b-hf | InternLM | ppl_ac766d | 50.80 |
As for the 100-output-token limit, most of the Chinese dataset configs have a max_out_len parameter, which overrides the max_out_len in the model config. On those datasets the model can output, for example, 1024 tokens during inference. Take this as an example.
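To make the precedence concrete, here is a minimal sketch of the override rule described above. It is an illustration only: it is not the config referred to above and not OpenCompass's actual resolution code. The function name `resolve_max_out_len` and the flat dictionary layout are hypothetical.

```python
from typing import Any, Dict


def resolve_max_out_len(model_cfg: Dict[str, Any], dataset_cfg: Dict[str, Any]) -> int:
    """Pick the output-length limit for one (model, dataset) pair.

    Hypothetical helper: a dataset-level max_out_len, when present,
    takes precedence over the model-level default.
    """
    if "max_out_len" in dataset_cfg:
        return dataset_cfg["max_out_len"]
    return model_cfg["max_out_len"]


# Hypothetical configs, flattened for illustration.
model_cfg = {"abbr": "some-model", "max_out_len": 100}
dataset_with_override = {"abbr": "some-zh-dataset", "max_out_len": 1024}
dataset_without_override = {"abbr": "lcsts"}

print(resolve_max_out_len(model_cfg, dataset_with_override))     # dataset value wins
print(resolve_max_out_len(model_cfg, dataset_without_override))  # falls back to the model default
```

Under this rule, a dataset that sets no override, such as lcsts in the discussion here, runs with the model's default of 100.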
Feel free to re-open this issue if you have more questions.
@Leymore For example, for the lcsts dataset, the max sequence length is not overridden, so the results will be incomplete for LLAMA. https://github.com/InternLM/opencompass/blob/840a8ebecb104aae82e27ddc4febe2960bae92db/configs/datasets/lcsts/lcsts_gen_8ee1fe.py#L9
LCSTS is a summarization dataset whose average output is 18 characters and whose max output is 24 characters. We believe 100 tokens is more than sufficient for the LCSTS dataset, so no override is applied.
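A rough way to check this budget is the sketch below. It assumes the worst case for a byte-fallback tokenizer, namely one token per UTF-8 byte; that is an assumption about tokenizer behaviour, not a measurement of LLAMA. The helper name `worst_case_byte_tokens` is hypothetical.

```python
MAX_OUT_LEN = 100        # default output limit discussed in this thread
MAX_SUMMARY_CHARS = 24   # longest LCSTS reference summary, per the figure above


def worst_case_byte_tokens(text: str) -> int:
    """Upper bound on token count if every character is split into one token per UTF-8 byte."""
    return len(text.encode("utf-8"))


# A common Chinese character takes 3 bytes in UTF-8, so 24 characters cost at most 72 byte-tokens.
longest_summary = "中" * MAX_SUMMARY_CHARS
budget_needed = worst_case_byte_tokens(longest_summary)
print(budget_needed, budget_needed <= MAX_OUT_LEN)
```

This bounds only a reference-length summary. A model that writes a numbered list or copies the article, as in the sample predictions, can still run past 100 tokens.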
Here is part of the output of LLAMA65B; we believe incomplete output is not the main problem.
{
"0": {
"origin_prompt": "阅读以下文章,并给出简短的摘要:日前,方舟子发文直指林志颖旗下爱碧丽推销假保健品,引起哗然。调查发现,爱碧丽没有自己的生产加工厂。其胶原蛋白饮品无核心研发,全部代工生产。号称有“逆生长”功效的爱碧丽“梦幻奇迹限量组”售价高达1080元,实际成本仅为每瓶4元!\n摘要如下:",
"reference": "林志颖公司疑涉虚假营销无厂房无研发",
"prediction": "\n1. 爱碧丽的胶原蛋白饮品全部代工生产,没有自己的生产加工厂。\n2. 爱碧丽的胶原蛋白饮品没有核心研发,全部是“盗版”。\n3."
},
"1": {
"origin_prompt": "阅读以下文章,并给出简短的摘要:韩方应对路径可以概括为:企业道歉担责;政府公正不护短;民间祈福关怀。他们深知形象的重要,竭力呵护企业品牌和国家形象。正如有评论,韩国“政府+企业+民众”三位一体式呵护韩国国家形象的“苦心经营”,的确有值得我们借鉴之处。\n摘要如下:",
"reference": "从韩亚航空事故看其应对路径",
"prediction": "韩方应对路径可以概括为:企业道歉担责;政府公正不护短;民间祈福关怀。他们深知形象的重要,竭力呵护企业品牌和国家形象。正如有评��"
},
"2": {
"origin_prompt": "阅读以下文章,并给出简短的摘要:63岁退休教师谢淑华,拉着人力板车,历时1年,走了2万4千里路,带着年过九旬的妈妈环游中国,完成了妈妈“一辈子在锅台边转,也想出去走走”的心愿。她说:“妈妈愿意出去走走,我就愿意拉着,孝心不能等,能走多远就走多远。\n摘要如下:",
"reference": "女子用板车拉九旬老母环游中国1年走2万4千里",
"prediction": "\n1. 谢淑华是一名63岁的退休教师,她和妈妈一起走了2万4千里路,带着年过九旬的妈妈环游中国。\n2. 谢淑华说:“妈�"
},
"3": {
"origin_prompt": "阅读以下文章,并给出简短的摘要:昨天,包括工农中建交五大行在内的多家银行,不约而同地在官网发布公告称,它们的房地产贷款政策没有变化。多家银行表示,会支持居民购买首套住房。一名金融问题专家称,“目前房价不具备大涨大跌的基础,特别是一二线城市狂跌的可能性小。”\n摘要如下:",
"reference": "银行集体发声:房贷政策没变",
"prediction": "\n1. 多家银行发布公告,表示房地产贷款政策没有变化。\n2. 多家银行表示,会支持居民购买首套住房。\n3. 一名金融问题专家称,“目前房价不具备"
},
"4": {
"origin_prompt": "阅读以下文章,并给出简短的摘要:广东4名律师致函中国民航局,要求其规定,航班起飞前要向乘客公布机组人员信息包括安全飞行时间、职业经历等。沪上业内人士认为,一般能被安排执飞任务的飞行员,均拥有民航认可的飞行资质。而责任心和使命感是无法通过飞行时间反映。\n摘要如下:",
"reference": "四律师上书民航总局:起飞前应公布机长信息",
"prediction": "\nFour lawyers from Guangdong province have written to the Civil Aviation Administration of China (CAAC), demanding that it require airlines to publicize information about their pilots before takeoff, including their flight safety records and professional experience.\nThe lawyers said the move would help passengers make informed choices about which airlines to fly with.\nThe letter was sent to the CAAC on Monday, the same day that a China Eastern Airlines flight from Xi"
},
"5": {
"origin_prompt": "阅读以下文章,并给出简短的摘要:任教五十年,钱理群在2012年教师节前夕宣布“告别教育”。从北大退休后,钱理群投身中学教育,试图“改变人心”,他以鲁迅自励,要在绝望中反抗,但基础教育十年试水,却令他收获“丰富的痛苦”。他说,—切不能为应试教育服务的教育根本无立足之地。\n摘要如下:",
"reference": "钱理群“告别教育”",
"prediction": "\n钱理群在2012年教师节前夕宣布“告别教育”,并在微博上发表了一篇“告别教育”的文章,引发了广泛关注。\n钱理群在北京大学教育学院毕业后,担任教师,在教"
},
"6": {
"origin_prompt": "阅读以下文章,并给出简短的摘要:①北京和上海户籍的游客可获得韩国多次签证;②“整容客”可以不经由韩国使领馆、直接在网上申请签证;③中泰免签的实施日期尚未敲定;④越南已向中国持通行证旅游的公民全面开放。\n摘要如下:",
"reference": "中国游客大增多国放宽签证",
"prediction": "①北京和上海户籍的游客可获得韩国多次签证;②“整容客”可以不经由韩国使领馆、直接在网上申请签证;③中泰免签的实施日期尚未��"
},
"7": {
"origin_prompt": "阅读以下文章,并给出简短的摘要:12月12日,多家被立案稽查的沪市公司集体对外发布退市风险提示公告,*ST国创位列\"\"黑名单\"\"。目前证监会调查仍在进行,*ST国创尚未收到此次立案调查书面结论意见。一旦立案调查事项触及相关规定,公司股票将被实施退市风险警示。\"\n摘要如下:",
"reference": "信披违规外加业绩亏损*ST国创退市风险概率大增",
"prediction": "12月12日,多家被立案稽查的沪市公司集体对外发布退市风险提示公告,*ST国创位列\"\"黑名单\"\"。目前证监会调查仍在进行,*ST国创尚未收到此次立案调查书面结�"
},
"8": {
"origin_prompt": "阅读以下文章,并给出简短的摘要:据微信公众号“界面”报道,4日上午10点左右,中国发改委反垄断调查小组突击查访奔驰上海办事处,调取数据材料,并对多名奔驰高管进行了约谈。截止昨日晚9点,包括北京梅赛德斯-奔驰销售服务有限公司东区总经理在内的多名管理人员仍留在上海办公室内\n摘要如下:",
"reference": "发改委反垄断调查小组突击调查奔驰上海办事处",
"prediction": "据微信公众号“界面”报道,4日上午10点左右,中国发改委反垄断调查小组突击查访奔驰上海办事处,调取数据材料,并对多名奔驰高管进行了约谈。��"
},
"9": {
"origin_prompt": "阅读以下文章,并给出简短的摘要:12日,上周遭到极端分子恐怖袭击的《查理周刊》公布了“幸存者专刊”的封面,绿色的底色上,穆罕默德嘴角向下,手中举着一周来所有人都耳熟能详的口号:我是查理。这名伊斯兰教先知的头顶上还写着一句话:宽恕与原谅。该封面的作者上周因为迟到逃过一劫。\n摘要如下:",
"reference": "巴黎查理周刊最新一期封面(图)",
"prediction": "\nThe cover of the latest issue of Charlie Hebdo, the French satirical magazine that was the target of a terrorist attack last week, features a cartoon of the Prophet Muhammad. The green background is the same as the one used in Charlie Hebdo’s last issue, which came out the week of the attack. Muhammad is depicted crying and holding a sign that says, in French, “I am Charlie.” Above his head is the phrase “All is"
}
}
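Several predictions above end in "�". One plausible reading, which is an assumption and not confirmed in this thread, is that generation stopped at the token limit partway through a multi-byte character, leaving an incomplete UTF-8 sequence. The sketch below reproduces that effect with plain byte truncation; it does not use the model's real tokenizer, and `decode_truncated` is a hypothetical helper.

```python
def decode_truncated(text: str, keep_bytes: int) -> str:
    """Keep only the first keep_bytes UTF-8 bytes of text and decode them leniently.

    An incomplete trailing character is rendered as U+FFFD, the replacement character.
    """
    raw = text.encode("utf-8")[:keep_bytes]
    return raw.decode("utf-8", errors="replace")


sample = "正如有评论"  # 5 characters, 3 bytes each, 15 bytes in total
cut = decode_truncated(sample, 14)  # drop the final byte of the last character

print(cut)
print(cut.endswith("\ufffd"))
```

If this reading is right, those samples hit the 100-token limit mid-character, which supports the claim that the limit was reached, though not the claim that a longer limit would have improved the summaries.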
Considering the prompt, it may be insufficient. Using our custom LLAMA, it somehow exceeds the limit.
Describe the bug
The result I get is listed below.
But according to the leaderboard, it should be 50.8
Environment
...
Other information
No response