Open eu9ene opened 2 months ago
I tried an annotation scheme similar to the DA+SQM approach used for human evaluation in WMT22/23. Here's an evaluation of 5 ru-en Flores examples by GPT-4o: https://chatgpt.com/share/2daa0500-ebd8-46ea-9967-bd54026cb229
The main problem is that the model is not consistent in its answers. Changing the prompt, or just regenerating the answer, leads to different scores, and sometimes it points to different mistakes. It also sometimes hallucinates mistakes.
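One way to put a number on the inconsistency would be to regenerate the answer several times and look at the per-segment spread of the scores. The snippet below is only a rough sketch: `score_spread` is a hypothetical helper, and the scores in `runs` are made up for illustration, not taken from real GPT-4o output.

```python
from statistics import mean, pstdev


def score_spread(runs):
    """Summarise how much the score of each segment varies across runs.

    runs: list of runs, each a list of per-segment scores in the same order.
    """
    if not runs or any(len(r) != len(runs[0]) for r in runs):
        raise ValueError("every run must score the same segments")
    # Transpose so that each tuple holds all scores given to one segment.
    per_segment = zip(*runs)
    return [
        {"mean": mean(s), "stdev": pstdev(s), "range": max(s) - min(s)}
        for s in per_segment
    ]


# Made-up scores for three regenerations over five segments (illustration only).
runs = [
    [6, 2, 6, 2, 4],
    [6, 3, 6, 4, 5],
    [6, 2, 5, 3, 4],
]

for i, stats in enumerate(score_spread(runs), start=1):
    print(
        f"segment {i}: mean={stats['mean']:.2f} "
        f"stdev={stats['stdev']:.2f} range={stats['range']}"
    )
```

A large range on a segment would mean the score depends more on the sampling than on the translation itself.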
Prompt:
Access the machine translation quality on a continuous scale using the quality levels described as follows. You can also use any values in between.
0: "Nonsense/No meaning preserved": Nearly all information is lost between the translation and source. Grammar is irrelevant.
2: "Some meaning preserved": The translations preserves some of the meaning of the source but misses significant parts. The narrative is hard to follow due to fundamental errors. Grammar may be poor.
4: "Most meaning preserved and few grammar mistakes": The translation retains most of the meaning of the source. It may have some grammar mistakes or minor contextual inconsistencies.
6: "Perfect meaning and grammar": The meaning of the translation is completely consistent with the source. The grammar is also correct.
For each score double check that it satisfies the provided guidance and is not misleading.
Output a separate score for each translation. If the score is not perfect, highlight the mistakes, otherwise output only the score. Do not repeat the source and target text. Ouput has to be in the following JSON format:
[
{"score": N, "mistakes": ["<text with explanation 1>", "<text with explanation 2>"...]},
{"score": M}
]
Here are the translations from Russian to English.
Russian:
"Теперь у нас есть четырёхмесячные мыши, у которых больше нет диабета", — добавил он.
English:
“Now we have four-month-old mice that no longer have diabetes,” he added.
Russian:
Согласно предупреждению доктора Эхуда Ура (Ehud Ur), профессора медицины в Университете Дэлхаузи в Галифаксе (Новая Шотландия) и председателя клинико-научного отдела Канадской диабетической ассоциации, исследования все еще находятся на начальной стадии.
English:
According to the warning of Dr. Ehud Ur, professor of medicine at the University of Dalhausy in Halifax (New Scotia) and chairman of the clinical and scientific department of the Canadian Diabetes Association, the research is still in the initial stages.
Russian:
Как и некоторые другие эксперты, он сомневается, что диабет излечим, отмечая, что эти открытия не относятся к людям, у которых уже развился диабет 1 типа.
English:
Like some other experts, he doubts that diabetes is curable, noting that these findings do not apply to people who have already developed type 1 diabetes.
Russian:
В понедельник Сара Даниус, постоянный секретарь Нобелевского комитета по литературе при Шведской Академии, публично заявила во время передачи на шведской радиостанции Sveriges о том, что комитет, которому не удалось связаться с Бобом Диланом напрямую и сообщить ему, что он выиграл Нобелевскую премию по литературе за 2016 год, прекращает попытки выйти с ним на связь.
English:
On Monday, Sarah Danius, Permanent Secretary of the Nobel Committee for Literature at the Swedish Academy, publicly stated during a Swedish radio station Sveriges that the committee, which failed to contact Bob Dylan directly and to tell him that he had won the 2016 Nobel Prize in Literature was ending his attempts to get in touch with him.
Russian:
Даниус заявил: "Сейчас мы ничего не делаем. Я звонил и отправлял электронные письма его ближайшему сотруднику и получал очень дружелюбные ответы. На данный момент этого, конечно, достаточно".
English:
Danius said: "Now we are not doing anything. I called and sent emails to his closest employee and received very friendly answers. At the moment, this, of course, is enough."
Answer:
[
{"score": 6},
{"score": 2, "mistakes": ["\"According to the warning of Dr. Ehud Ur\" - The word \"warning\" is a mistranslation; the correct term should be \"statement\" or \"caution.\" \"New Scotia\" should be \"Nova Scotia.\""]},
{"score": 6},
{"score": 2, "mistakes": ["\"and to tell him that he had won the 2016 Nobel Prize in Literature was ending his attempts to get in touch with him\" - This part of the sentence is misleading and confusing. The phrase \"was ending his attempts to get in touch with him\" should be rephrased to indicate that the committee was ending its attempts to contact him."]},
{"score": 4, "mistakes": ["\"На данный момент этого, конечно, достаточно\" - The translation \"At the moment, this, of course, is enough\" slightly misses the contextual meaning. A more accurate translation would be \"For now, this is, of course, sufficient.\""]}
]
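Since the answer is requested as JSON, it can be checked mechanically before the scores are used. A minimal sketch, assuming the reply is exactly the JSON array described in the prompt (no surrounding text); `parse_scores` is a hypothetical helper and the sample reply is an abridged, made-up one:

```python
import json


def parse_scores(raw, expected_items, min_score=0.0, max_score=6.0):
    """Parse the model reply and check it against the format from the prompt.

    Returns a list of (score, mistakes) tuples, one per translation.
    """
    data = json.loads(raw)
    if not isinstance(data, list) or len(data) != expected_items:
        raise ValueError(f"expected a list of {expected_items} items")
    results = []
    for i, item in enumerate(data):
        if not isinstance(item, dict) or "score" not in item:
            raise ValueError(f"item {i}: missing 'score'")
        score = item["score"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"item {i}: score is not a number")
        if not min_score <= score <= max_score:
            raise ValueError(f"item {i}: score {score} is outside the scale")
        mistakes = item.get("mistakes", [])
        if not isinstance(mistakes, list):
            raise ValueError(f"item {i}: 'mistakes' is not a list")
        results.append((float(score), mistakes))
    return results


# Abridged, made-up reply in the requested format (illustration only).
raw = '[{"score": 6}, {"score": 2, "mistakes": ["New Scotia should be Nova Scotia."]}]'
print(parse_scores(raw, expected_items=2))
```

This only validates the shape of the reply. It does not tell whether a listed mistake is real or hallucinated.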
It would be interesting to compare the evaluation capabilities of LLMs with COMET and human evaluation.
See the paper: Large Language Models Are State-of-the-Art Evaluators of Translation Quality
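For such a comparison, one option would be a segment-level rank correlation between the LLM scores and the human (or COMET) scores. Below is a small sketch using Kendall's tau-b, implemented by hand to avoid extra dependencies. The choice of tau-b is my assumption, and all scores in the example are made up, so the printed numbers mean nothing.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: not counted at all
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    if denom == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return (concordant - discordant) / denom


# Made-up segment-level scores (illustration only, not real measurements).
llm_scores = [6, 2, 6, 2, 4]
human_scores = [6, 4, 6, 3, 5]
comet_scores = [0.91, 0.78, 0.93, 0.74, 0.85]

print("LLM vs human:", round(kendall_tau_b(llm_scores, human_scores), 3))
print("COMET vs human:", round(kendall_tau_b(comet_scores, human_scores), 3))
```

With real data this would need far more than 5 segments to say anything meaningful.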