machinewrapped / gpt-subtrans

Open Source project using LLMs to translate SRT subtitles

Character Name Being Translated #89

Closed by fenixpsicologia 5 months ago

fenixpsicologia commented 7 months ago

Hello, how are you? First, I would like to thank you for the tool!

Something I have noticed is that some names placed in the "Characters" column are being ignored by ChatGPT and being translated.

For example, in a context of translating from English to Portuguese, if the subtitle text mentions "MegaRed" and the "Characters" column has "MegaRed," ChatGPT is still translating "MegaRed" to "Mega Vermelho."

I understand that the "Characters" column represents proper names that should not be translated, but perhaps that is not its intended function, and I may have misunderstood. Either way, I wanted to provide my feedback.

machinewrapped commented 7 months ago

Hi, yes ideally GPT would recognise that it is a character name and should not be translated... but it is not very good at it :-/

I will see if I can improve the prompt so that it pays more attention to character names and understands that it should use them consistently. If you can provide me with an example subtitle where it is getting it wrong I'll take a look.

In the meantime you can use the substitutions option to convert back from the translated character name to the original name, e.g. add the following:

Mega Vermelho::MegaRed

It's not ideal, but may be worth the effort if you're translating a series and can build up a substitution list.
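As a rough illustration of what such a substitution list does, here is a minimal sketch that parses `before::after` pairs and applies them to translated text. It is not gpt-subtrans's actual implementation; the helper names `parse_substitutions` and `apply_substitutions` and the whole-word matching are assumptions made for the example.

```python
import re

def parse_substitutions(lines):
    """Parse 'before::after' pairs into a dict, skipping malformed lines."""
    pairs = {}
    for line in lines:
        before, sep, after = line.partition("::")
        if sep and before.strip():
            pairs[before.strip()] = after.strip()
    return pairs

def apply_substitutions(text, pairs):
    """Replace each whole-word occurrence of a key with its value."""
    for before, after in pairs.items():
        pattern = r"\b" + re.escape(before) + r"\b"
        # A function replacement avoids backslash escapes in `after`
        # being interpreted by re.sub.
        text = re.sub(pattern, lambda match: after, text)
    return text

# Hypothetical usage: undo the unwanted translation of a character name.
pairs = parse_substitutions(["Mega Vermelho::MegaRed"])
result = apply_substitutions("O Mega Vermelho chegou.", pairs)
print(result)  # O MegaRed chegou.
```

Whole-word matching is a design choice here so that a longer word that merely starts with the translated name is left alone.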

fenixpsicologia commented 7 months ago

I got a small sample from the subtitle.

https://drive.google.com/file/d/1E0g_OZ51IzJ9Tc1opdfAbr0lvlQroLs6/view?usp=sharing

List of characters (or proper names that do not need to be translated):

Kouichiro
Mega Red
Mega Pink
Mega Blue
Mega Yellow
Mega Black
Drill Saber
Mega Rod
Mega Tomahawk
Mega Sling
Mega Capture
Tomahawk Hurricane

machinewrapped commented 7 months ago

Sorry, I missed the file. Can you send a copy to machinewrapped at gmail dot com?

fenixpsicologia commented 6 months ago

Okay, I updated the link.

machinewrapped commented 6 months ago

I spent some time trying to improve the handling of the names in this example. v0.5.2 has some changes to try to encourage the AI to pay attention to the names list, but as I expected the GPT 3.5 turbo model isn't really smart enough to understand when to use the names. https://github.com/machinewrapped/gpt-subtrans/releases/tag/v0.5.2

Interestingly the turbo-instruct model performs noticeably better (and gpt-4 is better yet). https://github.com/machinewrapped/gpt-subtrans/discussions/101

v0.5.2 is a pre-release for now until I've done some more testing, and it may get superseded by a newer version, but I think it should be safe to use and it contains some general bug fixes. If you get a chance, please give it a try and let me know if you find it to be an improvement.

fenixpsicologia commented 5 months ago

I did some tests and it improved a little, but it still happens. I think it's something in the model itself.

machinewrapped commented 5 months ago

Yeah, it seems GPT3.5 isn't smart enough to reliably recognise when to use the provided names - GPT4 is much better, but you pay a big premium for it.
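To compare models or versions on this, one option is to count how often a listed name appears in the source line but is missing from the translated line. The sketch below is only an illustration under assumptions: the function `find_lost_names` is hypothetical, the sample lines are made up, and it assumes source and translated lines are already paired one to one.

```python
def find_lost_names(names, source_lines, translated_lines):
    """Return (line_index, name) for names present in the source but absent in the translation."""
    lost = []
    for index, (src, dst) in enumerate(zip(source_lines, translated_lines)):
        for name in names:
            # Case-insensitive substring check; a deliberately simple heuristic.
            if name.lower() in src.lower() and name.lower() not in dst.lower():
                lost.append((index, name))
    return lost

# Made-up sample data for illustration only.
names = ["Mega Red", "Kouichiro"]
source = ["Mega Red, watch out!", "Kouichiro is here."]
translated = ["Mega Vermelho, cuidado!", "Kouichiro está aqui."]

lost = find_lost_names(names, source, translated)
print(lost)  # [(0, 'Mega Red')]
```

Each flagged pair is a candidate for an entry in the substitutions list.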

Neoony commented 5 months ago

GPT 3.5 was always much worse at following instructions too, especially the more complicated ones, so I'm not surprised. The more tokens you sent, the worse it got, until it might slip away from the instructions completely. I always had to add many workarounds to make it do things properly, nearly all of which I could remove with GPT 4 or GPT 4 Turbo (this is just in my own projects).

I have not used the instruct/later versions of 3.5 that much though

Anyway, the GPT 4 Turbo models are quite a bit cheaper than plain GPT 4. I mean these models: gpt-4-0125-preview (the latest, which I use everywhere) and gpt-4-1106-preview. https://openai.com/pricing

The 128k input tokens are also amazing when you need many instructions and examples for something, although it will only output a maximum of 4096 tokens (for subtitles you don't want to exceed that). It follows instructions very well, in some scenarios maybe even too well 😄

It's just a night and day difference between 3.5 and 4 or 4 Turbo.

I just like that one better overall for everything, plus the training data goes up to April 2023. It works great for subtitles too, although I never do anything with names.

But yeah, 3.5 is indeed super cheap.

machinewrapped commented 5 months ago

I'll have to do more testing with the newer models. I've found that whilst GPT4 exhibits better comprehension (e.g. the summaries it provides are much more detailed) the quality of the translation isn't necessarily better. When I translate something with GPT4 I usually do another pass with 3.5-turbo and merge them by picking the best lines from each (https://github.com/machinewrapped/srt-merger) - the result tends to be quite evenly split between them.

It might depend on the target language though, translating to English is probably "easy" due to the wealth of English text in the training data, so GPT4's deeper understanding could make a bigger difference when translating to a less common language.

Neoony commented 5 months ago

It's specifically mentioned that GPT4 is better with non-English languages.

I always translate to Czech and it's just great, much better than any human translator. Of course there are things it can't get fully right, such as whether a man or a woman is speaking, but that makes complete sense and is expected: in most cases there is no way it can know, since it does not know who is speaking unless the subtitles include names. These are really minor details for me.

My main issue with (Czech) human translations is that they often completely change the sentence and just get the meaning across (or sometimes even change the meaning). Or they translate a sentence but arrange it in almost exactly the opposite order. Or they split a sentence and put the part that was said later first and the earlier part later, so it doesn't fit what the actors said at that moment. They also often want the subtitles out ASAP, so they rush it. Human translators simply often have a hard time working out how to phrase a sentence so that it fits the original more closely, word for word. GPT, with its huge dataset, can almost always build the sentence almost exactly the same way as it was in English, and it doesn't skip complicated words and the like. That also makes it much better for learning English from subtitles, because you don't get completely different words and sentences translated just to get the meaning across.

I did also tell it to "translate as is" in my instructions. Sometimes you just can't build the sentence the same way and still be correct, but it does a really good job anyway. The results are still correct sentences that make sense. It doesn't translate literally as is, since it's a different language that works differently, but it's just much, much better at staying close to the original.

Normally I don't even use subtitles, but someone close to me needs them and also learns English from them. Really glad you created this software 😃

I'm not really talking about professional human translations, but rather what you find right after a movie or TV show comes out, when there are no professional translations yet (there rarely are). I'm not expecting super polished results, just a good translation that doesn't skip or rearrange words and sentences or sometimes change the meaning completely. The GPT 4 Turbo models always deliver this for me, in one go, with only rare issues when the source subtitles have some weird content or formatting. The only mistakes are ones that it understandably can't avoid, such as occasionally getting the speaker's gender wrong, but that's about it.

But yeah, I have never even bothered using 3.5 for subtitle translation. I'm just not looking to use that model; I've had enough of it and its issues 😄

machinewrapped commented 5 months ago

That's great to hear!