ryanhex53 / gpt-po

Command-line tool for translating po files using the OpenAI API

sharing my thoughts on efficiently translating po files with LLMs... #14

Open leolivier opened 3 months ago

leolivier commented 3 months ago

Hi @ryanhex53. This is not an issue, just me sharing my thoughts on efficiently translating po files with LLMs.

I stumbled upon your project (and others) while looking for a way to translate po files using an already translated po file as context to disambiguate the translations. For example, suppose you provide a po file with:

msgid "Bank"
msgstr ""

No tool, no matter how intelligent, is ever going to know if it's talking about a financial company or the banks of a river. But if you provide a French translation:

msgid "Bank"
msgstr "Banque"

then any translation-capable LLM should be able to translate that po file entry into Spanish, German, or any other language it knows.
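As a minimal sketch of the idea, assuming an instruction-following chat model, the existing translation can be passed along with the msgid in the prompt. The function name `build_prompt` and the prompt wording are hypothetical, chosen only for illustration; they are not part of gpt-po or of any library.

```python
def build_prompt(msgid, reference, reference_lang, target_lang):
    """Build a translation prompt that uses an existing translation as context.

    Hypothetical helper: the wording is an assumption, not a tested prompt.
    """
    lines = [f"Translate the English text into {target_lang}."]
    if reference:
        # Only mention the reference translation when one actually exists
        lines.append(
            f"Use the existing {reference_lang} translation to pick the right meaning."
        )
    lines.append(f"English: {msgid}")
    if reference:
        lines.append(f"{reference_lang}: {reference}")
    # Leave the last line open so the model completes it with the translation
    lines.append(f"{target_lang}:")
    return "\n".join(lines)


prompt = build_prompt("Bank", "Banque", "French", "Spanish")
print(prompt)
```

With "Banque" in the prompt, the model has a chance to choose the financial sense rather than a riverbank.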

Have you ever thought about this?

Actually, with the help of Claude and ChatGPT (free versions ;) I tried it myself and ended up with this pretty simple piece of Python code (I know yours is TypeScript, this is just an example) that:

  1. loads an LLM specialized in translation (facebook/mbart-large-50-many-to-many-mmt),
  2. reads the po file entry by entry,
  3. tries to translate the English msgid, using the French translation as context:
import polib
from transformers import pipeline

def translate_po_file(input_file, output_file):
    # Load the multilingual translation model
    translator = pipeline("translation", model="facebook/mbart-large-50-many-to-many-mmt")
    # Load the input .po file
    po = polib.pofile(input_file)
    # Go through each entry and translate it
    for entry in po:
        if entry.msgid and not entry.fuzzy:
            # Use the existing translation as context, falling back to the msgid
            context = entry.msgstr if entry.msgstr else entry.msgid
            text_to_translate = entry.msgid
            # Build the prompt
            prompt = f"Translate to Spanish. Context: {context}\nText: {text_to_translate}"
            # Translate into Spanish
            translation = translator(prompt, src_lang="en_XX", tgt_lang="es_XX")[0]['translation_text']
            # Extract the translated part (after "Text: ")
            translation = translation.split("Text: ")[-1].strip()
            # Update the translation
            entry.msgstr = translation
    # Save the new .po file once, after all entries have been processed
    po.save(output_file)

# Example of use
input_file = "input.po"
output_file = "output.es.po"
translate_po_file(input_file, output_file)

You'll need to `pip install` several libraries before running this code (I ran several tests, so I'm not sure they're all still needed):

polib
transformers
torch
sentencepiece
sacremoses
protobuf

Unfortunately, this does not work very well. I think I first have to deal with the placeholders included in po files (e.g. {some_variable}, %s, or %(some_variable)s) and probably provide a much better prompt that explains to the model how to use the context for translation.

So I had a look on GitHub, where there are a lot of "gpt for po" projects like yours, but I didn't find any that use an existing translation as disambiguation context. I think this is absolutely key for po files, where sentences are very short (sometimes just one word) and thus don't give the translator enough context to work properly.
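One common way to handle placeholders, sketched here as an assumption rather than something tested against mBART, is to swap them for opaque tokens before translation and restore them afterwards. The helper names and the `__PH0__` token format are hypothetical, and the regex is deliberately simplified: it does not cover printf width or precision specifiers such as %5.2f, and it would also match a literal percent sign followed directly by a letter.

```python
import re

# Simplified pattern: %(name)s style, bare %s / %d style, and {name} fields
PLACEHOLDER_RE = re.compile(r"%\([^)]+\)[a-zA-Z]|%[a-zA-Z]|\{[^{}]*\}")


def mask_placeholders(text):
    """Replace each placeholder with a token; return the masked text and the originals."""
    found = []

    def repl(match):
        found.append(match.group(0))
        return f"__PH{len(found) - 1}__"

    return PLACEHOLDER_RE.sub(repl, text), found


def unmask_placeholders(text, found):
    """Put the original placeholders back in place of their tokens."""
    for i, original in enumerate(found):
        text = text.replace(f"__PH{i}__", original)
    return text


masked, found = mask_placeholders("Hello %(user)s, you have {count} new %s")
print(masked)
print(unmask_placeholders(masked, found))
```

Whether a given model leaves such tokens untouched would still need to be checked, for example by verifying that every token is present in the output before unmasking.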

Maybe the start of a PhD thesis :D

ryanhex53 commented 21 minutes ago

To some extent, I may have a way to improve this: sending more translation entries at once. This would also avoid resending the same system prompt with every request. I'm planning to make improvements on this in the next version.
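A rough sketch of what such batching could look like, written in Python to match the example above. This is not gpt-po's implementation: the function names, the JSON-array format, and the prompt wording are all assumptions, and the actual API call is left out.

```python
import json


def chunked(items, size):
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_prompt(msgids, target_lang="Spanish"):
    """Pack several msgids into a single request as a JSON array."""
    return (
        f"Translate each string in this JSON array into {target_lang}. "
        "Reply with a JSON array of the same length, in the same order.\n"
        + json.dumps(msgids, ensure_ascii=False)
    )


def parse_batch_reply(reply, msgids):
    """Map the model's JSON reply back onto the msgids; reject mismatched replies."""
    translations = json.loads(reply)
    if not isinstance(translations, list) or len(translations) != len(msgids):
        raise ValueError("reply does not match the batch that was sent")
    return dict(zip(msgids, translations))


batches = list(chunked(["Bank", "Account", "Transfer", "River"], 3))
# Stand-in for a model reply to the first batch (no API is called here)
fake_reply = '["Banco", "Cuenta", "Transferencia"]'
result = parse_batch_reply(fake_reply, batches[0])
print(result)
```

Sending related entries together gives the model neighbouring strings as context, and the length check guards against a reply that drops or merges entries.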