nidhaloff / deep-translator

A flexible free and unlimited python tool to translate between different languages in a simple way using multiple translators.
https://deep-translator.readthedocs.io/en/latest/?badge=latest
Apache License 2.0

Translation from Simplified Chinese to Eng successful but incorrect or unreliable if run multiple times #273

Open RiccardoSH opened 1 month ago

RiccardoSH commented 1 month ago

Description

I tested batch translation of a column of product names from Simplified Chinese (Mandarin) to English. The code below ran without raising any errors, but after manually checking samples of the resulting translations exported to .csv and .xlsx, I found that many were incorrect.

Batch size: over 5,000 rows in the df['Product_Name'] column (see below). There are no N/A or missing values in the original column or in the resulting df['Product_Name_Eng'] column.

Example of a wrong translation: "life space益生菌大人调理肠胃肠道双歧杆菌元免疫力提旗舰店正品" has been translated as "[Second sale] wlab makeup primer, primer, invisible pores, flagship store, genuine product, valid until 24/06" on the corresponding row.

Not all are wrong; I'd say most are correct (I stopped checking manually after a while). Example of a successful translation: "康萃乐儿童益生菌宝宝婴幼儿调理肠胃鼠李糖乳杆菌lgg冲剂30袋" has been translated as "Kangcuile children's probiotics baby infant gastrointestinal conditioning Lactobacillus rhamnosus LGG granules 30 bags" on the corresponding row.

What I did

I ran the following in JupyterLab 3.6.3:

import pandas as pd
from deep_translator import GoogleTranslator
from concurrent.futures import ThreadPoolExecutor

df = pd.read_csv('test_translation.csv')
translator = GoogleTranslator(source='chinese (simplified)', target='english')

def translate_text(text):
    try:
        return translator.translate(text)
    except Exception as e:
        return text  # Return the original text if translation fails

def batch_translate_texts(texts):
    with ThreadPoolExecutor(max_workers=10) as executor:
        translated_texts = list(executor.map(translate_text, texts))
    return translated_texts

product_names = df['Product_Name'].astype(str).tolist()
translated_names = batch_translate_texts(product_names)

df['Product_Name_Eng'] = translated_names

I then re-ran the same code on a smaller batch of 600+ rows, and the string that was translated incorrectly in my example was translated correctly the second time.

Same string this time translated correctly: "life space益生菌大人调理肠胃肠道双歧杆菌元免疫力提旗舰店正品" has been translated as "Life space probiotics for adults regulate gastrointestinal tract Bifidobacterium Yuan immunity enhance flagship store authentic" which is an acceptable automatic translation for my project.
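
Since the same input produced different output on two runs, comparing the runs row by row is one way to find affected rows without checking everything by hand. The snippet below is only a sketch: find_inconsistent_rows is a hypothetical helper, not part of deep-translator, and it assumes both runs cover the same rows in the same order. A difference between runs is a hint rather than proof of an error, because the translation service may legitimately vary its wording.

```python
def find_inconsistent_rows(originals, first_run, second_run):
    """Return (index, original, first, second) for rows whose two translations differ.

    Hypothetical helper for illustration; all three lists must be aligned row by row.
    """
    if not (len(originals) == len(first_run) == len(second_run)):
        raise ValueError("all three lists must have the same length")

    suspects = []
    for i, (original, first, second) in enumerate(zip(originals, first_run, second_run)):
        if first != second:
            # The two runs disagree on this row, so it deserves a manual look.
            suspects.append((i, original, first, second))
    return suspects
```

With two result columns from separate runs, the call could look like find_inconsistent_rows(product_names, run_1, run_2), where run_1 and run_2 are placeholder names for the two translated lists.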

I need help understanding how to ensure reliable translation results. Thank you.
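
One plausible cause, not confirmed in this thread, is that the single translator object is shared by all ten worker threads. If the object keeps per-request state (such as the text to send) on the instance, two threads can overwrite each other's input, which would explain a row receiving the translation of a different product. The sketch below gives each worker thread its own translator. The make_translator parameter and the make_google_translator helper are assumptions introduced here for illustration; only GoogleTranslator(source=..., target=...) and .translate() come from the code above. It also returns None on failure instead of the original text, so failed rows are visible.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def make_google_translator():
    # Imported lazily so the helper below can be tried without the library installed.
    from deep_translator import GoogleTranslator
    return GoogleTranslator(source='chinese (simplified)', target='english')


def batch_translate_texts(texts, make_translator=make_google_translator, max_workers=10):
    """Translate texts concurrently, using one translator instance per worker thread."""
    local = threading.local()  # separate attribute storage for every thread

    def translate_text(text):
        # Create the translator the first time this thread needs it.
        if not hasattr(local, "translator"):
            local.translator = make_translator()
        try:
            return local.translator.translate(text)
        except Exception:
            return None  # mark the failure instead of passing the input through

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the inputs.
        return list(executor.map(translate_text, texts))
```

If results are still inconsistent with per-thread instances, running with max_workers=1 would help separate a threading problem from variation in the translation service itself.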