Warning: Can only detect less than 5000 characters

lushan88a / google_trans_new

A free and unlimited python API for google translate.

MIT License

Warning: Can only detect less than 5000 characters #29

Open bjquinniii opened 3 years ago

bjquinniii commented 3 years ago

In general, when working with Google translation services, I prefer to let Google detect the source language. So my original call looked like this:

postTextTrans = translator.translate(postText, lang_tgt='en')

This works great, except when the source text exceeds 5000 characters, at which point I started seeing the error in the subject line.

So I changed my call to this:

postTextTrans = translator.translate(postText, lang_src='pt', lang_tgt='en')

It seems like specifying the source language should bypass the detection step and avoid the error, but I'm still getting the same error. That probably means that either the error message does not really describe the actual complaint, or the lang_src parameter isn't working properly.

Has anyone else run into this? Or found a solution?

I'm probably going to work on implementing some code to break the text into smaller chunks and then reassemble the results, but would prefer not to have to do that.

danisuba10 commented 3 years ago

I had this same problem. Google Translate has a limit of 5000 characters, which you cannot bypass. You need to break the file into pieces that are smaller than 5000 characters. If you need help implementing this, I can try explaining it or showing the code.

bjquinniii commented 3 years ago

Another case of a slightly misleading error message... I think I can implement the code. I've been thinking I should just divide the text block into sentences, but while pondering that I realized abbreviations might throw off a simple string.split('.') type of logic. If you've come up with a good solution, I would appreciate seeing it.
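To illustrate the abbreviation concern, here is a small sketch using a made-up sample sentence (not from the thread). A plain split on '.' cuts after every abbreviation as well as at the real sentence ends:

```python
# Hypothetical sample text, used only to show the pitfall.
sample = "Dr. Silva arrived at 3 p.m. on Friday. He left early."

# Naive sentence splitting: every '.' is treated as a sentence end.
pieces = [p.strip() for p in sample.split('.') if p.strip()]
print(pieces)
# ['Dr', 'Silva arrived at 3 p', 'm', 'on Friday', 'He left early']
```

The text has two sentences, but the naive split produces five fragments, and the periods themselves are lost.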

danisuba10 commented 3 years ago

I can't show you code now, only tomorrow, but I can describe the solution. Splitting the file into sentences may work, but it isn't really efficient. What I've done is start from position 5000 and iterate backwards character by character until I find a "." or a Japanese/Chinese "。", etc. You can choose multiple characters; the important thing is that you stop at the first character which indicates the end of a sentence. Now you slice the string with part = file[start:end], which will give you a 48xx-49xx character part that you then translate and append to the output file. You add 5000 to the last found position of a "." and iterate backwards again. If end > len(file) (where file is the string containing the whole book/text), stop. After this, check whether there is a remaining part which is untranslated. If you can't understand something, ask me to clarify. I will send you the code tomorrow.

danisuba10 commented 3 years ago

```python
import time


def translate_whole(file, result_file_translate_whole, result_language_translate_whole):
    # trans() translates one chunk and appends it to the output; it is not shown in this comment.
    t_start = time.perf_counter()

    length = len(file)
    start = 5000  # candidate end of the current chunk
    last = 0      # beginning of the current chunk
    while start < length:
        # Walk backwards from the 5000-character mark to the nearest break character.
        for i in range(start, last, -1):
            if file[i] in ' 。.」':
                start = i
                break
        # If no break character was found, start is unchanged and the chunk is cut at the limit.
        trans(file[last:start], result_file_translate_whole, result_language_translate_whole)
        last = start
        start = last + 5000
    # Translate whatever remains after the last full chunk.
    if last < length:
        trans(file[last:length], result_file_translate_whole, result_language_translate_whole)

    t_stop = time.perf_counter()
    print(f'Translation time was: {t_stop - t_start} seconds')
```

For some reason part of the code doesn't get included, sorry for that.
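Since the chunk-translating helper is not shown in the thread, here is a self-contained sketch of the same backwards-scan idea that returns the chunks and reassembles the results. The names split_into_chunks and translate_long, the breaks set, and the max_len default of 4999 are assumptions for illustration (the default is inferred from the warning text, which says fewer than 5000 characters); the translate argument is any callable you supply.

```python
def split_into_chunks(text, max_len=4999, breaks=".。」 "):
    """Split text into pieces of at most max_len characters,
    preferring to cut right after a break character."""
    chunks = []
    pos = 0
    while len(text) - pos > max_len:
        end = pos + max_len  # furthest allowed cut position (exclusive)
        cut = end            # fallback: hard cut if no break character is found
        # Walk backwards from the limit looking for a break character.
        for i in range(end - 1, pos, -1):
            if text[i] in breaks:
                cut = i + 1  # keep the break character with this chunk
                break
        chunks.append(text[pos:cut])
        pos = cut
    if pos < len(text):
        chunks.append(text[pos:])  # remaining tail, already within the limit
    return chunks


def translate_long(text, translate, max_len=4999):
    """Translate each chunk with the supplied callable and join the results."""
    return "".join(translate(chunk) for chunk in split_into_chunks(text, max_len))
```

With the translator object from the first post, the callable could be something like lambda s: translator.translate(s, lang_tgt='en'). A real translation service may trim whitespace at chunk edges, so joining with a space instead of an empty string might be needed.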