How to translate subtitle .srt

ikergarcia1996 / Easy-Translate

Easy-Translate is a script for translating large text files with a SINGLE COMMAND. Easy-Translate is designed to be as easy as possible for beginners and as seamless and customizable as possible for advanced users.

Apache License 2.0


I use this command:

python3 translate.py \
--sentences_path input.srt \
--output_path result.srt \
--source_lang eng_Latn \
--target_lang ind_Latn \
--model_name facebook/nllb-200-distilled-600M \
--precision fp16

with this input.srt:

1
00:00:07,312 --> 00:00:09,993
Hello.

2
00:00:09,994 --> 00:00:11,227
Where are you right now?

3
00:00:11,228 --> 00:00:13,360
Right now I am on my way
to South Dakota.

4
00:00:13,361 --> 00:00:16,093
Gonna do a little camping,
do a little fishing.

5
00:00:16,094 --> 00:00:17,426
Good for you, Colter.

but the result.srt has problems:

wrong order
empty lines are replaced with (dalam bahasa Inggris)
unknown text is appended to the cue numbers

1
00:00:07,312 --> 00:00:09,993
Hei, apa yang kau lakukan?
(dalam bahasa Inggris) <-- this should be an empty line
2 (satu) <-- the '(satu)' should not exist
00:00:09,994 --> 00:00:11,227
Di mana kau sekarang?
(dalam bahasa Inggris) ....
3 Pemberantasan Korupsi <-- this should not exist either
00:00:11,228 --> 00:00:13,360
Saat ini aku sedang dalam perjalanan
ke Dakota Selatan.
(dalam bahasa Inggris) ...
4
00:00:13,361 --> 00:00:16,093
Akan pergi berkemah sedikit,
lakukan sedikit memancing.
(dalam bahasa Inggris) ...
5
00:00:16,094 --> 00:00:17,426
Bagus untukmu, Colter.
(dalam bahasa Inggris) ...

I had a similar need, and the issue of course boils down to Easy-Translate requiring that every line in the input file be translatable.

The attached patch changes this: a line that contains only numbers and/or non-alphabetical characters is not translated, but pulled aside and printed back out during the output phase. There may be a cleaner way, but it appears that whatever is added to the PyTorch Dataset structure has to be compatible with accelerator.prepare(), so as a workaround a collate_fn wrapper separates out any non-tokenized items.
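
To illustrate the idea outside of Easy-Translate, here is a minimal sketch of the "pull aside, then print back" step on a plain list of lines. It is not the code from the patch (which works through a collate_fn wrapper); `translate_text_lines` and the `translate_batch` callable are hypothetical names, and the rule "a line is translatable only if it contains at least one letter" is my assumption of what the filter amounts to.

```python
import re
from typing import Callable, List

# Assumption: a line is worth translating only if it contains at least one letter.
# [^\W\d_] matches any Unicode letter (word character that is not a digit or "_").
HAS_LETTER = re.compile(r"[^\W\d_]")


def translate_text_lines(
    lines: List[str],
    translate_batch: Callable[[List[str]], List[str]],  # hypothetical translator
) -> List[str]:
    """Translate only lines that contain letters; keep all other lines as-is."""
    # Remember where the translatable lines sit so they can be put back in place.
    text_positions = [i for i, line in enumerate(lines) if HAS_LETTER.search(line)]
    translated = translate_batch([lines[i] for i in text_positions])
    if len(translated) != len(text_positions):
        raise ValueError("translator must return one output per input line")

    # Start from a copy of the input so cue numbers, timestamps and blank
    # lines pass through untouched, then overwrite the translated positions.
    output = list(lines)
    for position, new_text in zip(text_positions, translated):
        output[position] = new_text
    return output
```

With the input.srt above, the cue numbers, the timestamp lines and the empty separator lines would never reach the model, so it has nothing to hallucinate on.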

IMO, optimally the project could be reworked so that it is easier to call iteratively while parsing a file from a separate utility. As a smaller change, a parameter could be added to translate.py that specifies a regex to select which lines to translate. Regardless, I didn't have the motivation to attempt a cleaner solution, so I didn't open a PR, but I do use this patch to translate SRT files, so maybe it helps you.
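
As a rough sketch of what the "separate utility" variant could look like: parse the SRT into cues and hand only the cue text to a translator, one cue at a time. Nothing here is part of Easy-Translate; `parse_srt`, `translate_srt` and the `translate` callable are hypothetical, and joining a multi-line cue into one string before translating is my own choice (it gives the model a whole sentence, at the cost of the original line breaks).

```python
import re
from typing import Callable, Iterator, List, Tuple

# One cue: (index line, timing line, text lines)
Cue = Tuple[str, str, List[str]]


def parse_srt(content: str) -> Iterator[Cue]:
    """Yield cues from SRT text; cues are separated by blank lines."""
    normalized = content.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.splitlines()
        if len(lines) >= 2:  # need at least an index and a timing line
            yield lines[0], lines[1], lines[2:]


def translate_srt(content: str, translate: Callable[[str], str]) -> str:
    """Rebuild the SRT, translating only the text of each cue."""
    blocks = []
    for index, timing, text in parse_srt(content):
        # Join wrapped subtitle lines so the translator sees a full sentence.
        translated = translate(" ".join(text)) if text else ""
        blocks.append("\n".join([index, timing, translated]))
    return "\n\n".join(blocks) + "\n"
```

Here `translate` stands in for whatever actually runs the model; the index and timing lines are copied through verbatim.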

EasyTranslate_retain-nontext.patch.txt

(Put it in the code directory and run patch -p1 < EasyTranslate_retain-nontext.patch.txt)

How to translate subtitle .srt #14