ikergarcia1996 / Easy-Translate

Easy-Translate is a script for translating large text files with a SINGLE COMMAND. Easy-Translate is designed to be as easy as possible for beginners and as seamless and customizable as possible for advanced users.
Apache License 2.0
189 stars, 307 forks

How to translate subtitle .srt #14

Open ewwink opened 7 months ago

ewwink commented 7 months ago

I use this command

python3 translate.py \
--sentences_path input.srt \
--output_path result.srt \
--source_lang eng_Latn \
--target_lang ind_Latn \
--model_name facebook/nllb-200-distilled-600M \
--precision fp16

with input.srt

1
00:00:07,312 --> 00:00:09,993
Hello.

2
00:00:09,994 --> 00:00:11,227
Where are you right now?

3
00:00:11,228 --> 00:00:13,360
Right now I am on my way
to South Dakota.

4
00:00:13,361 --> 00:00:16,093
Gonna do a little camping,
do a little fishing.

5
00:00:16,094 --> 00:00:17,426
Good for you, Colter.

but the result.srt has problems:

1
00:00:07,312 --> 00:00:09,993
Hei, apa yang kau lakukan?
(dalam bahasa Inggris) <-- this should be an empty line
2 (satu) <-- the '(satu)' should not exist
00:00:09,994 --> 00:00:11,227
Di mana kau sekarang?
(dalam bahasa Inggris) ....
3 Pemberantasan Korupsi <-- this also should not exist
00:00:11,228 --> 00:00:13,360
Saat ini aku sedang dalam perjalanan
ke Dakota Selatan.
(dalam bahasa Inggris) ...
4
00:00:13,361 --> 00:00:16,093
Akan pergi berkemah sedikit,
lakukan sedikit memancing.
(dalam bahasa Inggris) ...
5
00:00:16,094 --> 00:00:17,426
Bagus untukmu, Colter.
(dalam bahasa Inggris) ...

stt commented 6 months ago

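I had a similar need, and the issue ofc boils down to Easy-Translate requiring that every line in the input file be translatable. In an .srt file the cue numbers, timestamps and blank lines are not, yet they get sent through the model like any other line.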

The attached patch makes it so that a line containing only numbers and/or non-alphabetical characters is not translated; it is pulled aside and then printed back out unchanged during the output phase. There may be a cleaner way, but it appears that whatever is added to the PyTorch Dataset structure has to be compatible with accelerator.prepare(), so as a workaround a collate_fn wrapper separates out any non-tokenized items.
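For anyone who would rather not patch the code, the same "pull aside, then put back" idea can be sketched as a pre/post-processing step outside of translate.py. This is not what the patch does internally (the patch works inside the Dataset and collate_fn); it is only a minimal sketch under the assumption that a line is worth translating if it contains at least one letter. The names split_lines and merge_lines are made up for this example.

```python
import re

# Assumption: a line is translatable if it contains at least one letter.
# [^\W\d_] matches any Unicode letter (word character that is not a digit or "_").
HAS_LETTER = re.compile(r"[^\W\d_]")


def split_lines(lines):
    """Separate translatable lines from everything else.

    Returns (text_lines, slots). `slots` has one entry per input line:
    None where a translated line belongs, otherwise the original line.
    """
    text_lines, slots = [], []
    for line in lines:
        if HAS_LETTER.search(line):
            text_lines.append(line)
            slots.append(None)
        else:
            slots.append(line)
    return text_lines, slots


def merge_lines(slots, translated):
    """Put translated lines back into the positions marked None in `slots`."""
    expected = sum(1 for slot in slots if slot is None)
    if expected != len(translated):
        raise ValueError(f"expected {expected} translated lines, got {len(translated)}")
    it = iter(translated)
    return [next(it) if slot is None else slot for slot in slots]
```

The idea would be to write text_lines to a file, run translate.py on that file with the same --sentences_path / --output_path options as in the original command, and then call merge_lines with the lines of the translated output. It relies on the output having exactly one line per input line.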

IMO optimally the project could be reworked so that it is easier to call iteratively while parsing a file from a separate utility. As a smaller change, a parameter could be added to translate.py that specifies a regex selecting which lines to translate. Regardless, I didn't have the motivation to attempt a cleaner solution, so I didn't open a PR, but I do use this to translate SRT files, so maybe it helps you.
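To illustrate the "separate utility" idea, here is a rough sketch of what such a wrapper could look like if the translator were callable per string. Nothing here is Easy-Translate's API: parse_srt and translate_srt are hypothetical names, and translate is a stand-in for any function that takes a string and returns its translation.

```python
import re


def parse_srt(text):
    """Parse SRT text into a list of (index, timing, text_lines) cues."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue  # not a valid cue: needs at least an index and a timing line
        cues.append((lines[0], lines[1], lines[2:]))
    return cues


def translate_srt(text, translate):
    """Translate only the cue text; cue numbers and timings pass through untouched.

    `translate` is a hypothetical callable: str -> str.
    """
    out = []
    for index, timing, body in parse_srt(text):
        # Join a multi-line cue into one sentence so the model sees full context.
        translated = translate(" ".join(body)) if body else ""
        out.append(f"{index}\n{timing}\n{translated}\n")
    return "\n".join(out)
```

Joining the text lines of a cue means a sentence such as "Right now I am on my way / to South Dakota." is translated as a whole instead of as two fragments, at the cost of losing the original line break inside the cue.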

EasyTranslate_retain-nontext.patch.txt

(put in the code directory and run patch -p1 <EasyTranslate_retain-nontext.patch.txt)