Open ewwink opened 7 months ago
I had a similar need, and the issue ofc boils down to EasyTranslate requiring that every line in the input file be translatable.
The attached patch makes it so that a line containing only numbers and/or non-alphabetic characters is not translated; it is pulled aside and then printed back out during the output phase. Maybe there's a cleaner way, but it appears that whatever is added to the PyTorch Dataset structure has to be compatible with accelerator.prepare(), so as a workaround a collate_fn wrapper separates out any non-tokenized items.
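To illustrate the idea (this is not the code from the patch, just a minimal sketch), the "pull aside and print back" step can be done in plain Python before and after translation. The function names split_lines and merge_lines are made up for this example, and "translatable" is assumed to mean "contains at least one letter":

```python
import re

# Assumption: a line is worth translating only if it has at least one letter.
# [^\W\d_] matches any Unicode letter (word char that is not a digit or "_").
_HAS_LETTER = re.compile(r"[^\W\d_]")


def split_lines(lines):
    """Separate translatable lines from lines to pass through untouched.

    Returns (to_translate, passthrough), where to_translate is a list of
    (index, text) pairs and passthrough maps index -> original text.
    """
    to_translate = []
    passthrough = {}
    for i, line in enumerate(lines):
        if _HAS_LETTER.search(line):
            to_translate.append((i, line))
        else:
            passthrough[i] = line
    return to_translate, passthrough


def merge_lines(translated, passthrough):
    """Rebuild the file in its original order.

    translated is an iterable of (index, translated_text) pairs.
    """
    merged = dict(passthrough)
    merged.update(translated)
    return [merged[i] for i in sorted(merged)]
```

With an SRT file, the cue numbers, timestamp lines and blank lines end up in passthrough, so only the subtitle text is sent to the model and the layout survives the round trip.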
IMO, optimally the project could be reworked so that it is easier to call iteratively while parsing a file from a separate utility. As a smaller change, a parameter could be added to translate.py that specifies a regex to select which lines to translate. Regardless, I didn't have the motivation to attempt a cleaner solution, so I didn't open a PR, but I do use this to translate SRT files, so maybe it helps you.
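A rough sketch of what the regex idea could look like from the caller's side. Nothing here exists in EasyTranslate: translate_selected is a hypothetical helper, and translate_fn stands in for whatever actually runs the model on a batch of strings.

```python
import re


def translate_selected(lines, pattern, translate_fn):
    """Translate only the lines matching `pattern`; keep the rest as-is.

    translate_fn is a placeholder: it takes a list of strings and must
    return a list of translated strings of the same length.
    """
    selector = re.compile(pattern)
    picked = [i for i, line in enumerate(lines) if selector.search(line)]
    results = translate_fn([lines[i] for i in picked])
    if len(results) != len(picked):
        raise ValueError("translate_fn must return one output per input line")
    out = list(lines)  # copy so the caller's list is not modified
    for i, text in zip(picked, results):
        out[i] = text
    return out
```

For SRT input, a pattern such as r"[A-Za-z]" would skip cue numbers and timestamps; a stricter pattern would be needed for non-Latin source languages.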
EasyTranslate_retain-nontext.patch.txt
(put it in the code directory and run: patch -p1 < EasyTranslate_retain-nontext.patch.txt)
I use this command
with input.srt
but the result.srt has problems:
(dalam bahasa Inggris)