shashikg / WhisperS2T

An Optimized Speech-to-Text Pipeline for the Whisper Model Supporting Multiple Inference Engine
MIT License

Non-Latin characters cannot be exported to files #53

Closed: EricBizet closed this issue 4 months ago

EricBizet commented 8 months ago

When exporting a Japanese transcript to a file, I got the following error:

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/whisper_s2t/utils.py:95, in ExportVTT(transcript, file, single_sentence_in_one_utterance, end_punct_marks)
     93 f.write("WEBVTT\n\n")
     94 for _utt in transcript:
---> 95     f.write(f"{format_timestamp(_utt['start_time'])} --> {format_timestamp(_utt['end_time'])}\n{_utt['text']}\n\n")

UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-25: ordinal not in range(128)

I have proposed a fix for exporting results to .srt, .txt, and other file formats in https://github.com/shashikg/WhisperS2T/pull/52
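
For readers hitting the same traceback: an `'ascii' codec can't encode` error on `f.write` usually means the file was opened without an explicit encoding, so Python fell back to the locale's default, which was ASCII here. The sketch below shows the general workaround of passing `encoding="utf-8"` to `open()`. It is not the contents of the linked pull request, which this thread does not show. The names `export_vtt_utf8` and `format_timestamp_sketch` are hypothetical stand-ins, not part of `whisper_s2t`, and the timestamp format is an assumption.

```python
import os
import tempfile


def format_timestamp_sketch(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm (hypothetical stand-in for the library helper)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def export_vtt_utf8(transcript, file):
    """Write utterances to a WebVTT file, always encoding as UTF-8.

    Hypothetical helper: it mirrors the loop in the traceback, but opens the
    file with an explicit encoding instead of relying on the locale default.
    """
    with open(file, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for utt in transcript:
            start = format_timestamp_sketch(utt["start_time"])
            end = format_timestamp_sketch(utt["end_time"])
            f.write(f"{start} --> {end}\n{utt['text']}\n\n")


def demo():
    """Export one Japanese utterance to a temporary file and read it back."""
    transcript = [{"start_time": 0.0, "end_time": 1.5, "text": "こんにちは"}]
    path = os.path.join(tempfile.mkdtemp(), "demo.vtt")
    export_vtt_utf8(transcript, path)
    with open(path, encoding="utf-8") as f:
        return f.read()
```

Calling `demo()` writes a single Japanese utterance and reads it back without raising `UnicodeEncodeError`, regardless of the locale's default encoding. If the library code cannot be patched, running Python in UTF-8 mode (for example by setting the `PYTHONUTF8=1` environment variable) may also work, though that changes the default for the whole process.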