pndurette / gTTS

Python library and CLI tool to interface with Google Translate's text-to-speech API
http://gtts.readthedocs.org/
MIT License

0xA0 is causing gtts-cli to send EOF. #353

Open medanisjbara opened 2 years ago

medanisjbara commented 2 years ago


Current Behaviour (steps to reproduce)

gtts-cli mostly ignores 0xA0 in the input text. But in certain situations (see the provided example) it produces "Error: 200 (OK) from TTS API. Probable cause: No audio stream in response. Unsupported language 'en'" along with EOF. The message seems to go to stderr, without an actual Python error being raised.

$ gtts-cli -f test -o test.mp3

Attached files: working_test.txt, non_working_test.txt. I assumed that containing 0xA0 would make these files binary, but the file command says otherwise.

$ file non_working_test.txt
non_working_test.txt: Unicode text, UTF-8 text

gtts-cli didn't complain about non-UTF-8 characters, and using iconv to strip non-UTF-8 characters doesn't change anything: $ iconv -f utf-8 -t utf-8 -c test leaves the file as it is. Some web pages use that character in the middle of the text, and most text editors show it as a space. This is frustrating for the user, who has almost no clue what causes the error or what to do about it. I can't blame the author of the page either: after searching online, it seems that 0xA0 is part of the windows-1252 encoding, so if the blog post was written in Microsoft Word, there's a good chance the character was introduced there.
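One possible explanation for the file and iconv results above, stated as an assumption rather than something confirmed in this report: if the 0xA0 byte is the second byte of a UTF-8 encoded no-break space (U+00A0, encoded as 0xC2 0xA0), the file is valid UTF-8 and iconv -c has nothing to strip. A minimal Python sketch to check for that; find_nbsp is a hypothetical helper name, not part of gTTS:

```python
NBSP = "\u00a0"  # no-break space; UTF-8 encodes it as the bytes 0xC2 0xA0


def find_nbsp(data: bytes) -> list[int]:
    """Return the character offsets of U+00A0 in UTF-8 encoded data.

    Raises UnicodeDecodeError if the data is not valid UTF-8, for example
    when it contains a lone 0xA0 byte from a windows-1252 file.
    """
    text = data.decode("utf-8")
    return [i for i, ch in enumerate(text) if ch == NBSP]


sample = "hello\u00a0world".encode("utf-8")
print(sample)             # b'hello\xc2\xa0world'
print(find_nbsp(sample))  # [5]
```

Running find_nbsp on the raw bytes of the failing file would show whether it holds encoded no-break spaces or lone 0xA0 bytes.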

Expected Behaviour

gtts-cli should ignore that character and continue reading regardless of how and where it is present.

Context

I am writing a simple bash script that reads aloud the user's clipboard or a webpage associated with the url in the user's clipboard.
I have personally been using the command w3m "$(xclip -o)" | gtts-cli -f - | mpv - for over a year to boost productivity when reading, with some variations such as less $pdf_file_or_epub_file | gtts-cli -f - | mpv - and so on.
The script basically does the same (still very basic and under development).
I came across some web pages that caused this error to occur. After some investigation I found out that the character 0xA0 is what causes the problem.
So I created this issue and made a small workaround that uses bbe to replace the bad character with nothing (followed by iconv for cleanup, since the replacement messes up a couple of things).
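As an alternative to the bbe and iconv workaround, the text could be cleaned before it reaches gtts-cli. The sketch below assumes the problematic character is the no-break space U+00A0 and that replacing it with an ordinary space is acceptable; normalize_spaces and filter_stream are hypothetical names, not part of gTTS:

```python
import io


def normalize_spaces(text: str) -> str:
    """Replace every no-break space (U+00A0) with a regular space."""
    return text.replace("\u00a0", " ")


def filter_stream(src, dst) -> None:
    """Copy text from src to dst line by line, normalizing spaces."""
    for line in src:
        dst.write(normalize_spaces(line))


# Demonstration with in-memory streams instead of stdin/stdout.
source = io.StringIO("first\u00a0line\nsecond line\n")
sink = io.StringIO()
filter_stream(source, sink)
print(repr(sink.getvalue()))  # 'first line\nsecond line\n'
```

Called with sys.stdin and sys.stdout, filter_stream could sit in the pipeline between w3m and gtts-cli -f -. Replacing the character with a space rather than deleting it keeps adjacent words from being joined.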

Environment

$ gtts-cli --version
gtts-cli, version 2.2.4

$ python --version
Python 3.9.12

$ uname -a
Linux Laptop 5.17.3-tkg-pds #1 TKG SMP PREEMPT Sat Apr 16 06:53:55 CET 2022 x86_64 Intel(R) Celeron(R) N4000 CPU @ 1.10GHz GenuineIntel GNU/Linux
medanisjbara commented 2 years ago

I assume this isn't gtts-cli's fault, since there's no actual Python error, so the problem is probably with the Google text-to-speech engine itself. Still, the behavior is confusing, so I hope a fix will be applied.

pndurette commented 2 years ago

@medanisjbara Thanks a lot for this well documented behaviour!

Hmm, so it's a windows-1252 character. I wonder if there's anything gTTS should (or shouldn't) do about this, like applying some filtering. I'll have to take a look with debugging turned on.