medanisjbara / clipboard-reader


Removing 0xA0 from certain websites. #1

Open medanisjbara opened 2 years ago

medanisjbara commented 2 years ago

Some websites use 0xA0 along with some occurrences of \n. When that text is passed to gtts-cli along with the rest of the page, gtts-cli crashes and outputs this message: Error: 200 (OK) from TTS API. Probable cause: No audio stream in response. Unsupported language 'en', followed by EOF. As a result, the script only works up to the first 0xA0 it encounters.

To the user, it looks like the script just stops at a certain point in the page (the same point every time the script is run). Detecting 0xA0 also isn't possible with the vim editor.
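Since the character is invisible in the editor, a small script can report where it occurs. The following is only a sketch, assuming the page text is already available as a decoded string; the helper name `find_nbsp` is hypothetical and not part of the script.

```python
NBSP = "\u00a0"  # U+00A0, the character stored as byte 0xA0 in windows-1252

def find_nbsp(text):
    """Return 1-based (line, column) pairs for every U+00A0 in text."""
    hits = []
    # Split on "\n" only, so line numbers match what a plain editor shows.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == NBSP:
                hits.append((line_no, col))
    return hits

if __name__ == "__main__":
    sample = "first line\nsecond\u00a0line\n\u00a0third"
    for line_no, col in find_nbsp(sample):
        print(f"U+00A0 at line {line_no}, column {col}")
```

The first reported position should be the point where the script appears to stop.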

medanisjbara commented 2 years ago

A possible workaround is to add an option that removes \xA0 from the content. I'm working on implementing it using the bbe command. This leads to another problem: bbe takes binary input and gives binary output, which turns a lot of the Unicode characters into invalid, non-Unicode byte sequences. It isn't a big deal, since we can remove those using the iconv command, but doing so will remove a lot of symbols from the page that gtts-cli could otherwise have read.

This is the main reason why I will make this an optional behavior associated with the -u flag.
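To illustrate the trade-off, here is a minimal Python sketch, not what the script does. It assumes the page is UTF-8 and that the bbe step amounts to deleting every 0xA0 byte. In UTF-8 the byte 0xA0 is also the second byte of other characters (for example "à" is 0xC3 0xA0), so byte-level removal corrupts them, while removal on the decoded text does not. Both function names are hypothetical, and replacing the character with a regular space rather than deleting it is a design choice of this sketch.

```python
def strip_nbsp_bytes(raw: bytes) -> bytes:
    """Byte-level removal: drop every 0xA0 byte, whatever it belongs to."""
    return raw.replace(b"\xa0", b"")

def strip_nbsp_text(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Text-level removal: decode first, then replace only U+00A0."""
    text = raw.decode(encoding)
    # A regular space keeps the neighbouring words apart.
    return text.replace("\u00a0", " ").encode(encoding)

if __name__ == "__main__":
    # "café", a no-break space, then "à la carte", encoded as UTF-8.
    page = "caf\u00e9\u00a0\u00e0 la carte".encode("utf-8")

    damaged = strip_nbsp_bytes(page)
    try:
        damaged.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("byte-level removal left invalid UTF-8:", exc.reason)

    # Discarding the invalid bytes (similar in spirit to the iconv cleanup)
    # also loses the "à".
    print(damaged.decode("utf-8", errors="ignore"))

    print(strip_nbsp_text(page).decode("utf-8"))
```

The text-level variant keeps every other symbol intact, so it would not need the iconv cleanup afterwards.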

medanisjbara commented 2 years ago

After this commit, the script has an option to remove undesired characters from the web page. Now we need to document this in the README, and we might want to consider opening a PR against gTTS, since they might find this useful.

medanisjbara commented 2 years ago

I'm finished with the documentation in the README. But I found out that Sublime Text does show 0xA0 in text files. After doing some tests, it seems that one of the pages I was referring to contains a lot of 0xA0 characters. I also wrote a test file and ran gtts-cli on it, only to find out that it does indeed ignore 0xA0.

medanisjbara commented 2 years ago

After a little bit of investigation: the problem is indeed 0xA0, but it seems to cause the gTTS error only under certain conditions (which I will specify in the issue I will create on their repository).

medanisjbara commented 2 years ago

It seems that 0xA0 appears on websites more frequently than I thought. Most of the time it goes unnoticed, but sometimes a single occurrence in the right place can cause gtts-cli to crash. A good example of this is the LPI website, where there are only 6 occurrences of the character (all widely spread out), and the first one encountered causes the error mentioned above. I will probably make filtering this character the default behavior, since its existence doesn't serve any practical purpose.

medanisjbara commented 2 years ago

Another website also uses this character, which causes the same problem. I'm not sure whether collecting a list of websites that trigger the error is helpful, since the problem is more or less identified. I'm still curious why this character is being added to websites in the first place, and why gTTS is not okay with it. The second part is being worked on by the author of gTTS. I assumed at first that the character is introduced when the writer drafts the text in Microsoft Word, since it is part of the windows-1252 encoding. But some of the websites seem to be Linux-oriented, or written by people advanced enough that they probably won't use MS Word and the like. Even though I feel that further investigation is needed, I don't think this is my problem. But as I mentioned earlier, I may make the filtering the default behavior of the script.
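On the question of where the character comes from: one source that does not involve MS Word at all is the HTML entity &nbsp;, which decodes to U+00A0. Whether that is what happens on the websites mentioned here has not been checked; the sketch below only shows how the same character appears in the different forms discussed in this thread. The function name `nbsp_forms` is hypothetical.

```python
import html

NBSP = "\u00a0"  # NO-BREAK SPACE

def nbsp_forms():
    """Show how U+00A0 relates to &nbsp;, windows-1252 and UTF-8."""
    return {
        # The HTML entity &nbsp; decodes to U+00A0.
        "from_html_entity": html.unescape("&nbsp;") == NBSP,
        # In windows-1252 the character is the single byte 0xA0.
        "cp1252_bytes": NBSP.encode("cp1252"),
        # In UTF-8 it is the two-byte sequence 0xC2 0xA0.
        "utf8_bytes": NBSP.encode("utf-8"),
    }

if __name__ == "__main__":
    for name, value in nbsp_forms().items():
        print(name, value)
```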