Closed by CaptainTK 8 years ago
I could only find it in Most Popular. There is a fix in my master branch now.
Thanks, but this is not quite what I had in mind. The proposed fix only addresses this particular character; it does not provide a general solution.
Currently, my best candidate would be to use HTMLParser.unescape in OpenURL. This should solve this problem and also take care of all other replace-statements that we currently have.
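A minimal sketch of that idea, not the add-on's actual code. It assumes the page text has already been fetched; the names unescape_html and open_url_stub are hypothetical stand-ins, since the real OpenURL is not shown in this thread. On Python 2 the call lives on an HTMLParser instance; on Python 3.4+ the equivalent is html.unescape.

```python
try:
    # Python 3.4+: module-level function
    from html import unescape as _unescape
except ImportError:
    # Python 2: method on an HTMLParser instance
    from HTMLParser import HTMLParser
    _unescape = HTMLParser().unescape


def unescape_html(text):
    """Convert HTML character references (&#39;, &amp;, ...) to characters."""
    return _unescape(text)


def open_url_stub(raw_html):
    """Hypothetical stand-in for OpenURL: takes already-fetched HTML text
    and unescapes it once, so callers need no per-character replace calls."""
    return unescape_html(raw_html)
```

One thing to keep in mind: unescaping a whole page also turns &lt; into <, so this is only safe if the later scraping does not depend on those sequences staying escaped.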
After the slowness of the soup solution I didn't want to introduce another laggy library. Going through all the html escape codes might be expensive. I couldn't see any other escape sequences. It might be best to wait and see which ones they use. There are 365 according to this page: http://www.theukwebdesigncompany.com/articles/entity-escape-characters.php
Hope you will someday be able to enjoy a nice hot soup on a cold winter day again. ;-)
Seriously: I doubt that HTMLParser has such a bad performance hit. First of all, it has been around for ages and is much faster than BeautifulSoup. Secondly, unescaping HTML is a much simpler task than breaking down a complex page into its building blocks.
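One rough way to check the cost on a given board is to time the unescape call by itself. This is only a sketch: the sample string and the time_unescape helper are made up for illustration, it uses the Python 3 html.unescape rather than the Python 2 HTMLParser method, and the numbers will differ from machine to machine.

```python
import timeit
from html import unescape

# Made-up sample: a short escaped phrase repeated to simulate a larger page.
sample = "Looper&#39;s plot &amp; more " * 1000


def time_unescape(repeats=100):
    """Return the average number of seconds taken by one unescape(sample) call."""
    total = timeit.timeit(lambda: unescape(sample), number=repeats)
    return total / repeats


if __name__ == "__main__":
    print("average seconds per call:", time_unescape())
```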
I will provide a prototype for testing later today and kindly ask you to run it on your Pi.
Sure I'll give it a run.
I don't think the BBC uses too many HTML codes. None of the strange Welsh and Scottish characters are encoded.
I haven't found any Python code, but the HTML codes follow the ASCII numbering, so it should be a simple function to map the codes to chars. http://www.anglesanddangles.com/asciichart.php
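Such a mapping function could look roughly like this. It is a sketch under a few assumptions: only decimal references of the form &#NN; are handled (no hex forms like &#x27; and no named entities like &amp;), and the name replace_numeric_refs is made up for illustration.

```python
import re

# Matches decimal references such as &#39; (hex forms like &#x27; are not handled).
_NUMERIC_REF = re.compile(r"&#(\d+);")


def replace_numeric_refs(text):
    """Replace each decimal reference &#NN; with the character at code point NN."""
    def _to_char(match):
        code = int(match.group(1))
        try:
            return chr(code)  # on Python 2 this would need unichr() instead
        except (ValueError, OverflowError):
            return match.group(0)  # leave out-of-range codes untouched
    return _NUMERIC_REF.sub(_to_char, text)
```

For example, replace_numeric_refs("Looper&#39;s") gives "Looper's", while "&amp;" passes through unchanged because it is a named entity.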
@primaeval, I have committed the changes to the development branch. Please check if you can see any regression on your Pi boards.
Unfortunately, the problematic programme, "Looper", is no longer available, so it is hard to test whether this fix works as expected.
The speed is ok on the rpi: leisurely as before ;)
There is an "xmlcharrefreplace" option to .encode for HTML character codes, but I didn't get it to work. Unicode is always a PIA.
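A likely reason it didn't help: xmlcharrefreplace is an error handler for encoding, so it goes in the opposite direction. It turns characters that the target codec cannot represent into &#NN; references and does not decode existing references. A small sketch with a made-up sample string:

```python
# Sample text containing one non-ASCII character (e with acute accent, code point 233).
text = u"caf\u00e9"

# Encoding to ASCII with this handler replaces the unencodable character
# with its decimal character reference instead of raising an error.
encoded = text.encode("ascii", "xmlcharrefreplace")

# Plain ASCII characters, including an apostrophe, are left as they are.
unchanged = u"Looper's".encode("ascii", "xmlcharrefreplace")
```

Here encoded is the byte string caf&#233; and unchanged is still Looper's, so this option cannot be used to turn &#39; back into a quote.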
Thanks, let's give it a try then. I will merge it into master.
ScrapeEpisodes currently does not handle special characters correctly. For instance, the single quotes in the plot of the programme Looper are not converted from their HTML representation (such as &#39;) to a single quote.
HTML codes like &#NN; should be converted to standard ASCII, ideally right when the page is fetched in OpenURL.