OpenURL should unescape special characters in HTML code

vonH / plugin.video.iplayerwww

BBC iPlayer for Kodi

GNU General Public License v2.0


OpenURL should unescape special characters in HTML code #56

Closed: CaptainTK closed this issue 8 years ago

CaptainTK commented 8 years ago

ScrapeEpisodes currently does not handle special characters correctly. For instance, the single quotes in the plot of the programme Looper are not converted from their HTML representation (a numeric character reference such as &#39;) to a single quote.

HTML codes like &#NN; should be converted to the characters they represent, ideally right when the page is fetched in OpenURL.

primaeval commented 8 years ago

I could only find it in Most Popular. There is a fix in my master branch now.

CaptainTK commented 8 years ago

Thanks, but this is not quite what I had in mind. The proposed fix only addresses this particular character; it does not provide a general solution.

Currently, my best candidate would be to use HTMLParser.unescape in OpenURL. This should solve this problem and also take care of all other replace-statements that we currently have.
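A minimal sketch of that idea, assuming the fetched page is available as a text string. `unescape_page` is a hypothetical helper name, not a function in the add-on. On Python 2 the call is `HTMLParser.HTMLParser().unescape`; Python 3 offers `html.unescape` instead, so the sketch tries both.

```python
try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape


def unescape_page(page_text):
    """Hypothetical helper: decode HTML entities in fetched page text."""
    return unescape(page_text)


# Made-up sample text, not taken from the BBC pages.
sample = "Looper&#39;s plot &amp; more"
print(unescape_page(sample))
```

Calling this once on the whole page would handle numeric references and named entities alike, which is why it could replace the individual replace-statements.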

primaeval commented 8 years ago

After the slowness of the soup solution I didn't want to introduce another laggy library. Going through all the html escape codes might be expensive. I couldn't see any other escape sequences. It might be best to wait and see which ones they use. There are 365 according to this page: http://www.theukwebdesigncompany.com/articles/entity-escape-characters.php

CaptainTK commented 8 years ago

I hope you will be able to enjoy a nice hot soup on a cold winter day again someday. ;-)

Seriously: I doubt that HTMLParser has such a bad performance hit. First of all, it has been around for ages and is much faster than BeautifulSoup. Secondly, unescaping HTML is a much simpler task than breaking down a complex page into its building blocks.
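One way to check this claim rather than argue about it is a rough timing run. The sketch below is only an illustration: `SAMPLE_PAGE` is made-up text standing in for a fetched page, `time_unescape` is a hypothetical helper, and the numbers will differ from board to board.

```python
import timeit

try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape

# Made-up stand-in for a fetched page (an assumption, not real iPlayer HTML).
SAMPLE_PAGE = "<p>Looper&#39;s plot &amp; more</p>" * 2000


def time_unescape(runs=20):
    """Return the average number of seconds one unescape call takes."""
    total = timeit.timeit(lambda: unescape(SAMPLE_PAGE), number=runs)
    return total / runs


print("seconds per call: %.6f" % time_unescape())
```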

I will provide a prototype for testing later today and kindly ask you to run it on your Pi.

primaeval commented 8 years ago

Sure I'll give it a run.

I don't think the BBC uses too many HTML codes. None of the unusual Welsh and Scottish characters are encoded.

I haven't found any python code, but the html codes follow the ascii numbering so it should be a simple function to map the codes to chars. http://www.anglesanddangles.com/asciichart.php
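Such a mapping function could look roughly like the sketch below. It assumes only decimal references of the form &#NN; need handling, and `replace_numeric_refs` is a hypothetical name. The numbers are Unicode code points, which match ASCII for values below 128.

```python
import re

try:
    unichr  # Python 2 has a separate unichr for code points above 255
except NameError:
    unichr = chr  # Python 3: chr already covers all code points

# Matches decimal references such as &#39; but not named or hex entities.
NUMERIC_REF = re.compile(r"&#(\d+);")


def replace_numeric_refs(text):
    """Replace each decimal &#NN; reference with the character it encodes."""
    return NUMERIC_REF.sub(lambda match: unichr(int(match.group(1))), text)


print(replace_numeric_refs("Looper&#39;s plot"))
```

Named entities such as &amp; and hexadecimal references are left untouched by this version, so it is narrower than a general unescape.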

CaptainTK commented 8 years ago

@primaeval, I have committed the changes to the development branch. Please check if you can see any regression on your Pi boards.

Unfortunately, the problematic programme, "Looper", is no longer available, so it is hard to test whether this fix works as expected.

primaeval commented 8 years ago

The speed is ok on the rpi: leisurely as before ;)

There is an "xmlcharrefreplace" option to .encode for HTML char codes but I didn't get it to work. Unicode is always a PIA.
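For reference, "xmlcharrefreplace" is an error handler for encoding: it turns characters that the target codec cannot represent into &#NN; references, so it works in the opposite direction from unescaping. That may be why it did not help here, though this is a guess about what was tried. A small sketch with made-up sample text:

```python
try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape

text = u"caf\u00e9"  # made-up sample containing a non-ASCII character

# Encoding direction: the non-ASCII character becomes a numeric reference.
encoded = text.encode("ascii", "xmlcharrefreplace")
print(encoded)

# Decoding direction: unescape turns the reference back into the character.
decoded = unescape(encoded.decode("ascii"))
print(decoded == text)
```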

CaptainTK commented 8 years ago

Thanks, let's give it a try then. I will merge it into master.