tvgrabbers / tvgrabpyAPI

An xmltv-API for extracting and merging tv programme information from several sources
https://github.com/tvgrabbers/tvgrabpyAPI/releases/latest
GNU General Public License v3.0

Tvgids.nl HTML in description output #36

Closed mitchellklijs closed 5 years ago

mitchellklijs commented 5 years ago

Hi,

I've noticed that the tvgids.nl source includes HTML tags in the description output for programmes.

When running tv_grab_nl3.py without --disable-source 3 (3 = tvgids.nl), the output for the NOS Journaal, for example, looks like this:

  <programme  start="20190425000500 +0200" stop="20190425002500 +0200" channel="0-1">
    <title  lang="nl">NOS Journaal</title>
    <desc  lang="nl">&lt;html&gt;&lt;p&gt;Met het laatste nieuws, gebeurtenissen van nationaal en internationaal belang en de weersverwachting voor vandaag.&lt;/p&gt;&lt;p&gt;&lt;ul&gt;&lt;li&gt; 3,5 miljard voor nieuwe stations &lt;/li&gt;&lt;li&gt; Dodental Sri Lanka naar gestegen &lt;/li&gt;&lt;li&gt; Vier doden bij ongeluk op de A12 &lt;/li&gt;&lt;li&gt; Bodycam politie remt agressie &lt;/li&gt;&lt;li&gt; Angst voor meer schade Notre-Dame &lt;/li&gt;&lt;li&gt; Mildere straffen protesten Hongkong &lt;/li&gt;&lt;li&gt; Explosie in hotel Zaandijk &lt;/li&gt;&lt;li&gt; Hoofdkantoor Gasterra Groningen...</desc>
    <category>News</category>
  </programme>

Please notice the HTML tags included in the description.

When running tv_grab_nl3.py with --disable-source 3 (3 = tvgids.nl), the output for the NOS Journaal, for example, looks like this:

  <programme  start="20190425015500 +0200" stop="20190425021000 +0200" channel="0-1">
    <title  lang="nl">NOS Journaal</title>
    <desc  lang="nl">Met het laatste nieuws, gebeurtenissen van nationaal en internationaal belang en de weersverwachting voor vandaag.</desc>
    <date>2017</date>
    <category>News</category>
  </programme>

Please notice that the description looks different (without the HTML tags) because it's coming from a different source.

I suspect that the source file for tvgids.nl needs to be modified. I'm new to DataTreeGrab, so I don't quite understand what should be changed.

The tvgids.nl page for this programme is: https://www.tvgids.nl/tv/journaal/63510031.

I've tested this with the latest stable release, as well as with beta-1.0.9.

hikavdh commented 5 years ago

This is known, but I haven't had time yet to look into it. tvgids.nl is the source with the highest priority, so unless you set prefered_description for a channel to another source, its description will be used. I have mine mostly set to 7 (vpro.nl). That way you only get the tvgids.nl description if vpro.nl does not offer one.

mitchellklijs commented 5 years ago

Yeah! That would work fine as a workaround for now. Thanks 😉.

However, the information provided by tvgids.nl is the most detailed information I've encountered yet. So a fix would be nice of course!

hikavdh commented 5 years ago

The problem, if I remember correctly, is that those tags do not come from their website's markup, but are embedded in the text itself. This possibly means I have to add extra functionality to tvgrabpyAPI to catch them, which takes more time.

hikavdh commented 5 years ago

It needs either an extra HTML decoding pass after decoding the page and grabbing the data, or a very good regex for search and replace.

hikavdh commented 5 years ago

A regex could be placed in the datadef, if you can think one up?

mitchellklijs commented 5 years ago

I understand. I've tried removing HTML with regex before, but there is always a case where it doesn't work...

The simplest regex I can think of is <[^>]*>, which removes everything between < and >. This may be a bit too aggressive, as it could also remove non-HTML text in angle brackets. A more restrictive approach would be to list common HTML tags, for example: <(?:html|\/html|div|\/div|br|p|\/p|ul|\/ul|li|\/li)[^>]*>. However, that way we'll most likely miss some tags.
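
To make the trade-off concrete, here is a small Python sketch comparing the two patterns on a made-up string. The sample text and the variable names are assumptions for illustration only, not tvgids.nl data. The third pattern is a hypothetical variant that adds a word boundary (`\b`) and folds the closing tags into `/?`; it is included because the listed-tags pattern as written also matches any tag that merely starts with one of the listed names.

```python
import re

# Pattern 1: strips anything between < and >.
ANY_TAG = re.compile(r"<[^>]*>")

# Pattern 2: the listed-tags pattern, copied as written in the comment.
LISTED_TAGS = re.compile(r"<(?:html|\/html|div|\/div|br|p|\/p|ul|\/ul|li|\/li)[^>]*>")

# Pattern 3 (hypothetical variant): \b stops "li" from matching "<live>".
LISTED_TAGS_STRICT = re.compile(r"</?(?:html|div|br|p|ul|li)\b[^>]*>")

# Made-up sample: "<live>" stands in for non-HTML text in angle brackets.
sample = "<p>Kijk <live> mee</p><b>Nieuws</b>"

any_stripped = ANY_TAG.sub("", sample)               # removes everything, including <live>
listed_stripped = LISTED_TAGS.sub("", sample)        # also removes <live>, keeps <b>
strict_stripped = LISTED_TAGS_STRICT.sub("", sample) # keeps <live> and <b>

print(repr(any_stripped))
print(repr(listed_stripped))
print(repr(strict_stripped))
```

The stricter variant leaves unknown tags such as `<b>` in place, which is exactly the "we'll miss some tags" drawback.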

The best way to tackle this issue for good would indeed be to implement an HTML decoding mechanism.
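
One possible reading of such a decoding mechanism is a parser that keeps only the text nodes. The sketch below uses Python's standard `html.parser` module; the class and function names (`TextExtractor`, `strip_html`) are made up for this example, and this is not how tvgrabpyAPI handles it.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and discards all tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Insert a space at paragraph and list-item starts so words do not run together.
        if tag in ("p", "li"):
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return the text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(strip_html("<html><p>Met het laatste nieuws.</p>"
                 "<ul><li> Explosie in hotel Zaandijk </li><li> Het weer</li></ul></html>"))
```

Because the parser ignores attributes, a tag such as `<p style="xxxx">` is dropped the same way as a bare `<p>`.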

hikavdh commented 5 years ago

Well, it was simpler than I thought. It was mostly that I have had little time the last few months. Early this winter tvgids.nl renewed their site and I hadn't found time to do more than the basics. Now the genres and the detail pages are working again as well.

mitchellklijs commented 5 years ago

I think you should also include ul and li tags for tvgids.nl (https://github.com/tvgrabbers/sourcematching/blob/master/sources/source-tvgids.nl.json#L62).

I've just tested the new release, but these tags aren't removed:

  <programme  start="20190429022500 +0200" stop="20190429023000 +0200" channel="0-1">
    <title  lang="nl">NOS Journaal</title>
    <desc  lang="nl">Met het laatste nieuws, gebeurtenissen van nationaal en internationaal belang en de weersverwachting voor vandaag. &lt;ul&gt;&lt;li&gt; Dode bij aanslag op synagoge in de VS &lt;/li&gt;&lt;li&gt; Eerste toeristen terug uit Sri Lanka &lt;/li&gt;&lt;li&gt; Agent Maastricht aangereden &lt;/li&gt;&lt;li&gt; Doden bij kraanongeval in Seattle &lt;/li&gt;&lt;li&gt; Het weer&lt;/li&gt;&lt;/ul&gt;</desc>
    <date>2017</date>
    <category>News</category>
    <previously-shown/>
  </programme>

hikavdh commented 5 years ago

Thanks, and now to determine what to replace them with to keep it readable. I can't use newlines, so I guess it will become a ;-separated list, maybe enclosed in brackets.

mitchellklijs commented 5 years ago

Yeah, would probably be best. Maybe add some other common tags as well (https://www.w3schools.com/tags/)?

One other consideration. Right now tags with attributes wouldn't be removed. For example:

<p style="xxxx"></p>

I've never encountered a situation with tvgids.nl yet where this would be necessary, but it might become necessary in the future?

hikavdh commented 5 years ago

These are only basic layout tags. The text comes from a text field inside a JSON data page, so it should not contain anything more, as that could interfere with the frontend using the data. So definitely no style data.

hikavdh commented 5 years ago

Added ul and li. It is a set of re.sub statements:

"sub": ["<html>", "", "</html>", "", 
    "\\s*</p><p>\\s*", " ", "<p>", "", "</p>", "", 
    "\\s*<ul>", " (", "</ul>\\s*", ") ", 
    "\\s*</li><li>\\s*", "; ", "<li>\\s*", "", "\\s*</li>", ""]

mitchellklijs commented 5 years ago

Yeah, that's true! Forgot about that.
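
For anyone wanting to try the "sub" list posted above outside the grabber, here is a rough Python sketch. It assumes the list is read as consecutive (pattern, replacement) pairs, each applied in order with `re.sub`; that pairing is inferred from the comment, and the helper name `apply_subs` is made up for this example.

```python
import json
import re

# The "sub" list as posted, parsed as JSON so that "\\s" becomes the regex \s.
DATADEF_FRAGMENT = r'''
{"sub": ["<html>", "", "</html>", "",
    "\\s*</p><p>\\s*", " ", "<p>", "", "</p>", "",
    "\\s*<ul>", " (", "</ul>\\s*", ") ",
    "\\s*</li><li>\\s*", "; ", "<li>\\s*", "", "\\s*</li>", ""]}
'''


def apply_subs(text, subs):
    """Apply a flat [pattern, replacement, ...] list in order (assumed semantics)."""
    for pattern, replacement in zip(subs[0::2], subs[1::2]):
        text = re.sub(pattern, replacement, text)
    return text


subs = json.loads(DATADEF_FRAGMENT)["sub"]
sample = ("Met het laatste nieuws. <ul><li> Dode bij aanslag </li>"
          "<li> Het weer</li></ul>")
result = apply_subs(sample, subs)
print(repr(result))
```

Under that assumption the list comes out as "(Dode bij aanslag; Het weer)", with a trailing space left by the `</ul>` replacement. The order matters: the `</li><li>` pair has to run before the single `<li>` and `</li>` patterns, otherwise the separators would be lost.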

hikavdh commented 5 years ago

Oh, and one tip: with tv_grab_nl3.py --clear-source 3 you remove all tvgids.nl data from your database. The current day is always freshly fetched, but days further in the future are retrieved from your database, so it would otherwise take 2 weeks for all HTML-tagged descriptions to disappear. After clearing, it will take several fetches to get back up to 14 days, as a maximum of 3 or 4 days is fetched each time, but that is still faster than 14 days.