tvgrabbers / tvgrabnlpy

This version is deprecated; see: tvgrabpyAPI
https://github.com/tvgrabbers/tvgrabpyAPI
GNU General Public License v2.0

Thoughts on enhancements #52

Closed hikavdh closed 8 years ago

hikavdh commented 8 years ago

A more properly named issue to continue #49 Error with some ids

kyl416 commented 8 years ago

For VRT you just need to have "application/vnd.epg.vrt.be.schedule_3.1+json" in the Accept header and it delivers it as json http://services.vrt.be/epg/schedules/thisweek?channel_code=O9&type=week

hikavdh commented 8 years ago

I'll look at it, but now I have to find some sleep. Tomorrow I have to resolve some Windows profile issues and create redundancy for next time.

hikavdh commented 8 years ago

I've added a no_genric_matching table to sourcematching. You can add IDs per source. I still have to implement the code. For those source/ID combinations I will disable everything but time/title matching, so group-slots and split-episodes as well. If you haven't noticed yet, humo.be is down! I took the opportunity to add code that makes configure fail in such a case. You can force it to complete by disabling the source.

hikavdh commented 8 years ago

Oh, and I noticed that the timings for Nickelodeon are also off on Horizon. In the past Horizon had a bad name for accuracy, etc., but I thought that had improved. In contrast, the three Belgian sources seem to be very much in agreement.

hikavdh commented 8 years ago

https://github.com/tvgrabbers/tvgrabnlpy/releases/tag/alfa-2.2.8-p20151222 This should disable genre and split episode matching for the source/channel combinations in the no_genric_matching table. CC on Horizon is already in that table. Groupslot detection disabling is more tricky, so I leave that for now.

hikavdh commented 8 years ago

I will add that the IDs in no_genric_matching will be put at the end of the source list for that channel. This means that you cannot set such a source as Prime_source (unless of course it's the only source). Let me know whether this works or whether groupslots also need to be disabled. As said, that's tricky. In the merge procedure, those are taken out of the listings and later put through a separate comparison against the remainder of the other listings. I'm afraid of unexpected side effects, so I'd rather leave that out.
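
A minimal sketch of that reordering, assuming the table maps a source ID to a list of channel IDs on that source. The function names, the table layout and the source numbers are illustrative assumptions, not taken from tv_grab_nl.py.

```python
def order_sources(source_order, channel_ids, no_generic_matching):
    """Return source_order with the restricted sources moved to the end.

    source_order: list of source IDs in preferred order.
    channel_ids: dict mapping source ID -> this channel's ID on that source.
    no_generic_matching: dict mapping source ID -> list of restricted channel
        IDs (a hypothetical layout for the no_genric_matching table).
    """
    def restricted(source):
        return channel_ids.get(source) in no_generic_matching.get(source, [])

    # sorted() is stable and False sorts before True, so the relative order
    # inside the unrestricted and restricted groups is preserved.
    return sorted(source_order, key=restricted)


def prime_source(source_order, channel_ids, no_generic_matching):
    """Pick the first source in the reordered list that carries the channel."""
    ordered = order_sources(source_order, channel_ids, no_generic_matching)
    available = [s for s in ordered if s in channel_ids]
    return available[0] if available else None
```

With this ordering a restricted source only becomes the prime source when no other source carries the channel.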

I've also been thinking about Nickelodeon. To merge them back together without changing chanids/xmltvids I have to introduce chanid aliases. I can create a table for those, but I'd rather make it more general: a user-settable option to set an alias for a channel to use as xmltvid. Next to that we can create a table of chanids that are being phased out, to automatically create an alias on running --configure when such a chanid is found active. That way aliases stay in use for a user until he/she changes his/her configuration, even if we remove a chanid from that table after a certain time. I'm thinking of adding an end date to the table, so we know when we can remove it. Otherwise it will get too crowded, as is starting to happen with the empty_channel list. There are already more than 10 no-longer-existing IDs in there.

hikavdh commented 8 years ago

Oh, and if you think other source/channel combinations would benefit, feel free to add them!

hikavdh commented 8 years ago

https://github.com/tvgrabbers/tvgrabnlpy/releases/tag/alfa-2.2.8-p20151223 Added the exclusion from prime_source and the new xmltvid_alias option. I have to work on configure.

hikavdh commented 8 years ago

https://github.com/tvgrabbers/tvgrabnlpy/releases/tag/alfa-2.2.8-p20151227 Added another option, legacy_xmltvids, further clearing the way for pushing xmltvid_aliases through source_matching. I documented both options in the wiki. I still have to add the code for that, but I think that before we can re-merge the Nickelodeon channels, we have to wait at least a week after releasing the version. In the meantime I set the prime_source to 1.
Did you get any testing done? Or did the holidays come in between? ;-)

kyl416 commented 8 years ago

I haven't had time for any in depth testing

I did spot this one error from tvgids.tv, not sure if it's a parsing problem for this specific program, but it has popped up on several grabs in the past few days:

Error extracting ElementTree from:http://www.tvgids.tv/tv/trips-travel/14792796 on tvgids.tv
hikavdh commented 8 years ago

I guessed as much ;-) Those errors come from errors in their HTML encoding. The most common are unencoded double quotes in titles (notoriously in classical music titles on Brava) or tags partially placed inside other tags. tvgids.tv is very sloppy. I catch some, and now and then I think about further algorithms to catch more, but it doesn't have the highest priority and is quite complex.

hikavdh commented 8 years ago

In raw-output you'll find the offending text and the exact location within.

hikavdh commented 8 years ago

https://github.com/tvgrabbers/tvgrabnlpy/releases/tag/alfa-2.2.8-p20151230 It took me a lot of thinking, but I have made a framework for remerging chanids. See the "merge_into" table in sourcematching. "1-nickelodeon":{"chanid":"0-89", "sources":{"0":"89"}, "date":"20160101"}} This results in the following actions on running configure:

At present we cannot yet set source 0 as prime_source for Nickelodeon, as the file is also used by older versions, so 1 has to do. When not running configure, if the merging chanid "0-89" is found, combined_channels is checked and updated ad hoc, and the ids from 0-89 are added to 1-nickelodeon. I also added a date field. It is not used, but gives us a clue on when to permanently update sourcematching.json. I think after 2 or 3 months.
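
The ad-hoc merge could be sketched roughly as follows. This is an illustration under assumptions, not the actual code: the `channels` layout (chanid mapped to a dict of source number to ID) and the removal of the merged chanid afterwards are guesses, and `apply_merge_into` is a hypothetical name.

```python
# Entry copied from the "merge_into" table quoted above.
MERGE_INTO = {
    "1-nickelodeon": {"chanid": "0-89", "sources": {"0": "89"}, "date": "20160101"},
}


def apply_merge_into(channels, merge_into):
    """Fold each merging chanid into its target chanid.

    channels: dict mapping chanid -> {source number (int): ID on that source}.
    Assumption: the merged chanid is dropped once its IDs are copied over.
    """
    for target, entry in merge_into.items():
        merging = entry["chanid"]
        if merging not in channels:
            continue  # nothing to merge for this user
        source_ids = channels.setdefault(target, {})
        for source, chan_id in entry["sources"].items():
            source_ids[int(source)] = chan_id
        del channels[merging]
    return channels
```

The "date" field is left untouched here, matching the remark that it is informational only.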

hikavdh commented 8 years ago

I added an extra message field to explain. It's not yet in the above version.

kyl416 commented 8 years ago

I'm getting the following error after the latest changes:

An unexpected error has occured:
Traceback (most recent call last):
  File "tv_grab_nl.py", line 13101, in main
    x = config.validate_commandline()
  File "tv_grab_nl.py", line 2244, in validate_commandline
    x = self.get_sourcematching_file(self.args.configure)
  File "tv_grab_nl.py", line 2175, in get_sourcematching_file
    self.channels(newch).source_id[int(source)] = id
TypeError: 'dict' object is not callable
hikavdh commented 8 years ago

Oops, I missed that one. I had used '()' instead of '[]' and had copied that part over several times. I thought I had corrected them all. You checked without running --configure ;-) Download again in a minute.
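
For reference, a stripped-down reproduction of this kind of mistake, using a plain dict in place of the real channels attribute (the variable names are illustrative only):

```python
channels = {"1-nickelodeon": {}}

try:
    # Parentheses try to *call* the dict, which raises TypeError.
    channels("1-nickelodeon")[0] = "89"
except TypeError as err:
    message = str(err)

# Square brackets index the dict, which is what was intended.
channels["1-nickelodeon"][0] = "89"
```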

hikavdh commented 8 years ago

Updated the tag again. Found some more. I only tested with --configure! ;-(

hikavdh commented 8 years ago

I did look again at vrt.be, but what Accept header do you mean? http://services.vrt.be/epg/schedules/thisweek?channel_code=O9&type=week just gives a list of available formats.

hikavdh commented 8 years ago

Or better said, I guess I can find how to do it in Python, but can I do it in an ordinary browser, say Firefox? Else it becomes cumbersome to write the code as I have to get the output through Python always.

hikavdh commented 8 years ago

I added the vrt.be channels to source channels. Can you check that the merges are right? I'm especially wondering about ketnet/één+/canvas+, as they are together in combined channels, and about radio 2 and possibly Klara. I'll post an alfa once I have the get channels part ready.

kyl416 commented 8 years ago

I'm not sure how to set custom headers with python, but with curl and wget you do something like this:

curl "http://services.vrt.be/epg/schedules/thisweek?channel_code=O9&type=week" -H "Accept: application/vnd.epg.vrt.be.schedule_3.1+json"
wget "http://services.vrt.be/epg/schedules/thisweek?channel_code=O9&type=week" --header="Accept: application/vnd.epg.vrt.be.schedule_3.1+json"
hikavdh commented 8 years ago

https://github.com/tvgrabbers/tvgrabnlpy/releases/tag/alfa-2.2.8-p20160101 In Python I just add it to the dict that also contains the user agent, but it would be nice if I could just call it in Firefox instead of calling it on the command line, piping to a file and then opening the file. The above alfa contains get_channels for vrt.be. I still have to work on the listings. For now it just ignores the vrt.be ids on grabbing.
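
A minimal Python 3 sketch of passing the Accept header alongside the user agent in one headers dict. The helper name `build_vrt_request` and the default user-agent string are assumptions for illustration; the URL and Accept value are the ones quoted earlier in this thread.

```python
import urllib.request

VRT_ACCEPT = "application/vnd.epg.vrt.be.schedule_3.1+json"


def build_vrt_request(channel_code, user_agent="Mozilla/5.0"):
    """Build (but do not send) a request for this week's schedule as JSON."""
    url = ("http://services.vrt.be/epg/schedules/thisweek"
           "?channel_code=%s&type=week" % channel_code)
    # The Accept header sits in the same dict as the user agent.
    headers = {"User-Agent": user_agent, "Accept": VRT_ACCEPT}
    return urllib.request.Request(url, headers=headers)


# Sending it would look like this (left commented out: needs network access):
# with urllib.request.urlopen(build_vrt_request("O9")) as response:
#     raw_json = response.read().decode("utf-8")
```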

hikavdh commented 8 years ago

Oh, and I tagged a beta with all previous updates!

kyl416 commented 8 years ago

The following entries from VRT are inactive; you won't get any data from them. They are just historical entries for previous channels that either merged with another channel, rebranded or no longer exist:
04 Ketnet Alternatief
05 De Overname
14 Jazz Middelheim
15 Radio 1 Classics
30 Radio 2 De Topcollectie XL
33 Klara Jazz
42 Studio Brussel Rock It!
51 Donna (Now 55 MNM)
52 Donna Hitbits (Now 56 MNM Hits)
61 Radio Vlaanderen Internationaal
62 Radio Vlaanderen
N7 één+ (Now part of a combined listing with O9 Ketnet)
O7 Ketnet+&Canvas+ (Now part of a combined listing with O9 Ketnet)

The 1-ketnet-canvas-2 listing on tvgids.tv is a leftover from when there was a Ketnet+ sharing with Canvas+. It shows the same listings as 1-ketnet-op12, except it only goes out for 1 or 2 days, while 1-ketnet-op12 has more days. 1-ketnet-op12 is currently listed as in empty_channels, so we should probably list 1-ketnet-canvas-2 as empty instead since all it does is give you 1 day worth of listings for Ketnet/één+/Canvas+.

I would keep VRT's Radio 2 sources separate from VPRO's Radio 2. VPRO just has a generic "Radio 2 Regionaal" program during local hours, while VRT includes details for those local shows. VRT's website defaults to Vlaams-Brabant (22) for Radio 2, so if you want to merge them, I would select that. Just make sure to take the name from VRT so people using it know that's the regional entry for Vlaams-Brabant, and either add "Radio 2 Regionaal" to groupslot_names so it doesn't override the local shows, or make VRT the prime_source.

Depending on how much details you are able to get from their API, you might want to make it a prime_source for all of VRT's stations.

This is how the sources break down:

Eén: 0-5 1-een 5-24443943058 6-22 7-een 8-een 9-een 10-O8

Canvas: 0-6 1-ketnet-canvas 5-555680807173 6-18 7-vrt_canvas 8-canvas 9-canvas 10-1H

Ketnet/één+/Canvas+: 1-ketnet-op12 (needs to be removed from empty_channels) 5-24443943087 6-59 7-ketnet 10-O9

Ketnet Only: 8-ketnet 9-ketnet

één+ Only: 8-eenplus

VRT Radio 1 (VRT just calls it Radio 1, so you might want to do a rename to VRT Radio 1 so people know its for VRT): 7-vrt_radio_1 10-11

Klara: 7-klara 10-31

kyl416 commented 8 years ago

Also if you can detect the active/inactive state from the json, I would use that to determine which channels to include

{
    "code": "1H",
    "name": "Canvas",
    "displayName": "Canvas",
    "eid": "46162538",
    "type": "tv",
    "state": "active",
    "description": "",
    "radioplayerUrl": null,
    "websiteUrl": "http://www.canvas.be/",
    "logoUrl": "http://images.vrt.be/height100/logo/canvas/CANVAS_logo_lichtblauw.jpg",
    "streamsLink": {
        "rel": "http://services.vrt.be/channel/rel/channel/streams",
        "href": "http://services.vrt.be/channel/s/1H/streams"
    },
    "imagesLink": {
        "rel": "http://services.vrt.be/rel/images",
        "href": "http://services.vrt.be/channel/s/1H/images"
    },
    "thirdpartyLinksLink": {
        "rel": "http://services.vrt.be/rel/thirdpartylinks",
        "href": "http://services.vrt.be/channel/s/1H/thirdpartylinks"
    },
    "detailLink": {
        "rel": "http://services.vrt.be/channel/rel/channel",
        "href": "http://services.vrt.be/channel/s/1H"
    }
}

vs

{
    "code": "05",
    "name": "De Overname",
    "displayName": "De Overname",
    "eid": "05",
    "type": "radio",
    "state": "inactive",
    "description": "",
    "radioplayerUrl": null,
    "websiteUrl": "http://www.deovername.be/",
    "logoUrl": "http://services.vrt.be/images/height100/logos/vrt_grey.png",
    "streamsLink": {
        "rel": "http://services.vrt.be/channel/rel/channel/streams",
        "href": "http://services.vrt.be/channel/s/05/streams"
    },
    "imagesLink": {
        "rel": "http://services.vrt.be/rel/images",
        "href": "http://services.vrt.be/channel/s/05/images"
    },
    "thirdpartyLinksLink": {
        "rel": "http://services.vrt.be/rel/thirdpartylinks",
        "href": "http://services.vrt.be/channel/s/05/thirdpartylinks"
    },
    "detailLink": {
        "rel": "http://services.vrt.be/channel/rel/channel",
        "href": "http://services.vrt.be/channel/s/05"
    }
}
hikavdh commented 8 years ago

I saw the inactive tag, but thought it just meant 'no programming at present'. I'll add an exclusion on that tag, so we don't need to set them in empty_channels. If we already set prime_source to 10 now, pre-2.2.8 users will fall back to prime_source_order for determining the prime_source. It might even create errors. I think I added an ignore for non-existent sources, but I have to check. If we switch 1-ketnet-canvas-2 for 1-ketnet-op12 we have to keep 1-ketnet-canvas-2 as chanid, so the xmltvid does not change. Feel free to do that, i.e. update source_channels[1]["1-ketnet-canvas-2"] and empty_channels. The naming of Radio 1/Radio 2 is clear from the grouping in Radio Vlaams, so I see no need to rename.
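
A small sketch of such an exclusion, assuming the channel list arrives as a JSON array of objects shaped like the two examples quoted above. The function name and the array assumption are mine, not confirmed against the API.

```python
import json


def active_channels(raw_json):
    """Map channel code -> display name for channels whose state is "active".

    Assumption: raw_json is a JSON array of channel objects with the
    "code", "displayName" and "state" keys seen in the examples.
    """
    channels = json.loads(raw_json)
    return {
        ch["code"]: ch["displayName"]
        for ch in channels
        if ch.get("state") == "active"
    }


sample = json.dumps([
    {"code": "1H", "displayName": "Canvas", "state": "active"},
    {"code": "05", "displayName": "De Overname", "state": "inactive"},
])
found = active_channels(sample)  # only the Canvas entry is kept
```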

hikavdh commented 8 years ago

But if you think renaming better, again feel free to add those entries.

hikavdh commented 8 years ago

I'm wondering: they give start and end times to the second (GMT)

            "startTime":"2015-12-28T05:00:08.000Z",
            "endTime":"2015-12-28T05:05:15.000Z",

and the next starttime:

            "startTime":"2015-12-28T05:05:23.000Z"

And that while they are notorious for often starting too late or too early, sometimes by more than 15 minutes. I always schedule them with broader margins.

kyl416 commented 8 years ago

Which channel is that for?

hikavdh commented 8 years ago

Ketnet

kyl416 commented 8 years ago

Is there another example of something in the future so I can compare with others?

hikavdh commented 8 years ago

In your example URL you used type=week; this always seems to mean Monday to Sunday of the current week. The alternative is type=day. I also see an option view (week or month), an option cascading whose purpose I don't know, and an option date. Did you experiment with how to get data past the current week and with the syntax for date?

hikavdh commented 8 years ago

For today I don't see the seconds, maybe they update afterwards?

"startTime":"2016-01-02T05:00:00.000Z"
"endTime":"2016-01-02T05:05:00.000Z"
"title":"Hopla"

and the next starttime:

"startTime":"2016-01-02T05:05:00.000Z"

hikavdh commented 8 years ago

But this should be accurate as it's the starting show of the day.

kyl416 commented 8 years ago

I didn't really experiment beyond finding the list of channels and the corresponding epg data.

I think you can use this to specify the specific days you want: http://services.vrt.be/epg/schedules/20160103?type=day&channel_code=O9

It might be easier to just keep on advancing until there's no more listings available.
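
A sketch of that "keep advancing" loop. The day URL follows the pattern above; what an exhausted day actually looks like in the response is unknown, so the caller-supplied `fetch` function (hypothetical) is assumed to return an empty list in that case, and `max_days` is an arbitrary safety cap.

```python
from datetime import date, timedelta


def day_url(day, channel_code):
    """Build the per-day schedule URL, e.g. .../schedules/20160103?..."""
    return ("http://services.vrt.be/epg/schedules/%s?type=day&channel_code=%s"
            % (day.strftime("%Y%m%d"), channel_code))


def grab_until_empty(fetch, channel_code, start, max_days=14):
    """Collect programs day by day until a day comes back empty.

    fetch: callable taking a URL and returning a list of programs
        (assumed to be empty when no listings are available).
    """
    listings = []
    for offset in range(max_days):
        programs = fetch(day_url(start + timedelta(days=offset), channel_code))
        if not programs:
            break
        listings.extend(programs)
    return listings
```

Passing `fetch` in keeps the loop testable without network access; the real grabber would plug in its own page-fetching routine.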

hikavdh commented 8 years ago

However later today:

"startTime":"2015-12-28T15:16:51.000Z"
"endTime":"2015-12-28T15:42:46.000Z"
"title":"Mega Mindy"

and the next starttime

"startTime":"2015-12-28T15:45:06.000Z"
hikavdh commented 8 years ago

My current listing has that one from 16:25 to 16:50 CET

kyl416 commented 8 years ago

I wonder if it's related to how they also record pretty much every episode internally and post some of them to their kijken player and the gap is ads/filler. If so it might be safe to extend to the start of the next program. I don't have access to a stream of Ketnet, maybe you can do some checking later to see what's closer to what actually aired.

hikavdh commented 8 years ago

No, I'll add the gap, as in the original NPO list, as 'ads/announcements' or 'Programmainfo en Reclame'.

kyl416 commented 8 years ago

Does Ketnet have ads like Zapp between shows, or is the filler just promos and hosted segments like CBeebies and CBBC?

hikavdh commented 8 years ago

I never watch it, but één and canvas do have leading and closing ads like 'this is brought to you by....'

hikavdh commented 8 years ago

I'll round starttime down to the nearest minute and endtime up to the nearest minute. I think seconds will get ignored by most.
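
A minimal sketch of that rounding, assuming the timestamps always have the "2015-12-28T15:16:51.000Z" shape shown in this thread. The function names are illustrative.

```python
from datetime import datetime, timedelta


def parse_vrt_time(value):
    """Parse a UTC timestamp such as "2015-12-28T15:16:51.000Z" (naive result)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


def round_down(moment):
    """Drop seconds and microseconds: the start of the current minute."""
    return moment.replace(second=0, microsecond=0)


def round_up(moment):
    """Advance to the next whole minute unless already on one."""
    floored = round_down(moment)
    return floored if floored == moment else floored + timedelta(minutes=1)


start = round_down(parse_vrt_time("2015-12-28T15:16:51.000Z"))
stop = round_up(parse_vrt_time("2015-12-28T15:42:46.000Z"))
```

For the Mega Mindy example this gives 15:16 to 15:43 UTC, and times already on a whole minute are left unchanged.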

kyl416 commented 8 years ago

Also, I pushed the changes, along with a second one to add the empty_channels for VRT. VRT seems to have better logos for some things; is there an easy way to include them?

Like the RADIO2_RED_RGB.png is better than the one we currently have for 7-vrt_radio_2

hikavdh commented 8 years ago

Thanks! They are. Just run configure.

kyl416 commented 8 years ago

I mean specify so it will use that logo instead of the current one. I already ran configure and this is what the string says:

VRT Radio 2;12;7-vrt_radio_2;;;;;;;;vrt_radio_2;;;;4;radio_vrt2.png

Is there a way to change sourcematching.json so it uses the same logo as the other Radio 2 entries by default?

11;radio2/RADIO2_RED_RGB.png
kyl416 commented 8 years ago

Also, I think you were looking at the wrong day, for the next airing of Mega Mindy on 1/2 I see the following:

"startTime": "2016-01-02T15:35:00.000Z"
"endTime": "2016-01-02T16:00:00.000Z"
"title": "Mega Mindy"

The show after that says this:

"startTime": "2016-01-02T16:00:00.000Z"
"endTime": "2016-01-02T16:17:00.000Z"
"title": "Welkom in de Wilton"

It seems the precise to the second timeslots are only for shows that aired earlier

hikavdh commented 8 years ago

For Radio 2, as it does not have a vrt id, you have to add it to logo_names. The icon source is already there, so it will work for 2.2.7 users too. Change entry:

        "7-vrt_radio_2": ["4", "radio_vrt2"],

to

        "7-vrt_radio_2": ["11", "radio2/RADIO2_RED_RGB.png"],
hikavdh commented 8 years ago

You're right, I got lost in that long listing. But still, it is 10 minutes later than in my present listing, grabbed last night! So hopefully they are more accurate!

kyl416 commented 8 years ago

For some reason it's not accepting the change, after I run configure it doesn't have a logo at all:

VRT Radio 2;12;7-vrt_radio_2;;;;;;;;vrt_radio_2;;;;-1;

I also tried adding VRT's een+ logo to 8-eenplus since the one from Nieuwsblad is the een logo, but it still is showing the one from nieuwsblad:

een+;2;8-eenplus;;;;;;;;;eenplus;;;8;eenplus.png
kyl416 commented 8 years ago

n/m on the eenplus one, since it's an old entry it's on a different url