tvgrabbers / tvgrabnlpy

This version is deprecated; see: tvgrabpyAPI
GNU General Public License v2.0

Am I missing something, npo3 programs not in xml output #62

Closed pensionado closed 8 years ago

pensionado commented 8 years ago

First of all, I modified the script so that I could find the match details. I would like to recommend this change: match_array = [ 'Match details for %s:\n' % (config.channels[chanid].chan_name)].

An extract from the XML output file shows that programs for npo3 are missing from early in the morning of 20160430 until some time later (see npo3_entries.txt).

The log (npo3_matchlog.txt) shows that many of the missing entries were reported as added (and others later in the log), so why aren't they in the output?

So am I missing something, or have I hit a bug? Attached: npo3_entries.txt, npo3_matchlog.txt

hikavdh commented 8 years ago

I don't understand. As far as I can see I'm not missing anything. If you want the match details, that is a matter of log_level including 32 and setting match_log_level as desired. The added part is present in the statistics header. It is true that some of the (mainly Flemish) sources return a groupslot for the programming outside prime time. Are you referring to that? Also, in your log excerpt I see one source covering only a much shorter timespan, so you probably had some failures on that source.

hikavdh commented 8 years ago

OK, I looked at my log from early this morning:

  2016-04-29 07:13:14 : Fetch statistics for 21063 programms on 84 channels:
  2016-04-29 07:13:14 : Start time: 2016-04-29 05:53
  2016-04-29 07:13:14 : End time: 2016-04-29 07:13
  2016-04-29 07:13:14 : Duration: 1:19:42.747715
  2016-04-29 07:13:14 : 4373 page(s) fetched, of which 211 failed
  2016-04-29 07:13:14 : 9238 cache hits
  2016-04-29 07:13:14 : 72 succesful lookups
  2016-04-29 07:13:14 : 76 failed lookups
  2016-04-29 07:13:14 : Time/fetch: 1.09369945461 seconds
  2016-04-29 07:13:14 : 51 page(s) fetched from
  2016-04-29 07:13:14 : 0 failure(s) on
  2016-04-29 07:13:14 : 4 base page(s) fetched from
  2016-04-29 07:13:14 : 2446 detail page(s) fetched from
  2016-04-29 07:13:14 : 18 failure(s) on
  2016-04-29 07:13:14 : 160 base page(s) fetched from
  2016-04-29 07:13:14 : 689 detail page(s) fetched from
  2016-04-29 07:13:14 : 59 failure(s) on
  2016-04-29 07:13:14 : 1 base page(s) fetched from
  2016-04-29 07:13:14 : 0 failure(s) on
  2016-04-29 07:13:14 : 7 base page(s) fetched from
  2016-04-29 07:13:14 : 1 failure(s) on
  2016-04-29 07:13:14 : 0 base page(s) fetched from
  2016-04-29 07:13:14 : 136 failure(s) on
  2016-04-29 07:13:14 : 16 base page(s) fetched from
  2016-04-29 07:13:14 : 0 failure(s) on
  2016-04-29 07:13:14 : 5 base page(s) fetched from
  2016-04-29 07:13:14 : 0 failure(s) on
  2016-04-29 07:13:14 : 84 base page(s) fetched from
  2016-04-29 07:13:14 : 0 failure(s) on
  2016-04-29 07:13:14 : 5 base page(s) fetched from
  2016-04-29 07:13:14 : 722 detail page(s) fetched from
  2016-04-29 07:13:14 : 2 failure(s) on
  2016-04-29 07:13:14 : 15 base page(s) fetched from
  2016-04-29 07:13:14 : 1 failure(s) on
  2016-04-29 07:13:14 : 0 base page(s) fetched from virtual
  2016-04-29 07:13:14 : 0 failure(s) on virtual
  2016-04-29 07:13:14 : 7 base page(s) fetched from
  2016-04-29 07:13:14 : 3 failure(s) on

As you can see, for me one source had (unlike yesterday) a total failure (0 base pages fetched, 136 failures), so maybe their site is down!

pensionado commented 8 years ago

I can send you the full log and XML file. What surprises me, though, is that while the extract seems to have a lot of missing data, the add indicated in the match log did not actually add the entries, even though the message said it did. So who is misleading whom?

hikavdh commented 8 years ago

OK, I looked a bit deeper at your npo3_entries.txt file. I see this entry: <programme start="20160502032700 +0200" stop="20160430063000 +0200" channel="0-3">, which starts after it stops. So yes, send me those two files and your configuration file. Also, what platform are you using? Linux, Windows, ...?

hikavdh commented 8 years ago

Some of the sources do not give an end time, so we have to take it from the start of the next program. For the last program it is set to somewhere in the next morning. My guess is that this somehow went wrong as a result of a gap in one of those sources. So that definitely is a bug.
In the new version 3, under construction, these things are handled differently, but it may be some months before that one is fully stable. So I will see if I can find the cause. Of course, if on the next run you do not have such a failure, it will go OK, as it does in my output.
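The stop-time derivation described above can be sketched roughly as follows. This is only an illustration, not the code from tvgrabnlpy: the function name `fill_stop_times`, the dict layout, and the 06:00 cut-off for the last program are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def fill_stop_times(programs, morning_hour=6):
    """Fill in missing 'stop' values for a list of program dicts.

    Each dict has a 'start' datetime and optionally a 'stop' datetime.
    The list is assumed to be sorted by 'start' already.
    """
    # A program without a stop time ends when the next program starts.
    for current, following in zip(programs, programs[1:]):
        if current.get("stop") is None:
            current["stop"] = following["start"]

    # The last program has no successor. Assumption: close it at the first
    # occurrence of `morning_hour` that lies after its start.
    if programs and programs[-1].get("stop") is None:
        last = programs[-1]
        cutoff = last["start"].replace(
            hour=morning_hour, minute=0, second=0, microsecond=0
        )
        if cutoff <= last["start"]:
            cutoff += timedelta(days=1)
        last["stop"] = cutoff
    return programs


night = [
    {"start": datetime(2016, 4, 30, 22, 0)},
    {"start": datetime(2016, 4, 30, 23, 30)},
    {"start": datetime(2016, 5, 1, 3, 27)},
]
fill_stop_times(night)
```

Note that this only gives sensible results when the list really is in chronological order; a gap or a misplaced entry makes the "next" start time meaningless.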

pensionado commented 8 years ago

I have seen this problem two days in a row now, so I started to dig deeper, but I am not familiar enough with the program to find the problem. I am running on Linux (openSUSE 13.2), and here are the files from this morning's reproduction run.

hikavdh commented 8 years ago

The program I pointed out should get removed in parse_programs, but somehow it instead results in those others getting removed. I'll see if I can find out this weekend how it got created and why it did not get removed.

pensionado commented 8 years ago

Found these entries (last 4):

  <programme start="20160430015000 +0200" stop="20160430023500 +0200" channel="0-3">
  <programme start="20160430023500 +0200" stop="20160430032600 +0200" channel="0-3">
  <programme start="20160502032700 +0200" stop="20160430063000 +0200" channel="0-3">
  <programme start="20160502032700 +0200" stop="20160502060000 +0200" channel="0-3">

and the third entry has its start time AFTER its stop time (the fourth starts at that same late timestamp). Is this causing the add to fail? Would a simple solution be to set the start time to the previous entry's stop time when the start time is illogical?
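A check for such entries in the XML output could look something like the sketch below. It uses only the Python standard library; the function name `find_backward_programmes` is made up for this example, and the sample wraps the four entries above in a `<tv>` root element with self-closing `programme` tags so that it parses on its own.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# XMLTV timestamps look like "20160430015000 +0200".
XMLTV_TIME = "%Y%m%d%H%M%S %z"


def find_backward_programmes(xml_text):
    """Return (channel, start, stop) for every programme with start > stop."""
    backward = []
    for prog in ET.fromstring(xml_text).iter("programme"):
        start, stop = prog.get("start"), prog.get("stop")
        if start is None or stop is None:
            continue  # nothing to compare
        if datetime.strptime(start, XMLTV_TIME) > datetime.strptime(stop, XMLTV_TIME):
            backward.append((prog.get("channel"), start, stop))
    return backward


sample = """<tv>
  <programme start="20160430015000 +0200" stop="20160430023500 +0200" channel="0-3"/>
  <programme start="20160430023500 +0200" stop="20160430032600 +0200" channel="0-3"/>
  <programme start="20160502032700 +0200" stop="20160430063000 +0200" channel="0-3"/>
  <programme start="20160502032700 +0200" stop="20160502060000 +0200" channel="0-3"/>
</tv>"""

for channel, start, stop in find_backward_programmes(sample):
    print("Found program for %s where start(%s)>stop(%s)" % (channel, start, stop))
```

Applied to the four entries above, it flags the entry that starts on 20160502 and stops on 20160430.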

pensionado commented 8 years ago

I see the pasted entries got dropped; I will try again.

hikavdh commented 8 years ago

If you add write_info_files = True to your config, some extra output is created in ~/.xmltv. Among these files is fetched-programs, which gives a short list of programs per source and the resulting merges.
Yes! You should place those entries between backquotes.

hikavdh commented 8 years ago

Or triple backquotes as I just did

hikavdh commented 8 years ago

And no, they should get removed, as I said earlier. The stop time is in part deduced; the start time is always leading.

hikavdh commented 8 years ago

My guess is that a gap causes a mix-up on the date.

pensionado commented 8 years ago

I just applied your recommended change to my config and reran the extract. Now the entries are no longer missing (and the illogical start times are gone). I can send you the data, but I don't think it is of any use, since the problem didn't happen in this run. If you want the files, I can send them (all the extra files?). In the meantime I will keep an eye out for illogical stop/start times, and if that happens I will collect all the necessary data.

hikavdh commented 8 years ago

Thanks! I have to go. I'll dig deeper later.

pensionado commented 8 years ago

On my last run this morning I had the situation where start was greater than stop. Output of my check: Found program for npo2 where start(20160503031100 +0200)>stop (20160501072500 +0200). Found program for npo3 where start(20160503024500 +0200)>stop (20160501062500 +0200). An earlier run did not have this problem, so maybe this is due to the site being updated?

Have attached all the files.

hikavdh commented 8 years ago

Thanks, I'll see what I can find. I think the failure is more likely caused by the site being too busy. I have found only one site that is really unreachable for some time in the early hours; its date change occurs somewhere between 4 and 6. You can try raising global_timeout from 10 to maybe 15.

hikavdh commented 8 years ago

This is getting weirder. The second page of three fails, but is successfully fetched on the second try. On that source all channels for one day are on the same page. Somehow the output for npo1 is OK, but the data for npo2 and npo3 gets corrupted. And then, as I said before, this backwards program item should get thrown away; instead the good programs are thrown away. So in essence these are two bugs colluding that normally would not give a problem.
By the way, there is a beta fixing the failures.

hikavdh commented 8 years ago

I think I found why the faulty program does not get removed, and I added a fix to the beta. Actually, this is a very old part of the code from before my time, and they simply forgot to add the delete after detection. ;-)
I'll look further for the real cause, but you should now get proper output.
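The kind of fix described here, actually dropping an item once it has been detected as running backwards, might be sketched like this. It is a hypothetical stand-in, not the real parse_programs code; the function name and the dict layout are assumptions.

```python
from datetime import datetime


def drop_backward_programs(programs):
    """Return only the programs whose start does not lie after their stop."""
    kept = []
    for program in programs:
        if program["start"] > program["stop"]:
            # Detection alone is not enough: the item must also be left out.
            continue
        kept.append(program)
    return kept


programs = [
    {"name": "good", "start": datetime(2016, 4, 30, 2, 35), "stop": datetime(2016, 4, 30, 3, 26)},
    {"name": "backward", "start": datetime(2016, 5, 2, 3, 27), "stop": datetime(2016, 4, 30, 6, 30)},
]
cleaned = drop_backward_programs(programs)
```

Building a new list rather than deleting from the list being iterated over avoids accidentally skipping or removing neighbouring items.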

hikavdh commented 8 years ago

I found the cause. It comes from the days not being fetched in order. Now I have to think about how to fix this.
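As an illustration only (the thread does not show the actual mechanism), here is one way out-of-order days could produce a backward item if stop times are taken from whatever program happens to come next in the list. The helper `derive_stops` and the sample times are invented for this sketch.

```python
from datetime import datetime


def derive_stops(programs):
    """Give each program the start of its successor in the list as stop time.

    The final program is left out because it has no successor.
    """
    return [dict(p, stop=n["start"]) for p, n in zip(programs, programs[1:])]


def backward(programs):
    """Return the programs whose start lies after their stop."""
    return [p for p in programs if p["start"] > p["stop"]]


day0 = [{"start": datetime(2016, 4, 30, 6, 30)}, {"start": datetime(2016, 4, 30, 7, 0)}]
day2 = [{"start": datetime(2016, 5, 2, 2, 0)}, {"start": datetime(2016, 5, 2, 3, 27)}]

# Day 2 arrives before day 0: the last item of day 2 gets a stop on day 0.
out_of_order = derive_stops(day2 + day0)

# Sorting by start time before deriving the stops avoids the problem.
in_order = derive_stops(sorted(day2 + day0, key=lambda p: p["start"]))

print(len(backward(out_of_order)), len(backward(in_order)))
```

Whether the real fix sorts the data or fetches the days in order is not stated in this thread.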

hikavdh commented 8 years ago

The new beta (besides handling the URL that changed yet again) now also fixes the underlying issue!