tvgrabbers / tvgrabnlpy

This version is deprecated, see: tvgrabpyAPI
https://github.com/tvgrabbers/tvgrabpyAPI
GNU General Public License v2.0

100% CPU load while running: is that 'normal'? #35

Closed: fransg closed this issue 9 years ago

fransg commented 9 years ago

Hi, is it normal that I am experiencing 100% CPU load (on one of four CPUs only) while running? The CPU gets really hot for the duration of a run, which takes about an hour.

I plan to look into the cause of this myself. First, I converted the script to Python 3 to see if that would make any difference. It does not. :-) Next, I built in some diagnostics/debugging code and I see that in fetching data for some 20 channels, it creates 28 threads. Then it seems to be working on fetching data for 2 channels concurrently. Once one of these channels finishes, `threading.active_count()` is reduced by one and a new channel starts being processed. I will dig into this some more. If anyone has any pointers, please share.

I am running on an Intel i5-4670 Haswell CPU, Fedora 22 Linux. Earlier I ran with Python 2.7, currently Python 3.4.
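For this kind of diagnosis, a small helper along these lines (a sketch using only the standard library, not code from the grabber) lists the live threads and their state:

```python
import threading

def dump_threads():
    # Print every live thread with its name and daemon/alive status.
    print("active threads: %d" % threading.active_count())
    for t in threading.enumerate():
        print("  %s (daemon=%s, alive=%s)" % (t.name, t.daemon, t.is_alive()))
```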

hikavdh commented 9 years ago

No, it's not normal. I've heard of it from people reporting hangs, but I've never seen it on my systems, so I don't know how it happens. You're right about the thread count: the main thread, one for every channel, one for every source (5), plus one that monitors and at intervals saves the cache, so 27 in your case. But they are mostly waiting. The time/speed factor is the number of fetches (they have a wait loop to prevent overburdening the sources).

I separated the fetching by the sources from the processing by the channels: the channels fill a queue with requests to the sources, then wait for the result and in the end create the output. This way you can also set channel-specific options. It is as if the `while True` loops that only check a variable create all kinds of activity. Maybe putting a small wait-timer in those loops?

While I did program in the past, I only discovered Python last year, so I still have to find out a lot about its specifics. So help is always welcome! If in your configuration you set `write_info_files = True` you get, or can create, extra debugging output. This is all in the InfoFiles class.

Hika
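A sketch of the channel/source separation described above, with hypothetical names: channel threads put fetch requests on a queue and block until the result arrives, while one thread per source serves the queue at a throttled pace, so most threads really are just waiting:

```python
import threading
import queue  # Queue in Python 2

quit_event = threading.Event()
source_queue = queue.Queue()

def source_worker():
    # One of these per source: serve fetch requests one at a time, with a
    # pause between fetches so the site is not overburdened.
    while not quit_event.is_set():
        try:
            url, result_box = source_queue.get(timeout=1)
        except queue.Empty:
            continue
        result_box['data'] = fetch(url)   # fetch() is hypothetical
        result_box['done'].set()          # wake the waiting channel thread
        quit_event.wait(2)                # throttle; also wakes on quit

def channel_fetch(url):
    # A channel thread files a request and blocks until the source answers,
    # instead of spinning on a status variable.
    result_box = {'done': threading.Event(), 'data': None}
    source_queue.put((url, result_box))
    result_box['done'].wait()
    return result_box['data']
```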

hikavdh commented 9 years ago

Ah, I suddenly remember what your thread 28 is. It's the FetchURL class: old pre-v2 code that does the actual URL request with a timeout.

fransg commented 9 years ago

OK, thanks. I will evaluate the flow and tweak where needed, probably this coming weekend. To be continued!

fransg commented 9 years ago

Hi Hika (and others), here is a preliminary result. There are two cases where the script is flooding itself with status inquiries: in the ChannelConfig class (it starts a thread to fetch program details, which takes several minutes to complete, but it requests a status update continuously) and in the ProgramCache class (there is a cache save interval, but the logic continuously checks whether the cache should be saved already). This causes an extremely high and unnecessary CPU load.

There is an easy fix for these two flaws that will make CPU usage drop from around 100% to less than 1%, while an update run takes just as long. The CPU load will still be high while processing cached data, but only for a relatively short period of time.

Having said this, the preferred solution is some serious refactoring. I will have a go at that, step by step. In the meantime, I suggest implementing these two quick fixes and beta-testing them on various platforms, hardware, configurations, etc.

Could you make the required changes or do you want me to do that?

Details on the fixes:

Best regards, Frans
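The fix details themselves are not quoted above, but the pattern both fixes share is simple: keep the polling loop, stop it from spinning. A sketch with a hypothetical done-flag:

```python
import threading
import time

details_done = threading.Event()   # set by the details thread when finished

# Before: the status check spins flat out and burns a whole core,
# re-checking the flag millions of times per second.
while not details_done.is_set():
    pass

# After: same logic, near-zero CPU; the run takes just as long because
# the real waiting is on the network fetches anyway.
while not details_done.is_set():
    time.sleep(0.1)
```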

hikavdh commented 9 years ago

Thanks! I'll look at those while/sleep loops. As said, I'm relatively new to Python and its quirks and am finding new things all the time; basically I learned the language last winter by writing this version 2. My standard impulse is to just write what I want, but I have to look more at what's already available. I don't have anything like a full overview of the multitude of libraries yet. In this respect I've been looking for some time at the different mechanisms in threads for locking and signalling, but I still need to do a lot more reading.

The thing is that a simple join wouldn't work. First, the thread is not finished as a whole, only a part of its work is. Second, it needs to keep an eye on a general quit. And some more things.

Anyway, any real changes have to go into the 2.2 branch. It already has some restructuring in the logging and the configuration. Yesterday I pushed what I had.

Hika
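For the join problem described above, one common shape (a sketch with assumed names, not the project's code) is a per-stage Event plus a general quit Event, so a waiter can be woken when part of the work is done and still notice a shutdown:

```python
import threading

quit_event = threading.Event()

class Stage:
    def __init__(self):
        # Signals that part of the thread's work is done,
        # even though the thread itself keeps running.
        self.base_done = threading.Event()

    def wait_for_base(self, poll=1.0):
        # True when the partial result is in, False on general quit.
        while not quit_event.is_set():
            if self.base_done.wait(poll):
                return True
        return False
```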

hikavdh commented 9 years ago

I also looked at the other wait/while loops. I don't know if you looked at 2.1.9 or 2.1.10; a few changes around these loops to prevent hanging were introduced in 2.1.10. In FetchData I moved the general wait a bit up, to cover the waiting from the tvgids.tv thread for possible follow-ups on failures on tvgids.nl. In Channel_Config I added two waits to the two while loops: first 1 second when waiting for the base pages from the different sources to come in, and second 2 seconds when waiting for the detail pages to come in. In these three spots there is now (in 2.1.10) also code to check aliveness between sources and channels.

In the cache I might be able to remove the wait loop; I have to check further. The counter is only raised on adding, so I can do the checking there. I only want to keep doing the saving in parallel, because that can take long. So the question is whether I can start that thread, to only save the cache, from within the class itself? I then only have to arrange the cleaning up on closing differently, but that can go into Configure.close(). Hika

hikavdh commented 9 years ago

Check this one out: https://github.com/tvgrabbers/tvgrabnlpy/releases/tag/beta-2.1.10-p20150807

hikavdh commented 9 years ago

It kept nagging in the back of my mind. It needs a small adjustment in the ProgramCache class. Line 3050 should be: `if self.counter > config.opt_dict['cache_save_interval']:` because within the 5-second wait multiple adds can easily happen! I'll update tomorrow.
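Putting this and the previous comment together, a sketch (hypothetical names, not the actual ProgramCache code) of checking the counter on add and saving in a parallel thread, with the `>` comparison from the quoted fix:

```python
import threading

class ProgramCache:
    """Sketch: the cache triggers its own parallel save from add()."""

    def __init__(self, save_interval):
        self.save_interval = save_interval
        self.counter = 0
        self.lock = threading.Lock()

    def add(self, program):
        # ... store the program ...
        start_save = False
        with self.lock:
            self.counter += 1
            # '>' rather than '==', as in the quoted fix: if a monitor
            # thread only checks periodically, several adds can land
            # between checks and step past the threshold.
            if self.counter > self.save_interval:
                self.counter = 0
                start_save = True
        if start_save:
            # Saving can take long, so keep doing it in parallel.
            threading.Thread(target=self.save).start()

    def save(self):
        pass  # write the cache to disk
```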

fransg commented 9 years ago

Nice, thanks! I have tested a few runs with the updated script and it now puts very little load on the CPU while still performing well, which is good. Reading/processing the cache still creates a heavy load; I will look into that to determine whether that is OK (it is just genuinely busy) or whether things need further optimization. I used 2.1.10 dated 2015-08-01 as the basis for my changes. And about your remark in the previous post: you are right, of course, that check is needed.

fransg commented 9 years ago

Hi Hika, I have just submitted a code contribution (updated tv_grab_nl)... not sure whether that was the correct way at all. Can you see my contribution? Is this something you can proceed with, or do I need to do something first? Please let me know. Best regards, Frans

hikavdh commented 9 years ago

I don't know how/what you did, but I only see your message.

hikavdh commented 9 years ago

As you saw, I incorporated part of your pull. I await further suggestions. If you create a pull request for the new log_file parameter against the tvgrab22 branch, I'll take it in. I would also like to ask whether you have ideas for issue 33 about the blocking cookie pop-up. The script either has to react to it interactively, or, if it then creates a cookie, create one itself in the right place and clear it again afterwards. For the moment I have no idea how to attack this. It's not extremely urgent, because there are the backup JSON pages, but they contain less data. Again thanks for your input! :-) Hika

winfried commented 9 years ago

On 08/07/2015 10:44 PM, Hika van den Hoven wrote:

Hi Hika,

> I'll look at those while/sleep loops.

:-) Yes, they are a real CPU-saver, even if you use a short sleep period like `sleep(0.05)`.

> As said, I'm relatively new to Python and its quirks and am finding new things all the time; basically I learned the language last winter by writing this version 2. My standard impulse is to just write what I want, but I have to look more at what's already available. I don't have anything like a full overview of the multitude of libraries yet. In this respect I've been looking for some time at the different mechanisms in threads for locking and signalling, but I still need to do a lot more reading.
>
> The thing is that a simple join wouldn't work. First, the thread is not finished as a whole, only a part of its work is. Second, it needs to keep an eye on a general quit. And some more things.

Well, I have been programming quite complex stuff in Python for 13 years now, so I have seen some of its quirks... Unfortunately I have too little time to translate my knowledge into tvgrabnlpy code, though I tried a bit here: https://github.com/winfried/tvgrabnlpy

I am convinced that most of the issues raised on the tvgrabnlpy mailing list in recent months are a combination of quirks in the error handling and issues in the threading. So, if you don't mind, I have three suggestions that may improve the stability of the code.

1) Use of exceptions (see also my fork, where I had this almost fixed). In the code there is quite often a construct like this:

```python
try:
    do_something()
except:
    do_something_else()
```

This is often used in tvgrabnlpy to handle situations like an item that is not found. But a generic 'except' like this catches ALL errors. So if something else goes wrong, it also falls back to 'doing something else' and you will never know what went wrong. To make things worse: even a ctrl-c (which raises a KeyboardInterrupt exception) may trigger that except. So the code after a bare 'except' should always handle other kinds of exceptions too and log what happened; otherwise you will never know what exactly happened.

For the case of checking the existence of an item, you may use a specific exception handler like:

```python
try:
    do_something()
except IndexError:
    do_something_else()
```

or better:

```python
if item:
    do_something()
else:
    do_something_else()
```

(`if item:` may also be something like `if len(item) > 1:` or `if 'description' in item:`)
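If a broad handler really has to stay, a sketch of the minimum it should do (illustration only, hypothetical names): let KeyboardInterrupt through and log everything else:

```python
import logging

log = logging.getLogger(__name__)

try:
    process_item(item)        # process_item() and item are hypothetical
except KeyboardInterrupt:
    raise                     # never swallow ctrl-c
except Exception:
    # Log the full stack trace instead of failing silently,
    # then fall back as before.
    log.exception('processing of item failed')
    do_something_else()
```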

2) Threading. I think it is a natural choice to use threading in tvgrabnlpy: there are many fetches that can be executed in parallel. But threading in Python is quite dangerous: Python does not automatically handle the issues that may arise from parallel processing. In short: you can never assume that a shared resource you are using is accessible or in a consistent state, because another thread may be accessing it at the same time. To make matters worse: not all Python modules are thread safe. For example, I have run into an XMPP module that ended up mixing up sockets when used in a threading environment. So you have to check, for every module you use in a thread, whether it is thread safe.

In tvgrabnlpy I observed two actions that are very prone to threading issues: writing to the cache and writing to the logfile.

There are basically two strategies to avoid threading issues in Python: either protect every shared resource with a lock, so only one thread can touch it at a time, or give each shared resource a single owning thread and let the other threads talk to it through a queue.
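A compact sketch of both strategies (illustration only, hypothetical names):

```python
import threading
import queue  # Queue in Python 2

# Strategy 1: protect the shared resource with a lock.
counter_lock = threading.Lock()
counter = 0

def bump_counter():
    global counter
    with counter_lock:            # only one thread at a time
        counter += 1

# Strategy 2: give the resource one owning thread, fed through a queue.
cache_queue = queue.Queue()

def cache_writer(cache_file):
    # The only code that ever touches cache_file.
    while True:
        record = cache_queue.get()
        if record is None:        # sentinel: shut down
            break
        cache_file.write(record)
```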

Also be aware that a keyboard interrupt or a kill signal results in an exception in the code that just happens to be executing at that moment. So every thread should be able to handle those signals and coordinate with the controlling thread to close all threads down.
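One common shape for that coordination (a sketch, not the grabber's code): the main thread translates ctrl-c into a quit event that every worker loop checks:

```python
import threading

quit_event = threading.Event()

def worker():
    while not quit_event.is_set():
        do_one_step()              # hypothetical unit of work

def main():
    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    try:
        while any(t.is_alive() for t in threads):
            for t in threads:
                t.join(0.5)        # short timeout keeps ctrl-c deliverable
    except KeyboardInterrupt:
        quit_event.set()           # tell every worker to wind down
        for t in threads:
            t.join()
```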

3) Logging. The Python 'logging' module is a bit of a beast to set up and scares people off (at a Python conference somebody once called it 'the school example of an over-engineered module'). But it can be a real rescuer for tvgrabnlpy, because it is thread safe. Besides that, the logging module has some really nice goodies, like a nice (and adaptable, if you want) structure of log levels including debug logging, and a `log.exception()` function that formats and logs a stack trace. Don't reinvent the wheel in a threaded environment like tvgrabnlpy; use the Python logging module.
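A minimal setup along those lines (illustration only; the filename and the fetch function are hypothetical):

```python
import logging

logging.basicConfig(
    filename='tv_grab_nl.log',
    level=logging.DEBUG,
    format='%(asctime)s %(threadName)s %(levelname)s %(message)s')

log = logging.getLogger('tvgrabnlpy')

try:
    risky_fetch()
except Exception:
    log.exception('fetch failed')   # logs the message plus a stack trace
```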

To go back to my general observations on tvgrabnlpy: I believe in many cases something like this happens: an initial error occurs (a failed fetch, an unexpected page), a bare except swallows it without a message, a thread then keeps waiting for a result that will never come, and the script hangs without telling you what went wrong.

Right now we are hunting the initial errors and trying to fix those. But it takes a lot of energy, because we have vague hangs (partly because of threading issues and partly because there are no clear error messages). With some structural improvements it should be possible to improve the stability of tvgrabnlpy a lot and to make fixing issues much easier.

I hope this helps. I am very limited in my time, but if there is anything more I can do, please let me know. Also, Hika, if you would like some paper books on Python: I can donate you my (old) Python books. These days I only use the online documentation.

Winfried

hikavdh commented 9 years ago

Thanks Winfried! I actually found your fork some time ago and took a lot from it. I wondered why you didn't call back after putting that effort in there. Most of the things you mention have by now been implemented in some form. I recently rewrote the logging module for the new 22 branch: it's now an independent thread fed by a queue, to group lines you send them as a list, and if the file is not yet open it holds all non-critical messages.

Like a good programmer I learn mainly by example (or, to put it more crudely, by stealing and adapting) and I learn fast, but there is a lot. While I have a born knack for programming, Python feels like it was made for me; I should have discovered it years ago. That sentinel message mentioned in your link to Safari Books is actually exactly what I was looking for for the communication between the channels and the sources. The queue is there and it works better than counting.

I also plan on doing a full code review before I publish the 22 branch. I think the code is growing to maturity. The next step is a graphical configuration tool. And yes please, I would love to wrestle through that book! :-)

Hika
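A sketch of the logging design described above (assumed names, not the actual 2.2 code): one thread owns the file, is fed through a queue, accepts grouped lines as a list, holds messages back until the file is open, and shuts down on a sentinel:

```python
import queue
import threading

QUIT = object()                             # sentinel: shut the thread down

class LogThread(threading.Thread):
    """One thread owns the log file; everyone else just queues messages."""

    def __init__(self):
        threading.Thread.__init__(self, name='logger')
        self.queue = queue.Queue()
        self.file = None
        self.held = []                      # messages arriving before the file is open

    def run(self):
        while True:
            msg = self.queue.get()
            if msg is QUIT:
                break
            if isinstance(msg, tuple) and msg[0] == 'open':
                self.file = open(msg[1], 'a')
                for line in self.held:      # flush what was held back
                    self.file.write(line)
                self.held = []
                continue
            lines = msg if isinstance(msg, list) else [msg]
            if self.file is None:
                self.held.extend(lines)
            else:
                for line in lines:
                    self.file.write(line)

# Usage:
#   logger = LogThread(); logger.start()
#   logger.queue.put('starting run\n')
#   logger.queue.put(('open', '/tmp/grab.log'))
#   logger.queue.put(['line 1\n', 'line 2\n'])   # grouped lines as a list
#   logger.queue.put(QUIT); logger.join()
```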

winfried commented 9 years ago

On 09-08-15 13:07, Hika van den Hoven wrote:

Hi,

> Thanks Winfried! I actually found your fork some time ago and took a lot from it. I wondered why you didn't call back after putting that effort in there.

I am a bit too much of a perfectionist; I wanted my work to be complete before issuing a pull request...

> And yes please, I would love to wrestle through that book! :-)

Please send me a mail directly, then we can work out how to get my books to you.

Winfried

hikavdh commented 9 years ago

Hi Frans, see: https://github.com/tvgrabbers/tvgrabnlpy/releases/tag/beta-2.1.10-p20150810. I also introduced messenger packages in the queues, which at first gave me quite a headache: by the time one of the two queues ran out it froze, and sometimes earlier. It took me some time to find the cause in the channel joins in main(). You cannot join an already terminated thread or the whole process will freeze, so I added a check on aliveness. Can you check the influence on CPU load now that the sleep is replaced by Event.wait(5)? I'm now going to add these Events to the 22 branch and am thinking of adding an overall statistics report. I'm closing this issue, as the high load is solved.

Hika
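The two changes described above, as a sketch (channel_threads is a hypothetical list of threading.Thread objects): an interruptible Event.wait(5) instead of a fixed sleep, and an aliveness check before each join:

```python
import threading

quit_event = threading.Event()

# Instead of time.sleep(5): same pause, but wakes immediately on quit.
quit_event.wait(5)

# Join the channel threads, but only those still alive.
for t in channel_threads:
    if t.is_alive():
        t.join()
```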

fransg commented 9 years ago

OK, thanks for the update. I will take it for a spin on my server and get back to you (maybe later today, maybe tomorrow).

Best regards,

Frans


fransg commented 9 years ago

Hi Hika, no good news I'm afraid. CPU usage is OK, but I tried three times and each time it appeared to hang. The first time I started it with a populated cache, twice without, and I tried different combinations of channels. It gets stuck (i.e. no further screen updates, no file updates) very quickly, after perhaps 10 seconds. I haven't waited very long; I will let the current run continue to see if it ends at all. In 'top' I see frequent but very limited activity for python2 (blips of less than 1% CPU usage).

Hope this helps.

I am happy to help you troubleshoot, but the upcoming days I am busy at work. When I have an opportunity I will have a look.

Best regards,

Frans


hikavdh commented 9 years ago

That's weird. I had this, but only after some time, and I attributed it to a join on a no-longer-living thread. After I fixed that it ran perfectly. Your residual activity then should be that waiting join.

hikavdh commented 9 years ago

I tried again, and just to be sure I have the same version as you, I downloaded it. It runs OK here. You can try two things. First, I now see that, if my earlier conclusion stands, line 9153 also needs an alive check. Second: comment out lines 9146-9148; they were what gave me the hangs and headaches. You might however want to put a 1-second wait there; that is about the maximum time the last channel needs to finish after the last source finishes.

Alternatively your system is missing a resource I have present. But then you would expect an error.

You can also check what https://github.com/tvgrabbers/tvgrabnlpy/releases/tag/beta-2.1.10-p20150809 does. That one is with the Events but without the closing messages.

Hika

fransg commented 9 years ago

The version I am running is "The Netherlands: tv_grab_nl_py (Version: 2.1.10-p20150810-beta)"

Yesterday's run was still not finished when I checked this morning. I restarted it this morning, again without cache, and that ran OK. Then I started again (with the cache populated) and it got stuck again. So something is not quite right, it seems.

I think we need to go for the option of refactoring the multi-threading logic. While that does take some time and effort, it is the only fundamentally right approach. If I have a little time I'll jump in... :-)

Best regards,

Frans


hikavdh commented 9 years ago

OK, I added locking for the counters, which seems to solve the issue, and, to be complete, also on the source data: https://github.com/tvgrabbers/tvgrabnlpy/releases/tag/beta-2.1.10-p20150811
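What locking the counters amounts to, as a sketch (assumed names, not the actual patch): both the increment and the read take the same lock, so a comparison never sees a half-updated value:

```python
import threading

class Counter:
    """Thread-safe counter for the channel/source bookkeeping."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:      # reads take the lock too
            return self._value
```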