gene_server.py crashes when it has too many gts.py clients

GoogleCodeExporter commented 9 years ago

In my experience, running somewhere around 20-25 gts.py instances [on remote
machines] makes it crash within 10-15 minutes of runtime most of the time.

Unfortunately, no error is shown, and gal.py does not restart it
[after all, it is 'unmonitored' =]

The problems can escalate if there are open orders at that moment, because the 
bcbookie also terminates when it can't contact the gene server, and leaves the 
orders open.

Adding the gene server to the monitored processes [and assuming that it loads
its archived database automatically when [re]started by gal.py] would lessen
the impact.

bcbookie could cancel all the orders upon detecting a critical error [for
example, the gene server not responding] before terminating.

Thank you =]

Original issue reported on code.google.com by purge...@gmail.com on 25 Feb 2012 at 7:18

GoogleCodeExporter commented 9 years ago

I added exception handling to bcbookie. If bcbookie can't connect to the gene 
server, no new orders will be generated. Otherwise, bcbookie will continue to 
run normally. Once the gene_server is restarted bcbookie will automatically 
resume full operation.
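
As a rough illustration of that behaviour (this is not the actual bcbookie
code; the function names are hypothetical and it is written for Python 3,
where `xmlrpclib` became `xmlrpc.client`), the pattern is roughly:

```python
import xmlrpc.client


def gene_server_available(ping):
    """Return True if the zero-argument callable `ping` reaches the server.

    `ping` is a hypothetical probe, e.g. a cheap call on an
    xmlrpc.client.ServerProxy pointed at the gene server.
    """
    try:
        ping()
        return True
    except (OSError, xmlrpc.client.ProtocolError):
        # Connection refused, reset, timed out, or a bad HTTP response.
        return False


def run_cycle(ping, generate_orders, manage_open_orders):
    """One pass of a bookie-style loop (hypothetical structure)."""
    # Existing orders are always looked after, even if the server is down.
    manage_open_orders()
    if gene_server_available(ping):
        generate_orders()
        return True
    # Server unreachable: skip new orders this cycle and try again later.
    return False
```

Because the check is repeated on every cycle, order generation picks up again
on its own once the gene server is back.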

I've linked gal.py to the global_config.json file. Stderr filename variables 
have been defined in gal.py which can be set in the configuration file.

These changes are available on the code repository.

I'll look into the gene_server crashing after 20-25 instances.

Original comment by brian.mo...@gmail.com on 25 Feb 2012 at 11:40

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

I ran the system with 50 clients and didn't have an issue. However, I did 
notice that memory usage gets pretty high. I could see an issue with out of 
memory errors for systems with 6GB or less of available memory.

Original comment by brian.mo...@gmail.com on 29 Feb 2012 at 1:38

GoogleCodeExporter commented 9 years ago

Changed gal.py to launch the gene_server with pypy. The latest versions of pypy
seem to run much better and should increase performance by an order of
magnitude over regular CPython. Long-term testing will be required to verify
that memory usage over time has improved.

Added performance-capturing function decorators to the gene_server exposed
methods. This allows user monitoring of the gene server through the enhanced
user interface (see the /tools/nimbs folder for the node server.js). This
captures execution time but doesn't monitor gene_server response times. That
will need to be added to server.js.
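
For anyone curious what such a decorator looks like, here is a minimal sketch.
It is not the code in the repository; `call_stats`, `timed` and `get_gene` are
names made up for the example.

```python
import functools
import time

# Hypothetical in-memory store: method name -> call count and total seconds.
call_stats = {}


def timed(func):
    """Record how often `func` is called and how long it runs in total."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            stats = call_stats.setdefault(
                func.__name__, {"calls": 0, "total_seconds": 0.0}
            )
            stats["calls"] += 1
            stats["total_seconds"] += elapsed
    return wrapper


@timed
def get_gene(index):
    # Hypothetical stand-in for an exposed gene_server method.
    return {"index": index}
```

This only measures time spent inside the method. The time a client waits for
a reply (network plus queueing) would have to be measured on the caller's
side. The shared dictionary is also not protected by a lock, so a threaded
server would need one around the update.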

Original comment by brian.mo...@gmail.com on 28 Dec 2012 at 8:45

Changed state: Started

GoogleCodeExporter commented 9 years ago

I've made the gene_server threaded, which reduced the number of crashes a lot. 
It used to crash with a 'connection closed by peer'.

You can find my commit here, it's based upon the current HEAD in your repo:

https://github.com/timstoop/ga-bitbot/commit/0fe1cf34734750beae0ffd400c8bb80ffee38a89

Original comment by tim.st...@gmail.com on 24 Feb 2013 at 1:13

GoogleCodeExporter commented 9 years ago

For the record, I have had this code running for almost a day now and have had
far fewer issues than before. I run most scripts manually in a while loop which
sends me an email when it restarts. In the 24 hours before I applied the patch,
I had 166 of those emails. Since I applied the patch, I have received only 1.

Original comment by tim.st...@gmail.com on 25 Feb 2013 at 9:06

GoogleCodeExporter commented 9 years ago

http://effbot.org/zone/thread-synchronization.htm

If I need to place thread locks around every global variable access, wouldn't I
give up any benefit an async server would bring?

Original comment by brian.mo...@gmail.com on 26 Feb 2013 at 4:58

GoogleCodeExporter commented 9 years ago

Actually, I hadn't thought about that, but honestly, we don't need threading
for actual performance. We only need it to allow multiple clients to connect at
the same time, which currently causes a 'connection reset by peer' error on the
client, which terminates the process. The simple xmlrpc server is
single-threaded.

But good point about the locks, I hadn't thought about that. I think the 
easiest would be to import the with_statement and use a lock per request. I'll 
have a go at that this evening.
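
To make the idea concrete, here is a minimal sketch of what I mean, not the
final patch. It is written against the Python 3 module names (`socketserver`,
`xmlrpc.server`; in Python 2 these are `SocketServer` and
`SimpleXMLRPCServer`), and `gene_pool` / `put_gene` are placeholders for the
real shared state and exposed methods.

```python
import threading
from socketserver import ThreadingMixIn
from xmlrpc.server import SimpleXMLRPCServer


class LockedThreadedXMLRPCServer(ThreadingMixIn, SimpleXMLRPCServer):
    """Accept clients concurrently, but run one RPC method at a time."""
    daemon_threads = True

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.request_lock = threading.Lock()

    def _dispatch(self, method, params):
        # One lock per request: connections are handled in parallel threads,
        # but the method bodies (and the globals they touch) are serialized.
        with self.request_lock:
            return super()._dispatch(method, params)


gene_pool = []  # placeholder for the server's global state


def put_gene(gene):
    gene_pool.append(gene)
    return len(gene_pool)


def make_server(host="127.0.0.1", port=0):
    # port=0 lets the OS pick a free port; a real server would use a fixed one.
    server = LockedThreadedXMLRPCServer((host, port), logRequests=False)
    server.register_function(put_gene)
    return server
```

With this, a second client no longer gets its connection reset while the
first request is being processed. It simply waits for the lock.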

Original comment by tim.st...@gmail.com on 26 Feb 2013 at 7:43

GoogleCodeExporter commented 9 years ago

Ok, just had some time to look at it again. What do you think of this solution?

https://github.com/timstoop/ga-bitbot/commit/fdcb2464d08111b26b5640204b9bde9a877ebf66

Original comment by tim.st...@gmail.com on 28 Feb 2013 at 3:21

GoogleCodeExporter commented 9 years ago

Looks good :) 

If I understand correctly, this will allow asynchronous connections, which will
prevent client connection timeouts while under high load. The trade-off is
that each request will now spawn a new thread and must wait to acquire a lock
before processing the request.

A lot of the XML-RPC calls have processing times in the sub-millisecond range.
I wonder if we should go with a thread pool mixin to limit the reduction in
overall server throughput caused by constant thread spawning.

See:
http://code.activestate.com/recipes/574454-thread-pool-mixin-class-for-use-with-socketservert/
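
Roughly what I have in mind is sketched below. This is not the recipe's code:
it swaps in `concurrent.futures.ThreadPoolExecutor` from Python 3 for the
hand-rolled queue, and `pool_size` and the class names are assumptions made
for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from xmlrpc.server import SimpleXMLRPCServer


class ThreadPoolMixIn:
    """Handle requests on a fixed set of worker threads instead of
    spawning a new thread per request."""
    pool_size = 8  # assumed value; would need tuning under real load

    def _get_pool(self):
        # Created lazily; only the accept loop calls this, so no race here.
        if getattr(self, "_pool", None) is None:
            self._pool = ThreadPoolExecutor(max_workers=self.pool_size)
        return self._pool

    def process_request(self, request, client_address):
        # Called by the socketserver accept loop for each new connection.
        self._get_pool().submit(self._handle, request, client_address)

    def _handle(self, request, client_address):
        try:
            self.finish_request(request, client_address)
        except Exception:
            self.handle_error(request, client_address)
        finally:
            self.shutdown_request(request)

    def server_close(self):
        super().server_close()
        if getattr(self, "_pool", None) is not None:
            # Let queued requests finish before the workers exit.
            self._pool.shutdown(wait=True)


class PooledXMLRPCServer(ThreadPoolMixIn, SimpleXMLRPCServer):
    pass
```

The workers are reused, so the per-request cost is a queue hand-off rather
than a thread start. Access to shared state would still need the same locking
as in the threaded version.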

Original comment by brian.mo...@gmail.com on 1 Mar 2013 at 6:34

GoogleCodeExporter commented 9 years ago

We could at that, however, I currently have about 75 clients querying the same 
gene_server and I do not notice any obvious delay. The server only runs the 
server components, though, not any gts, so it's not that busy.

Original comment by tim.st...@gmail.com on 1 Mar 2013 at 10:42

khoffrath / ga-bitbot

gene_server.py crashes when it has too many gts.py clients #28