benchmark.py throws SystemError

MichaelAz commented 10 years ago

When running benchmark.py as found in master, a SystemError is raised:

SystemError: (libev) select: Unknown error
Traceback (most recent call last):
  File "C:\Users\zeev\Desktop\goless\benchmark.py", line 111, in <module>
    main()
  File "C:\Users\zeev\Desktop\goless\benchmark.py", line 105, in main
    prime()
  File "C:\Users\zeev\Desktop\goless\benchmark.py", line 100, in prime
    bench_selects()
  File "C:\Users\zeev\Desktop\goless\benchmark.py", line 68, in bench_selects
    took_nodefault = bench_select(False)
  File "C:\Users\zeev\Desktop\goless\benchmark.py", line 62, in bench_select
    selecting.select(cases)
  File "C:\Users\zeev\Desktop\goless\goless\selecting.py", line 93, in select
    _be.yield_()
  File "C:\Users\zeev\Desktop\goless\goless\backends.py", line 138, in yield_
    gevent.sleep()
  File "C:\Python27\lib\site-packages\gevent\hub.py", line 73, in sleep
    waiter.get()
  File "C:\Python27\lib\site-packages\gevent\hub.py", line 569, in get
    return self.hub.switch()
  File "C:\Python27\lib\site-packages\gevent\hub.py", line 332, in switch
    return greenlet.switch(self)
SystemError: (libev) select: Unknown error

would_deadlock passes and this consistently happens at the 495th iteration (at least for me).

I'm running the code on Windows 7 with the gevent backend, and gevent==1.0.1, greenlet==0.4.2.

rgalanakis commented 10 years ago

Not getting this behavior on Linux with those versions. Will get a Windows7 virtualbox set up to try things out.

rgalanakis commented 10 years ago

If you just run: from goless.backends import current; current.yield_(), (something like that) what happens? gevent has never had an issue yielding on the last greenlet, so I don't know where this behavior is coming from... (I am adding some tests to verify this behavior). Will probably be late Sunday when I am able to look into this on Windows, have weekend plans.

MichaelAz commented 10 years ago

That code runs fine. I'll investigate further, see if I can find anything useful.

MichaelAz commented 10 years ago

So, something interesting right off the bat. The benchmark contains this code:

def main():
    prime()
    bench_channels()
    bench_selects()

prime just runs the benchmarks without writing any output, so we can ignore it, but an interesting thing happens when we comment out bench_channels - the error raised by bench_selects magically transforms into a Deadlock error.

The reason is that running bench_channels changes the error's location. When it's run, the error happens in selecting.py, line 93, in the statement _be.yield_(). When it isn't run, the error happens in selecting.py, line 92, in the statement return c, c.exec_(). exec_ causes a send/receive, which is wrapped by the _as_deadlock decorator and thus produces a sane error. yield_ isn't wrapped by that decorator, so we get the cryptic error. So perhaps we should think about wrapping exceptions thrown in yield_. Next.
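A minimal sketch of what wrapping yield_ could look like. Deadlock, as_deadlock, and broken_yield below are stand-ins written for illustration, not the actual goless code, and the choice of SystemError as the exception to translate is an assumption. It uses Python 3's `raise ... from` to keep the original traceback; on 2.7 something like six.reraise would be needed instead.

```python
import functools


class Deadlock(Exception):
    """Stand-in for the library's deadlock error (hypothetical)."""


def as_deadlock(*exc_types):
    """Translate the given low-level exception types into Deadlock."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except exc_types as exc:
                # Chain the original error so its traceback is not lost.
                raise Deadlock(str(exc)) from exc
        return wrapper
    return decorator


@as_deadlock(SystemError)
def broken_yield():
    # Simulates the backend failing while yielding (hypothetical).
    raise SystemError("(libev) select: Unknown error")
```

Calling broken_yield() then raises Deadlock, with the original SystemError available as its __cause__.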

Inside bench_selects, it is specifically the call to bench_select(False) that raises the exception. The reason for this difference in behavior is that passing True to bench_select causes a dcase to be added to the case list, so when none of the other channels are ready the script doesn't throw but uses that default case instead.
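To illustrate why the default case hides the problem, here is a toy model of the two paths. It is not how goless implements select: toy_select, DEFAULT, and this Deadlock class are hypothetical, plain deques stand in for channels, and where the real code would yield to the scheduler and retry, this toy just raises.

```python
from collections import deque


class Deadlock(Exception):
    """Stand-in error for 'nothing is ready and nothing else can run'."""


DEFAULT = object()  # hypothetical marker playing the role of a default case


def toy_select(queues, has_default):
    """Return (index, value) for the first non-empty queue.

    If none is ready, return (DEFAULT, None) when a default case was
    supplied; otherwise raise, since this toy has no scheduler to yield to.
    """
    for index, queue in enumerate(queues):
        if queue:
            return index, queue.popleft()
    if has_default:
        return DEFAULT, None
    raise Deadlock("no case is ready and there is no default")
```

With has_default=True the not-ready path returns immediately, so the yielding code that fails in the benchmark is never reached.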

There's some subtle race condition here, I believe, with sending to a full channel, because switching to a buffered channel with buffer size 2. I honestly have no idea what's going on here but I re-wrote it from scratch and it seems to work now. Unless you find a better explanation for this behavior, I think I'll commit the re-written version.

rgalanakis commented 10 years ago

Ok, I've improved the behavior of asdeadlock to include the original stacktrace, and yield should not raise if it's the last tasklet. I'll dig into this on Windows now.

rgalanakis commented 10 years ago

May take a while to get my Windows box set up for development... in the meantime, could you try with the tip of gevent in github?

> There's some subtle race condition here, I believe, with sending to a full channel, because switching to a buffered channel with buffer size 2.

Yes very likely. We suspect this is why the pypystackless tests don't work either. I will work through this code and see.

Also, going from the gevent docs, it appears libev has some problems on Windows: not just bugs but also unknown errors. There could also be some gevent->libev bugs on Windows.

rgalanakis commented 10 years ago

Ok so here's some progress for the morning. A bit of a mind-dump, maybe writing it out will help uncover something?

I can repro easily (on Windows only) by taking the bench_select code into a script and running that. Unfortunately the behavior disappears within a test framework or under the debugger!

This has nothing to do with a deadlock, so I've removed the as_deadlock catch for SystemError. We are putting gevent/libev into a bad state somehow- I suspect the same thing is happening that is causing pypystackless to be in a bad state. It's the same sort of thing- symptom is that there's no runnable tasklet or whatever, but that cannot really be. Solving one may solve the other! (See #2 )

This is where it gets interesting. On my machine, I consistently fail at iteration 997. However, if instead of:

def sender():
    while True:
        c.send(0)
        c.recv()

I have (you may need to import backends first):

def sender():
    while True:
        c.send(0)
        backends.current.yield_()
        c.recv()

I fail on iteration 499, which is about half of 997. Do you get the same behavior @MichaelAz, or is that just coincidence on my end? I suspect you are spot on, that the problem is send/recv to a full channel and the behavior that goes on there. The semantics are not totally clear: a blocked send will of course yield, but how about an unblocked send? I can't remember if it's tested, or even defined. There are some potential problems to work through. Will keep the thread updated over the next few days.

Updates:

Update 1: Ah, looking at BackendChannelSenderReceiverPriorityTest, maybe something lies there...
Update 2: Wrote a test to verify that successful send or recv do not yield control. Found an issue! Investigating- feel like I'm on the right track.
Update 3: Added tests to verify behavior of send/recv priority. Also found that I do need to catch SystemError on Windows in the event of a deadlock. from gevent.queue import Channel; Channel().get() will raise a SystemError on Windows but LoopExit on Linux.

rgalanakis commented 10 years ago

Okay, confirmed a few things. Basically, something that will deadlock or run perfectly well on Linux will raise on Windows:

from gevent.queue import Channel
import gevent
c = Channel()
def sender():
    while True:
        c.put(0)
gevent.spawn(c.put, 1)
for i in range(1000):
    gevent.sleep(0)

Will exit fine on Linux, will error on Windows. I also cannot replicate in all cases, like under a test runner.

I can catch the error in select and ignore it to replicate the Linux behavior on Windows. Other than the performance cost and the risk of more Windows bugs in the future, I'm not sure what else we can do. It's up to libev/gevent to fix.
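Roughly what I mean by catching and ignoring it, as a sketch only. yield_quietly is a hypothetical helper, not what is in the repo; yield_func stands in for the backend's yield call, and the platform argument exists just so the behavior can be exercised off Windows.

```python
import sys


def yield_quietly(yield_func, platform=sys.platform):
    """Call yield_func, swallowing SystemError only on Windows.

    Returns True if the yield completed normally, False if the error
    was swallowed.
    """
    try:
        yield_func()
    except SystemError:
        if platform != "win32":
            raise  # elsewhere a SystemError is unexpected; let it surface
        return False
    return True
```

The downside is the one already mentioned: a blanket except on Windows could also hide a genuinely different SystemError.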

MichaelAz commented 10 years ago

It's been a crazy week, I'll go over your updates more thoroughly tomorrow evening/Friday morning.

MichaelAz commented 10 years ago

> This is where it gets interesting. On my machine, I consistently fail at iteration 997. However, if instead of:
>
> def sender():
>     while True:
>         c.send(0)
>         c.recv()
>
> I have (you may need to import backends first):
>
> def sender():
>     while True:
>         c.send(0)
>         backends.current.yield_()
>         c.recv()

I'm getting the same behavior.

If this is really a bug in gevent on Windows, we ought to open an issue with them. But since this is (probably) related to the pypystackless bug, perhaps we're at fault. I really don't know. Could you link to the docs you mentioned about gevent having problems on Windows?

rgalanakis commented 10 years ago

When I dug into it, I don't think the pypystackless and gevent-windows problems are related. I think this is genuinely a bug in gevent/libev on Windows, as I was able to repro it in a purely gevent environment (see my previous comment).

Regarding the links, I wish I had taken better notes. I can only find a few pages, mostly concerning gevent's switch from libevent to libev, and libev's inferior Windows support:

Specifically there was a page I cannot now find that said something like "There should be fewer unknown errors on Windows"- I think it was for gevent (a changelog?) but could have been for libev as well. I will open a ticket with the gevent repo.

MichaelAz commented 10 years ago

As discussed in surfly/gevent#459, adding a call to import socket seems to solve the issue.

I'm creating a PR for this, even though the solution is extremely hacky. I'd say calling WSAStartup with ctypes is less hacky but it'll just require us to re-implement the relevant part of socket and that's not DRY.
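For reference, the workaround amounts to something like the sketch below. ensure_winsock_initialized is a hypothetical name, not what the PR uses, and it relies on my understanding that CPython's socket module calls WSAStartup when it is first imported on Windows. The platform argument is only there to make the behavior easy to check on other systems.

```python
import sys


def ensure_winsock_initialized(platform=sys.platform):
    """Import socket on Windows purely for its Winsock-initializing side effect.

    Returns True if the import was performed, False if it was skipped.
    """
    if platform != "win32":
        return False
    import socket  # noqa: F401  (imported only for the side effect)
    return True
```

This would need to run before the first call into gevent's event loop for it to help.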

rgalanakis commented 10 years ago

Confirmed this fixed the issue on my Windows virtualbox. I am flabbergasted by this issue. Hopefully gevent fixes the actual problem. Any solution on our side is 'hacky', so don't worry about importing socket not being optimal.

rgalanakis / goless

benchmark.py throws SystemError #28