nim-lang / Nim

Nim is a statically typed compiled systems programming language. It combines successful concepts from mature languages like Python, Ada and Modula. Its design focuses on efficiency, expressiveness, and elegance (in that order of priority).
https://nim-lang.org
Other
16.55k stars 1.47k forks source link

Awaiting for Future[void] fails if future never completes or fails #14564

Open Q-Master opened 4 years ago

Q-Master commented 4 years ago

When trying to wait for the guard future I've encountered a wrong behaviour, IMHO. The idea is to wait some arbitrary future, which might be completed from some outer code not depending on guarded async code completed or not.

This code

import asyncdispatch

var guardfuture = newFuture[void]("test.guardfuture")

proc a() {.async.} =
    await sleepAsync(1000)

proc b() {.async.} =
    var af: auto = a()
    await guardfuture
    await af

proc fireGuard() =
    guardfuture.complete()

proc testGuard() {.async.} =
    await b()

proc stop() {.noconv.} =
    fireGuard()

when isMainModule:
    setControlCHook(stop)
    waitFor testGuard()
    quit(QuitSuccess)

should wait forever till ctrl+C pressed, but it fails just after proc a completes, with a message

/usr/local/Cellar/nim/1.2.0/nim/lib/pure/asyncdispatch.nim(1886) waitFor
/usr/local/Cellar/nim/1.2.0/nim/lib/pure/asyncdispatch.nim(1576) poll
/usr/local/Cellar/nim/1.2.0/nim/lib/pure/asyncdispatch.nim(1291) runOnce
Error: unhandled exception: No handles or timers registered in dispatcher. [ValueError]

but if I comment out awaiting the guardfuture in proc a, everything works without errors, but not as expected and quits just after a() completes.

If I press ctrl+c while timer is ticking, everything also works as a charm.

$ nim -v
Nim Compiler Version 1.2.0 [MacOSX: amd64]

The latest build from source has the same issue.

$ ./bin/nim -v
Nim Compiler Version 1.3.5 [MacOSX: amd64]
dom96 commented 4 years ago

Might be about time we just make the event loop call sleep if it has nothing else to do. If you want a workaround add a dummy timer using addTimer.

Q-Master commented 4 years ago

Adding dummy timer helped, thanks. But anyways this is a weird workaround, so waiting for sleep implementation.

dom96 commented 4 years ago

Most apps will be doing some actual IO work, if you're not then there is no reason to use async await. So in real applications you shouldn't need that workaround.

I wouldn't hold my breath on a fix here.

Q-Master commented 4 years ago

As I wrote in example you definitely not know if guarded async task finished or not, but the code should wait for some event. The code should not fail with exception if the main loop has no more tasks to do. It might happen that there's some sync code without IO working in the background, and will produce the event which restarts the async part of the code.

Araq commented 4 years ago

@dom96 I think this should work and not everybody uses async only for IO, a task system is generally useful. Is it hard to fix/change?

dom96 commented 4 years ago

Like I said, I guess we can just add a sleep in this case. But I haven't thought hard about it.

capocasa commented 2 years ago

Just got bit by this- understandable as an 80/20 rule thing but definitely causes adoption friction

dom96 commented 2 years ago

Definitely and I'd love for this to be fixed.

Some questions:

capocasa commented 2 years ago

I don't know, I'd have to learn a great deal about python implementation details from scratch to find the relevant code.

The best I can do right now is share my hunch that this is one of those situations where the best available solution is to accept that we are going to create a wart. I suspect python did that too because they had to find a solution within the same constraints. Like you said- why do async without i/o? If you're not waiting for an i/o, you by definition have everything you need to continue, so don't await anything, just run the code now.

After a fair bit of trying, the only use-case for async without I/O I could come up with is a dummy program to learn more about how async behaves.

Therefore, I'm fairly confident it would be okay to scrap the error message and create a dummy timer in its place. That's pretty darn ugly, but that's okay, because it's only ever going to be in people's learning code.

dom96 commented 2 years ago

I don't know, I'd have to learn a great deal about python implementation details from scratch to find the relevant code.

Not necessarily. I just mean: does Python also exit with an error? Does it spinlock the CPU up to 100%? or does it just work (TM)?

capocasa commented 2 years ago

Oh, okay- yeah in Python it just works, the reproduction code for this issue is literally python's Hello World example.

Not sure what your frame of reference here is, if by prototyping you mean describing desired behavior- Yeah I think this should just work in Nim too.

# Python

# this works
import asyncio

async def main():
    print('Hello ...')
    await asyncio.sleep(1)
    print('... World!')

asyncio.run(main())
# Nim

# this errors out with "No handles or timers registered in dispatcher."
import asyncdispatch

proc foo() {.async.} =
  await sleepAsync(1000)
  echo "foo done"

asyncCheck (foo())
runForever()
dom96 commented 2 years ago

Those examples aren't equivalent though, asyncio.run is equivalent to Nim's waitFor:

import asyncdispatch

proc foo() {.async.} =
  await sleepAsync(1000)
  echo "foo done"

waitFor foo() # This shouldn't fail

or maybe I'm wrong and asyncio.run will actually run forever?

capocasa commented 2 years ago

Yeah, you're right! Here's the actual equivalent:

# Python

import asyncio

async def main():
    print('Hello ...')
    await asyncio.sleep(1)
    print('... World!')
    # await asyncio.start_server(lambda r, w : nil, 'localhost', 9999)

asyncio.ensure_future(main())
asyncio.get_event_loop().run_forever()

Doesn't seem to make any difference, works fine.

I got the idea to trace the system calls instead of code digging in the python sources- if I interpret the linuxese correctly, python does stuff on a nonexistant dummy, or perhaps something like stdio?

strace python r14564.py
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=39502, ...}, AT_EMPTY_PATH) = 0
read(3, "\"\"\"Event loop using a selector a"..., 39503) = 39502
read(3, "", 1)                          = 0
close(3)                                = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f136bb78000
epoll_create1(EPOLL_CLOEXEC)            = 3
socketpair(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0, [4, 5]) = 0
getsockname(4, {sa_family=AF_UNIX}, [128 => 2]) = 0
getsockname(5, {sa_family=AF_UNIX}, [128 => 2]) = 0
ioctl(4, FIONBIO, [1])                  = 0
ioctl(5, FIONBIO, [1])                  = 0
epoll_ctl(3, EPOLL_CTL_ADD, 4, {events=EPOLLIN, data={u32=4, u64=139719581106180}}) = 0
getpid()                                = 20533
epoll_wait(3, [], 1, 0)                 = 0
write(1, "Hello ...\n", 10Hello ...
)             = 10
getpid()                                = 20533
epoll_wait(3, [], 1, 1000)              = 0
epoll_wait(3, [], 1, 0)                 = 0
write(1, "... World!\n", 11... World!
)

However, if the line containing asyncio.start_server is uncommented, you get this portion of system calls:

getsockname(4, {sa_family=AF_UNIX}, [128 => 2]) = 0
getsockname(5, {sa_family=AF_UNIX}, [128 => 2]) = 0
ioctl(4, FIONBIO, [1])                  = 0
ioctl(5, FIONBIO, [1])                  = 0
epoll_ctl(3, EPOLL_CTL_ADD, 4, {events=EPOLLIN, data={u32=4, u64=139968689209348}}) = 0

So although the details are murky to me, it seems clear that python created special code for the no-io case to ensure it Just Works (tm), apparently by opening some kind of dummy file to work on.

In my humble opinion, it doesn't really matter how Nim solves it internally, as long as it Just Works (tm) as well. Since all solutions I can think of have rely on special treatment for the edge case, all seem to increase complexity by the same amount to solve the same number of cases (1). There for they are most likely all equally ugly. I'd probably go for adding a dummy timer since it's the first one one you came up with as a workaround. But using some kind of dummy file for the epoll call in the absence of a proper file or socket seems fine as well, possibly more elegant, most likely a lot more work.

capocasa commented 2 years ago

I was playing around with the code in lib/pure/asyncdispatch.nim (Nim 1.6.2) and by coincidence I commented out the ValueError at the beginning of the runOnce proc and ran my reproduction code.

# lib/pure/asyncdispatch.nim
# (...)
  proc runOnce(timeout = 500): bool =
    let p = getGlobalDispatcher()
    # if p.selector.isEmpty() and p.timers.len == 0 and p.callbacks.len == 0:
    #  raise newException(ValueError,           
    #    "No handles or timers registered in dispatcher.")
    result = false
    var keys: array[64, ReadyKey]
    let nextTimer = processTimers(p, result)
# (...)

I expected the CPU to saturate, but, interestingly, that didn't happen. On the contrary, my reproduction code now worked as expected.

# myasync.nim
import asyncdispatch

proc foo() {.async.} =
  await sleepAsync(1000)
  echo "foo done"
  # look ma! no timer!

asyncCheck (foo())
runForever()

Doing a trace, the loop behavior looks normal.

$ nim c myasync.nim
$ strace ./myasync

epoll_create1(EPOLL_CLOEXEC)            = 3
epoll_wait(3, [], 64, 500)              = 0
epoll_wait(3, [], 64, 500)              = 0
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}, AT_EMPTY_PATH) = 0
brk(NULL)                               = 0x55f6ec3af000
brk(0x55f6ec3d0000)                     = 0x55f6ec3d0000
write(1, "foo done\n", 9foo done
)               = 9
epoll_wait(3, [], 64, 500)              = 0
epoll_wait(3, [], 64, 500)              = 0
epoll_wait(3, [], 64, 500)              = 0
epoll_wait(3, [], 64, 500)              = 0

I also tried it on Windows, same result- this does not appear to be OS specific.

I don't quite understand well enough yet where the io-wait happens in asyncdispatch to attempt an explanation. I would guess that either a related commit accidentally fixed the problem (reliably or not), or my test case does not reproduce the CPU saturation issue, just the ValueError-based mitigation.

dom96 commented 2 years ago

I think that's because asyncdispatch.poll's default timeout is 500ms. Try this and you should see the CPU getting cooked:

# myasync.nim
import asyncdispatch

proc foo() {.async.} =
  await sleepAsync(1000)
  echo "foo done"
  # look ma! no timer!

asyncCheck (foo())
while true: asyncdispatch.poll(0)

But maybe this does mean we can get rid of this error since most won't be calling poll with a super low timeout like this. Dunno, would like to hear thoughts :)

capocasa commented 2 years ago

CPU getting cooked

Like an egg.

get rid of this error since most won't be calling poll with a super low timeout

I'm almost convinced that would be a good idea, but I am feeling paranoid about ways we didn't foresee that would cause the CPU saturation without it being obvious what caused it. It doesn't get much harder to debug than that.

I wonder if it would be a nice quick fix to get rid of the error message and then simply display a warning when the runOnce timeout is 10ms or below.

dom96 commented 2 years ago

I'd still be interested to learn how Python handles this case :)

capocasa commented 2 years ago

I think python handles all the cases differently- if you strace the python reproduction code

# Python

import asyncio

async def main():
    print('Hello ...')
    await asyncio.sleep(1)
    print('... World!')
    # await asyncio.start_server(lambda r, w : nil, 'localhost', 9999)

asyncio.ensure_future(main())
asyncio.get_event_loop().run_forever()

you see that epoll_wait blocks until there is something to do, so there is no equivalent of while true: poll(123), it's more like ẁaitUntilThereIsSomethingToDo: nowRunIt()`

If there is no fixed interval to wait for, that interval can't be zero. Interesting solution.

dom96 commented 2 years ago

Ahh, but then Python must be using actual timers unlike Nim for asyncio.sleep. As in timers that are represented by a file descriptor which can be passed to epoll. It would be nice to confirm this is the case, because the main drawback of these is that it then means you are limited by your system's FD limit. If they have a different solution to this problem it would be really nice to learn about it.

Q-Master commented 2 years ago

Python in base events of asyncio runs almost the same as asyncdispatch does:

while True:
    check for cancelled events
    check the timeout to nearest event or 0
    wait for time on selectors
    do all callback jobs

you might look here cpython/Lib/asyncio/base_events.py line 578 for run_forever() and line 1789 for _run_once() which is making the main job The main difference is that the timeout is not fixed, but depends on the nearest event.

capocasa commented 2 years ago

Okay so I caved :smile: and went digging in the python sources. The only epoll_wait call is the the select module, which is a thin wrapper. select is used by selectors, which is a higher level module that, according to the docs, you can use to schedule code to be run later. Sounds relevant :smile:

Here's the interesting part- sure enough, each _add_reader call takes a file descriptor, and indeed, in the _start_serving code, each event calls _add_reader once or more every time it receives an event registration call.

All of this looks to suspiciously like python asyncio uses a lot of file descriptors.

Sure enough, there is confirmation from the wild: A really well known python async framework author is offering a workaround to a power user, and from the naming of it that workaround does what Nim does- use one and poll it.

dom96 commented 2 years ago

Nice, thanks for diving into it! So yeah, sounds like we could follow what Python does. This actually makes me think that file descriptors aren't as big of a concern as I (and others) feared. So perhaps we should simply switch over to the "native" timers and provide a -d:... flag that can revert to the old behaviour.

File descriptor limits is actually what put a blocker on us being able to await spawn/channels (https://github.com/nim-lang/Nim/pull/12372). I think with this knowledge that even Python uses them for sleep, we should seriously consider using file descriptors for awaiting spawn/channels too :)

capocasa commented 2 years ago

Glad it was helpful!!!

Apparently Python is not alone in its approach of not being too worried about file descriptors- Apache seems to need a boost at times, and I remember now an Nginx setup where we had to increase FDs under load. On Windows, your program can apparently do it itself, but Linux requires some configuration.

Would it make sense to talk about how to best drop the error message with the goal of a small patch that could be backported to the 1.6 before moving forward with bigger changes?

I see several options with different merits:

(1) drop the ValueError (2) drop the ValueError, add a warning for timeout=0 (or possible something like <10) (3) drop the ValueError, but keep it for timeout=0 (or possibly something like <10)

I would probably be willing to gamble it's okay to pick any one of these and release it even in a point release- I just can't picture anyone relying on that ValueError for anything.

Q-Master commented 2 years ago

Setting ulimits is not that problem now when switched from select linux call to poll and epoll. I think that switching to python like manner of working with main loop and sleeps is a good idea.

dom96 commented 2 years ago

So thinking about it some more, it could be worth considering this in two different ways:

I think for the latter we may wish to keep the error as it's a good cue for folks that they're missing something.

Q-Master commented 2 years ago

After that in both cases it processes events got from selector line 1829 of base_events.py, then tries to process scheduled callbacks which might occure while previous steps lines 1831-1839 of base_events.py, then calls all the processed callbacks lines 1841-1865 of base_events.py

This all is running in infinite loop with breakage on stopping flag without any sleep. lines 578-599 of base_events.py run_until_complete works the same and calls run_forever from inside, but stops the loop on callback got from finished future.

capocasa commented 2 years ago

keep the error as it's a good cue for folks that they're missing something.

In my humble opinion, the reasons to get rid of the error are overwhelming.

(1) The error doesn't warn, it restricts, forcing user code to deal with the error as a special case, forcing unnecessary code complexity (2) Understanding the error, let alone dealing with it, require knowledge of implementation detail, resulting in a leaky abstraction (3) It was chosen overly broad in the first place, only making any sense for timeout=0 (4) It breaks naming- run_forever means run_forever, not "run unless murky special case A" turns up (5) The error likely triggered when first experimenting with async/await, resulting in a learning blocker, possibly turning evaluators off of Nim for perceived unnecessary complexity

Edit: Well, so much for the Extreme Urgency Of The Whatever Problem Is Nearest To My Nose (tm) :wink:

To approach it from the other end: Is there anything specific that would have to be adressed to be able to drop or change the error message behavior, if we should chose to do so?

I suppose we could also document the curent behavior: Warning: runForever will only run as long as it has at least one task to process or change the wording of the error message to reflect the abstraction, not the implementation- async ran out of tasks to perform, add at least one timer or task if you really really need to run forever. But I'd consider these suboptimal because Python is such a big influx of programmers so dropping the error would mean so much less friction for them.