nanovms / nanos

A kernel designed to run one and only one application in a virtualized environment
https://nanos.org
Apache License 2.0
2.59k stars 134 forks

issue: assertion w->retval++ < (w->poll_fds->length / sizeof(struct pollfd)) failed at ../unix/poll.c:939 (IP 0xf..) in poll_notify(); halt #1963

Closed rinor closed 3 months ago

rinor commented 10 months ago

I hit this once and am still trying to understand the cause, and whether it is something that needs to be fixed in nanos or just a resource-related issue (RAM, fs) in the testing environment. I'm trying to re-trigger it with the same program, but that may take time (if it happens again at all).

frame trace:
ffffc0001fe77f10:   ffffffff800c4ab2    (poll_internal.constprop.0 + 0000000000000282/0000000000000d4a)
ffffc0001fe77fb0:   ffffffff800e1a23    (syscall_handler + 00000000000002f3/0000000000000636)

loaded klibs:
assertion w->retval++ < (w->poll_fds->length / sizeof(struct pollfd)) failed at /nanos/src/unix/poll.c:939 (IP 0xffffffff800c0c63)  in poll_notify(); halt

Note: I have yet to figure out the cause or how to reproduce it, so any help regarding the logical conditions that may cause this failure is highly appreciated.

francescolavra commented 10 months ago

Basically this check ensures that the return value of a poll() or ppoll() syscall does not exceed the number of file descriptors nfds supplied in the syscall arguments. Looking at the kernel source code I can't see how this assertion could fail, so no clue at the moment on how to try to reproduce the failure.
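For readers unfamiliar with the invariant, here is a minimal user-space sketch of it, not the nanos code. `struct poll_waiter_model`, `model_notify()` and `poll_two_ready_pipes()` are hypothetical names made up for illustration; only `poll()`, `pipe()`, `write()` and `close()` are real POSIX calls. The model refuses to count more ready descriptors than the caller supplied, which is the condition the kernel asserts on.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical user-space model of a poll waiter (NOT the nanos structure). */
struct poll_waiter_model {
    size_t nfds;   /* number of pollfd entries supplied by the caller */
    size_t retval; /* number of entries reported ready so far */
};

/* Record one ready descriptor. Returns 0 on success, or -1 when counting it
 * would push retval past nfds (the case where the kernel assertion fires). */
static int model_notify(struct poll_waiter_model *w)
{
    if (w->retval >= w->nfds)
        return -1;
    w->retval++;
    return 0;
}

/* Poll the read ends of two pipes that both have data pending.
 * Returns the poll() result, or -1 if the setup fails. */
static int poll_two_ready_pipes(void)
{
    int a[2], b[2];
    int rv = -1;

    if (pipe(a) != 0)
        return -1;
    if (pipe(b) != 0) {
        close(a[0]);
        close(a[1]);
        return -1;
    }
    if (write(a[1], "x", 1) == 1 && write(b[1], "x", 1) == 1) {
        struct pollfd fds[2] = {
            { .fd = a[0], .events = POLLIN },
            { .fd = b[0], .events = POLLIN },
        };
        /* Timeout 0: report what is ready right now, without blocking. */
        rv = poll(fds, 2, 0);
    }
    close(a[0]);
    close(a[1]);
    close(b[0]);
    close(b[1]);
    return rv;
}
```

With `nfds` set to 2, a third call to `model_notify()` is rejected, and from user space the value returned by `poll_two_ready_pipes()` can never exceed the two descriptors passed in.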

rinor commented 10 months ago

Closing this for now, since I'm unable to reproduce it. Will still keep an eye on it and reopen in case of new occurrences.

rinor commented 10 months ago

a different frame trace, same assertion


frame trace: 
ffffc0003987fe10:   ffffffff800c277b    (wait_notify + 00000000000001cb/00000000000003c8)
ffffc0003987fe50:   ffffffff800bf83f    (notify_dispatch_with_arg + 000000000000009f/00000000000001a5)
ffffc0003987fec0:   ffffffff800a922d    (efd_write_bh + 00000000000001cd/00000000000001f2)
ffffc0003987ff20:   ffffffff800a3113    (blockq_check_timeout + 0000000000000063/0000000000000289)
ffffc0003987ff80:   ffffffff800d8d76    (write + 0000000000000166/0000000000000247)
ffffc0003987ffb0:   ffffffff800e1a23    (syscall_handler + 00000000000002f3/0000000000000636)

loaded klibs: 
assertion w->retval++ < (w->poll_fds->length / sizeof(struct pollfd)) failed at /nanos/src/unix/poll.c:939 (IP 0xffffffff800c0c63)  in poll_notify(); halt

rinor commented 10 months ago

It looks like these assertions (and a few others) fail when there is not enough free memory for the operations being executed.

francescolavra commented 3 months ago

I identified a bug that can cause the above assertion failure in multi-vCPU instances (even though I haven't replicated the failure). Fixed in https://github.com/nanovms/nanos/pull/2024/commits/37053f6e3618e786de36bf08d7ab8e9d754eaabe

rinor commented 3 months ago

thanks, will test again and report back.

rinor commented 3 months ago

it looks like this is fixed: after 48 hours of testing under the same conditions as before, there are no more poll-related assertion failures.