Closed by rinor 3 months ago
Basically, this check ensures that the return value of a poll() or ppoll() syscall does not exceed nfds, the number of file descriptors supplied in the syscall arguments.
Looking at the kernel source code, I can't see how this assertion could fail, so at the moment I have no clue how to reproduce the failure.
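For anyone who wants to watch the same invariant from user space, here is a minimal sketch (not taken from the nanos sources). The helper name `poll_ready_count_on_pipe` is made up for illustration. It polls both ends of a fresh pipe and returns whatever poll() reports, which by contract can never be larger than the nfds that was passed in.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: polls both ends of a fresh pipe and returns
 * poll()'s result, or -1 if the setup fails. */
int poll_ready_count_on_pipe(void)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;

    /* Put one byte in the pipe so the read end becomes readable. */
    if (write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct pollfd fds[2] = {
        { .fd = p[0], .events = POLLIN },
        { .fd = p[1], .events = POLLOUT },
    };
    nfds_t nfds = sizeof(fds) / sizeof(fds[0]);

    /* Zero timeout: report what is ready right now and return. */
    int ready = poll(fds, nfds, 0);

    close(p[0]);
    close(p[1]);

    /* The invariant the kernel assertion guards: ready <= nfds. */
    return ready;
}
```

A test program that hammers a check like this from several threads might be one way to try to provoke the failure, though that is only a guess about what triggers it.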
Closing this for now, since I'm unable to reproduce it. Will still keep an eye on it and reopen in case of new occurrences.
A different frame trace, same assertion:
frame trace:
ffffc0003987fe10: ffffffff800c277b (wait_notify + 00000000000001cb/00000000000003c8)
ffffc0003987fe50: ffffffff800bf83f (notify_dispatch_with_arg + 000000000000009f/00000000000001a5)
ffffc0003987fec0: ffffffff800a922d (efd_write_bh + 00000000000001cd/00000000000001f2)
ffffc0003987ff20: ffffffff800a3113 (blockq_check_timeout + 0000000000000063/0000000000000289)
ffffc0003987ff80: ffffffff800d8d76 (write + 0000000000000166/0000000000000247)
ffffc0003987ffb0: ffffffff800e1a23 (syscall_handler + 00000000000002f3/0000000000000636)
loaded klibs:
assertion w->retval++ < (w->poll_fds->length / sizeof(struct pollfd)) failed at /nanos/src/unix/poll.c:939 (IP 0xffffffff800c0c63) in poll_notify(); halt
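To make the assertion message easier to read, here is a simplified model of the expression it prints. This is a sketch only: `struct waiter_model` and `record_ready_fd` are made-up stand-ins, not the real nanos types or functions. The only part taken from the log is the shape of the check, a post-incremented ready counter compared against the pollfd buffer length divided by `sizeof(struct pollfd)`.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the waiter state named in the assertion. */
struct waiter_model {
    size_t poll_fds_bytes; /* models w->poll_fds->length (bytes) */
    size_t retval;         /* models w->retval (ready fds so far) */
};

/* Counts one more ready descriptor and reports whether the check from
 * the log would still hold: the value of retval before the increment
 * must be below the number of pollfd entries. */
bool record_ready_fd(struct waiter_model *w)
{
    size_t nfds = w->poll_fds_bytes / sizeof(struct pollfd);
    return w->retval++ < nfds;
}
```

With two pollfd entries, the first two calls return true and a third returns false. So in this model the check only trips if more ready notifications are counted than there are entries in the set, for example if one descriptor were counted twice.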
It looks like those assertions (and some others) happen when there is not enough free memory for the actions being executed.
I identified a bug that can cause the above assertion failure on multi-vCPU instances (even though I haven't reproduced the failure myself). Fixed in https://github.com/nanovms/nanos/pull/2024/commits/37053f6e3618e786de36bf08d7ab8e9d754eaabe
Thanks, will test again and report back.
It looks like this is fixed: after 48 hours of testing under the same conditions as before, there are no more poll-related assertions.
I had this one-time occurrence and am still trying to understand the cause, and whether this is something that needs to be fixed in nanos or just a resource-related issue (RAM, filesystem) in the testing environment. I'm trying to re-trigger it with the same program, but it may take time (if it happens again at all).
Note: I have yet to figure out the cause and how to reproduce it, so any help regarding the logical conditions that may cause this failure is highly appreciated.