add `fd_manager` runtime module

nick1udwig commented 1 month ago

Problem

We run out of file descriptors because things like TCP sockets consume them: #556

Solution

Introduce a new runtime module: fd_manager. It is pretty simple: it just tracks the number allowed fds. Runtime modules that consume fds must notify fd_manager when they open or close an fd. Then, if the total amount of fds passes some threshold, fd_manager sends out a Request to all runtime modules with open fds to Cull their old fds with a suggested fraction to cull. Runtime modules that consume fds must, therefore, also implement logic to respond to a Cull.

Testing

TODO

Docs Update

TODO

Notes

TODO list:

[x] Write fd_manager skeleton
[x] Write fd_manager helper send_fd_manager_open(number_opened: u64)
[x] Write fd_manager helper send_fd_manager_close(number_closed: u64)
[x] Update vfs to talk to fd_manager
[x] Update sqlite to talk to fd_manager
[x] Update kv to talk to fd_manager
[x] Update net to talk to fd_manager
[ ] Look for any other consumers of fds and update them to talk to fd_manager
[ ] Test

nick1udwig commented 1 month ago

When booting a pre-existing indirect node with this binary I got a segfault. Not obvious to me where in the net code there is a hanging peer, but that's what this looks like to me:

Tue 09:51 net: connection lost with aficionad.os
Tue 09:51 net:distro:sys closed 1 of 20
diligence.os > m our@fd_manager:distro:sys "GetState"
Tue 09:52 process diligence.os@12665670855602396235:terminal:sys returned without error
Tue 09:52 process 12665670855602396235:terminal:sys has OnExit behavior None
Tue 09:52 kernel: killing process 12665670855602396235:terminal:sys
Tue 09:52 hq:uncentered.os: error: Internal sync failed: Failed to generate embeddings: Timeout
Tue 09:53 net: attempting to connect to aficionad.os directly
Tue 09:53 net: connected to aficionad.os directly
diligence.os > thread 'tokio-runtime-worker' panicked at kinode/src/net/connect.rs:11:30:
                                                                                         net: peer sender was dropped: SendError { .. }
                                    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
              Cleaning up "/home/nick/kinode-homes/diligence.os"...
Done cleaning up "/home/nick/kinode-homes/diligence.os".

Will look more into this this afternoon

nick1udwig commented 1 month ago

Pretty sure I got it; confirming then fix coming here.

nick1udwig commented 1 month ago

Fixed in aca7985. Unrelated fixes in a2fa276

dr-frmr commented 1 month ago

By my count, the kernel / runtime itself uses about 20 file descriptors in the base case. I suggest that instead of configuring kernel, sqlite, kv, and terminal to use fd_manager, we simply add a fixed buffer that reserves the runtime FDs. Only vfs and net dynamically use a bunch of FDs / have the potential to overwhelm a limit at the moment.

Currently fd_manger uses a constant to give cushion from what the system gives:

const DEFAULT_FDS_AS_FRACTION_OF_ULIMIT_PERCENTAGE: u64 = 60;

I suggest that this gets bumped up to 90 and we tack on 30 as a fixed value that gets subtracted from whatever system ulimit we are given. If we are given less than 40 from system (which would give 9 as the final amount available to vfs/net.. not much to work with..) we crash with a descriptive error.

bitful-pannul commented 1 month ago

I suggest that instead of configuring kernel, sqlite, kv, and terminal to use fd_manager, we simply add a fixed buffer that reserves the runtime FDs. Only vfs and net dynamically use a bunch of FDs / have the potential to overwhelm a limit at the moment.

This doesn't necessarily have to be in 0.9.5 yet, but kv and sqlite can definitely use a bunch of FDs (loop open 500 sqlite dbs will error), if users have many apps installed that use them, should be quite simple to cull them given most open actions are already OpenOrCreate.

nick1udwig commented 1 month ago

I'm ok with hardcoding stuff for now to get a functional version out for our users, but definitely would want to have a less-hardcoded solution in the future.

dr-frmr commented 1 month ago

kv and sqlite FDs can definitely use a bunch of FDs

Wait really? I thought that both of these used one "db" and parceled userspace requests into separate tables and such.. if a new file descriptor is acquired we should def track

dr-frmr commented 1 month ago

I see now that multiple DBs get opened. I'll wire these up to fd_manager.

dr-frmr commented 1 month ago

@bitful-pannul need to you check my work here on refactor to kv. sqlite coming next

dr-frmr commented 1 month ago

Sqlite work added.

kinode-dao / kinode