nick1udwig closed this 1 month ago
When booting a pre-existing indirect node with this binary I got a panic. It's not obvious to me where in the net code there is a hanging peer, but that's what this looks like to me:
```
Tue 09:51 net: connection lost with aficionad.os
Tue 09:51 net:distro:sys closed 1 of 20
diligence.os > m our@fd_manager:distro:sys "GetState"
Tue 09:52 process diligence.os@12665670855602396235:terminal:sys returned without error
Tue 09:52 process 12665670855602396235:terminal:sys has OnExit behavior None
Tue 09:52 kernel: killing process 12665670855602396235:terminal:sys
Tue 09:52 hq:uncentered.os: error: Internal sync failed: Failed to generate embeddings: Timeout
Tue 09:53 net: attempting to connect to aficionad.os directly
Tue 09:53 net: connected to aficionad.os directly
diligence.os > thread 'tokio-runtime-worker' panicked at kinode/src/net/connect.rs:11:30:
net: peer sender was dropped: SendError { .. }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Cleaning up "/home/nick/kinode-homes/diligence.os"...
Done cleaning up "/home/nick/kinode-homes/diligence.os".
```
Will look more into this this afternoon
Pretty sure I got it; confirming then fix coming here.
Fixed in aca7985. Unrelated fixes in a2fa276.
By my count, the kernel / runtime itself uses about 20 file descriptors in the base case. I suggest that instead of configuring kernel, sqlite, kv, and terminal to use `fd_manager`, we simply add a fixed buffer that reserves the runtime FDs. Only `vfs` and `net` dynamically use a bunch of FDs / have the potential to overwhelm a limit at the moment.
Currently `fd_manager` uses a constant to give cushion from what the system gives:

```rust
const DEFAULT_FDS_AS_FRACTION_OF_ULIMIT_PERCENTAGE: u64 = 60;
```

I suggest that this gets bumped up to 90, and that we tack on 30 as a fixed value that gets subtracted from whatever system ulimit we are given. If the system gives us less than 40 (which would leave 9 as the final amount available to vfs/net, not much to work with), we crash with a descriptive error.
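In code, the proposed calculation might look something like the sketch below; only `DEFAULT_FDS_AS_FRACTION_OF_ULIMIT_PERCENTAGE` exists today, and the other constant and function names here are made up for illustration:

```rust
// Sketch of the proposed fd budget; names other than the percentage
// constant are hypothetical.
const FDS_AS_FRACTION_OF_ULIMIT_PERCENTAGE: u64 = 90; // bumped from 60
const RESERVED_RUNTIME_FDS: u64 = 30; // fixed buffer for kernel/runtime fds
const MINIMUM_ULIMIT: u64 = 40;

fn fds_available_to_modules(system_ulimit: u64) -> u64 {
    if system_ulimit < MINIMUM_ULIMIT {
        // crash with a descriptive error instead of failing mysteriously later
        panic!(
            "ulimit -n is {system_ulimit}, but at least {MINIMUM_ULIMIT} file descriptors are required"
        );
    }
    // subtract the fixed runtime reservation, then take 90% as cushion:
    // e.g. a ulimit of 40 yields (40 - 30) * 90 / 100 = 9 fds for vfs/net
    (system_ulimit - RESERVED_RUNTIME_FDS) * FDS_AS_FRACTION_OF_ULIMIT_PERCENTAGE / 100
}
```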
> I suggest that instead of configuring kernel, sqlite, kv, and terminal to use fd_manager, we simply add a fixed buffer that reserves the runtime FDs. Only vfs and net dynamically use a bunch of FDs / have the potential to overwhelm a limit at the moment.
This doesn't necessarily have to be in 0.9.5 yet, but kv and sqlite can definitely use a bunch of FDs (a loop opening 500 sqlite dbs will error) if users have many apps installed that use them. It should be quite simple to cull them, given most open actions are already OpenOrCreate.
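To illustrate the point about OpenOrCreate, a `Cull` handler in kv/sqlite could just drop the least-recently-used handles and rely on the next access to reopen them. This is a hypothetical structure, not the actual kv module's state:

```rust
use std::collections::HashMap;
use std::time::Instant;

struct DbHandle; // stand-in for a real sqlite/kv connection

/// Hypothetical per-module state: open db handles plus last-use times.
struct OpenDbs {
    dbs: HashMap<String, (DbHandle, Instant)>,
}

impl OpenDbs {
    /// Close roughly `fraction` of the least-recently-used handles.
    /// Safe because opens are OpenOrCreate: a culled db is simply
    /// reopened the next time a process touches it.
    fn cull(&mut self, fraction: f64) {
        let to_close = ((self.dbs.len() as f64) * fraction).ceil() as usize;
        let mut by_age: Vec<(String, Instant)> = self
            .dbs
            .iter()
            .map(|(name, (_, last_used))| (name.clone(), *last_used))
            .collect();
        by_age.sort_by_key(|(_, last_used)| *last_used);
        for (name, _) in by_age.into_iter().take(to_close) {
            self.dbs.remove(&name); // dropping the handle closes its fd(s)
        }
    }
}
```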
I'm ok with hardcoding stuff for now to get a functional version out for our users, but definitely would want to have a less-hardcoded solution in the future.
> kv and sqlite can definitely use a bunch of FDs
Wait, really? I thought that both of these used one "db" and parceled userspace requests into separate tables and such... if a new file descriptor is acquired we should def track it.
I see now that multiple DBs get opened. I'll wire these up to fd_manager.
@bitful-pannul need you to check my work here on the refactor to `kv`. `sqlite` coming next.
Sqlite work added.
## Problem
We run out of file descriptors because things like TCP sockets consume them: #556
## Solution
Introduce a new runtime module: `fd_manager`. It is pretty simple: it just tracks the number of allowed `fd`s. Runtime modules that consume `fd`s must notify `fd_manager` when they open or close an `fd`. Then, if the total amount of `fd`s passes some threshold, `fd_manager` sends out a Request to all runtime modules with open `fd`s to `Cull` their old `fd`s, with a suggested fraction to cull. Runtime modules that consume `fd`s must, therefore, also implement logic to respond to a `Cull`.
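For concreteness, here is a rough sketch of what the message protocol could look like; the exact type and variant names are assumptions for illustration, not the shipped API:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical message type for the fd_manager protocol; the real
/// types may be shaped differently.
#[derive(Debug, Serialize, Deserialize)]
enum FdManagerRequest {
    /// a runtime module reports that it opened this many fds
    Open { number_opened: u64 },
    /// a runtime module reports that it closed this many fds
    Close { number_closed: u64 },
    /// fd_manager asks a module to close roughly this fraction of
    /// its oldest fds because the global count passed the threshold
    Cull { fraction_to_cull: f64 },
}
```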
## Testing
TODO
## Docs Update
TODO
## Notes
TODO list:

- `fd_manager` skeleton
- `fd_manager` helper `send_fd_manager_open(number_opened: u64)`
- `fd_manager` helper `send_fd_manager_close(number_closed: u64)`
- `fd_manager`
- `fd_manager`
- `fd_manager`
- `fd_manager`
- `fd_manager`
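A sketch of how a runtime module might use the two helpers named in the list; the actual send plumbing would go through the runtime's message-passing, which is stubbed out here:

```rust
/// Stub: would send FdManagerRequest::Open { number_opened } to fd_manager.
async fn send_fd_manager_open(number_opened: u64) {
    let _ = number_opened;
}

/// Stub: would send FdManagerRequest::Close { number_closed } to fd_manager.
async fn send_fd_manager_close(number_closed: u64) {
    let _ = number_closed;
}

/// A module that opens a file reports the new fd to fd_manager...
async fn open_tracked(path: &str) -> std::io::Result<tokio::fs::File> {
    let file = tokio::fs::File::open(path).await?;
    send_fd_manager_open(1).await;
    Ok(file)
}

/// ...and reports again when it drops the fd.
async fn close_tracked(file: tokio::fs::File) {
    drop(file);
    send_fd_manager_close(1).await;
}
```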