princeton-sns / firecracker-tools


VM creation panics due to "Too many open files" #32

Closed LedgeDash closed 5 years ago

LedgeDash commented 5 years ago

Command run:

[davidhl@node] sudo RUST_BACKTRACE=full ~/Dev/serverless/snapfaas/firerunner/target/release/controller -k ~/Dev/serverless/snapfaas/firerunner/images/vmlinux-tty --runtimefs_dir ~/Dev/serverless/snapfaas/firerunner/images --appfs_dir ~/Dev/serverless/snapfaas/firerunner/images --requests ~/Dev/serverless/snapfaas/firerunner/bins/controller/example_requests.json -f ~/Dev/serverless/snapfaas/firerunner/bins/controller/example_func_configs.yaml > ctrr.log

example_requests.json is generated by generator.py to try to saturate my CloudLab node with our full set of applications. The machine has 164 GB of memory; with 4 GB reserved (not available to VMs), it can run 1250 128 MB VMs concurrently. But VM creation fails after the workload runs for a few seconds. All of the failures point to "Too many open files". Here's the trace:

thread 'fc_vmm' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', src/libcore/result.rs:999:5
stack backtrace:
   0:     0x55814015065b - <unknown>
   1:     0x558140150337 - <unknown>
   2:     0x558140150dd0 - std::panicking::rust_panic_with_hook::hffcefc09751839d1
   3:     0x558140150952 - <unknown>
   4:     0x558140150836 - rust_begin_unwind
   5:     0x55814016c4fd - core::panicking::panic_fmt::h2daf88b2616ca2b2
   6:     0x55814000ecfe - <unknown>
   7:     0x55814001017d - <unknown>
   8:     0x55813ffedeb8 - <unknown>
   9:     0x558140034bac - <unknown>
  10:     0x55814001dafe - <unknown>
  11:     0x5581401543ba - __rust_maybe_catch_panic
  12:     0x55814000cf64 - <unknown>
  13:     0x55814014461f - <unknown>
  14:     0x558140153ae0 - <unknown>
  15:     0x7fb4278ad6ba - <unknown>
  16:     0x7fb4273cd41d - clone
  17:                0x0 - <unknown>

stderr is garbled because multiple threads write to it at the same time, but here are a few other error messages:

thread 'fc_vmm' panicked at 'Cannot create snap event fd: Os { code: 24, kind: Other, message: "Too many open files" }', src/libcore/result.rs:999:5
thread 'main' panicked at 'Start: StartMicrovm(Internal, RegisterBlockDevice(CreateMmioDevice(Os { code: 24, kind: Other, message: "Too many open files" })))', src/libcore/result.rs:999:5
thread 'fc_vmm' panicked at 'Failed to clone write: Os { code: 24, kind: Other, message: "Too many open files" }', src/libcore/result.rs:999:5
stack backtrace:
   0:     0x55814015065b - <unknown>
   1:     0x558140150337 - <unknown>
   2:     0x558140150dd0 - std::panicking::rust_panic_with_hook::hffcefc09751839d1
   3:     0x558140150952 - <unknown>
   4:     0x558140150836 - rust_begin_unwind
   5:     0x55814016c4fd - core::panicking::panic_fmt::h2daf88b2616ca2b2
   6:     0x55814000ecfe - <unknown>
   7:     0x55813fff7602 - <unknown>
   8:     0x55813fffb73e - <unknown>
   9:     0x55813fff8b2a - <unknown>
  10:     0x558140034bed - <unknown>
  11:     0x55814001dafe - <unknown>
  12:     0x5581401543ba - __rust_maybe_catch_panic
  13:     0x55814000cf64 - <unknown>
  14:     0x55814014461f - <unknown>
  15:     0x558140153ae0 - <unknown>
  16:     0x7fb4278ad6ba - <unknown>
  17:     0x7fb4273cd41d - clone
  18:                0x0 - <unknown>
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Sys(EMFILE)', src/libcore/result.rs:999:5

alevy commented 5 years ago

By default, ulimit on most Linux systems caps open file descriptors at something like 1K or 4K. Any chance we're hitting that limit?
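A quick way to check this hypothesis is to print the limits from the shell that launches the controller. The sketch below is only illustrative: it reports the soft and hard nofile limits and, assuming a Linux /proc filesystem is mounted, counts the descriptors the current shell has open. Error code 24 in the traces is EMFILE, the per-process descriptor limit, which is what the soft value enforces.

```shell
#!/bin/sh
# Soft limit: the value actually enforced for this shell and its children.
soft=$(ulimit -Sn)
# Hard limit: the ceiling the soft limit can be raised to without privileges.
hard=$(ulimit -Hn)
echo "nofile soft=$soft hard=$hard"

# Count the descriptors this shell currently has open.
# Assumption: Linux with /proc mounted; skipped otherwise.
if [ -d "/proc/$$/fd" ]; then
    open=$(ls "/proc/$$/fd" | wc -l)
    echo "open fds in pid $$: $open"
fi
```

Replacing `$$` with the PID of the running controller would show how close that process is to its limit while the workload runs.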

LedgeDash commented 5 years ago

That seems to be what's happening here. The soft limit on my machine is 1024. I just increased it to 10K and have seen no panic so far, though the workload is taking a long time to run.

It seems the cause is Linux's open-file limit. I followed this post to increase the number of open files. So the fix is to configure the machine and increase the nofile limit.
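For reference, a minimal sketch of raising the soft limit for the current shell session follows. The `target` value is a hypothetical choice based on the "10K" mentioned above, not a setting taken from the post that was followed. The script clamps the target to the hard limit, since an unprivileged process cannot exceed it, and only ever raises the soft limit.

```shell
#!/bin/sh
# Hypothetical target, chosen to match the roughly 10K mentioned in the thread.
target=10240

# An unprivileged process cannot raise its soft limit past the hard limit,
# so clamp the target to the hard limit first.
hard=$(ulimit -Hn)
if [ "$hard" != "unlimited" ] && [ "$target" -gt "$hard" ]; then
    target=$hard
fi

# Raise the soft limit only if it is currently below the target.
soft=$(ulimit -Sn)
if [ "$soft" != "unlimited" ] && [ "$soft" -lt "$target" ]; then
    ulimit -Sn "$target"
fi

echo "soft nofile limit is now $(ulimit -Sn)"
```

This change lasts only for the shell session and the processes it starts. Whether a process launched through sudo keeps the same limit may depend on the system's PAM configuration, so it is worth verifying on the target machine. A persistent change is typically made through the `nofile` item in /etc/security/limits.conf on systems that use pam_limits.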