redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

storage: cope more gracefully with exceeding open files ulimit #2320

Open jcsp opened 3 years ago

jcsp commented 3 years ago

Like all programs, Redpanda is subject to system limits on how many file handles it can have open.

Currently we handle this as an IO error and log something like ERROR 2021-09-15 14:30:25,567 [shard 0] storage - Error writing record batch std::__1::__fs::filesystem::filesystem_error (error system:24, filesystem error: open failed: Too many open files [/var/lib/redpanda/data/kafka/topic-etnm/12_18/1-16-v1.log])

What the user sees is undefined -- they're likely to find that certain topics become unresponsive as their log segments roll over and new ones can't be opened. It certainly isn't externally obvious what is going on until the user searches the logs for errors.

JIRA Link: CORE-734
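
For illustration, a minimal sketch of the "warn early" direction (this is not Redpanda code; check_open_file_limit and the fd estimate are hypothetical): query RLIMIT_NOFILE at startup and compare it against expected fd usage, so the operator gets a clear warning up front instead of an EMFILE buried in a write error later.

```cpp
// Hypothetical sketch, not Redpanda's actual startup code.
#include <sys/resource.h>
#include <cstdio>

// 'expected_fds' would be an estimate such as partitions * segments * fds_per_segment.
bool check_open_file_limit(rlim_t expected_fds) {
    rlimit rl{};
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        std::perror("getrlimit(RLIMIT_NOFILE)");
        return false;
    }
    if (expected_fds > rl.rlim_cur) {
        std::fprintf(stderr,
                     "warning: expected fd usage (%llu) exceeds soft nofile limit (%llu); "
                     "raise the limit or reduce partition count\n",
                     static_cast<unsigned long long>(expected_fds),
                     static_cast<unsigned long long>(rl.rlim_cur));
        return false;
    }
    return true;
}

int main() {
    // Example estimate: 1000 partitions, ~2 open segments each, ~3 fds per segment.
    return check_open_file_limit(1000ULL * 2 * 3) ? 0 : 1;
}
```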

jcsp commented 3 years ago

Thoughts on mitigation:

jcsp commented 3 years ago

Noah:

I agree we should behave better. We can also be significantly better OS citizens. We keep an fd open for every segment, and on restarts everything gets rolled, so it can really add up quickly in different situations. I'd really like to support lazy open for older segments. Even for a scan-the-world read we'd only need one or two fds per client.
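
A minimal sketch of that lazy-open idea (hypothetical class, not Redpanda's segment code): a cold segment keeps only its path, opens a file descriptor on first use, and releases it when no reader needs it, so only the head-of-log segment holds an fd permanently.

```cpp
// Hypothetical sketch of lazy open for older segments; not Redpanda code.
#include <fcntl.h>
#include <unistd.h>
#include <stdexcept>
#include <string>

class lazy_segment {
public:
    explicit lazy_segment(std::string path) : _path(std::move(path)) {}
    ~lazy_segment() { close_fd(); }

    // Open the underlying file only when a reader first touches this segment.
    int fd() {
        if (_fd < 0) {
            _fd = ::open(_path.c_str(), O_RDONLY);
            if (_fd < 0) {
                throw std::runtime_error("open failed: " + _path);
            }
        }
        return _fd;
    }

    // Called when no reader needs the segment; frees the fd for other segments.
    void close_fd() {
        if (_fd >= 0) {
            ::close(_fd);
            _fd = -1;
        }
    }

private:
    std::string _path;
    int _fd = -1;
};
```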

ajfabbri commented 3 years ago

Pasting more stack for search-ability:

ERROR 2021-10-17 08:02:25,033 [shard 1] assert - Assert failure: (../../../src/v/storage/log_reader.h:141) '!_iterator.reader' log reader destroyed with live reader
ERROR 2021-10-17 08:02:25,033 [shard 7] storage - Error writing record batch std::__1::__fs::filesystem::filesystem_error (error system:24, filesystem error: open failed: Too many open files [/var/lib/redpanda/data/kafka/first-topic/13_11/0-1-v1.log])
ERROR 2021-10-17 08:02:25,033 [shard 5] storage - Error writing record batch std::__1::__fs::filesystem::filesystem_error (error system:24, filesystem error: open failed: Too many open files [/var/lib/redpanda/data/kafka/first-topic/123_11/0-1-v1.log])
INFO  2021-10-17 08:02:25,033 [shard 2] storage - disk_log_appender.cc:102 - Could not append batch: std::__1::__fs::filesystem::filesystem_error (error system:24, filesystem error: open failed: Too many open files [/var/lib/redpanda/data/kafka/first-topic/56_11/0-1-v1.base_index]) - {offset_idx:{0}, active_segment:no, _bytes_left_in_segment:0, _base_offset:{0}, _last_offset:{-9223372036854775808}, _last_term:{1}, _byte_size:0}
WARN  2021-10-17 08:02:25,033 [shard 7] raft - [group_id:14, {kafka/first-topic/13}] consensus.cc:1591 - Error occurred while appending log entries - std::__1::__fs::filesystem::filesystem_error (error system:24, filesystem error: open failed: Too many open files [/var/lib/redpanda/data/kafka/first-topic/13_11/0-1-v1.log])
INFO  2021-10-17 08:02:25,033 [shard 0] storage - disk_log_appender.cc:102 - Could not append batch: std::__1::__fs::filesystem::filesystem_error (error system:24, filesystem error: open failed: Too many open files [/var/lib/redpanda/data/kafka/first-topic/102_11/0-1-v1.base_index]) - {offset_idx:{0}, active_segment:no, _bytes_left_in_segment:0, _base_offset:{0}, _last_offset:{-9223372036854775808}, _last_term:{1}, _byte_size:0}
ERROR 2021-10-17 08:02:25,033 [shard 1] assert - Backtrace below:
0x3231464 0x29ecf51 0x2a08aa8 0x2b6eedd 0x2fab1b4 0x2fae677 0x2ffe4bd 0x2f5c19f /opt/redpanda/lib/libpthread.so.0+0x9298 /opt/redpanda/lib/libc.so.6+0x1006a2
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, storage::internal::rebuild_compaction_index(model::record_batch_reader, std::__1::__fs::filesystem::path, storage::compaction_config)::$_22, seastar::future<void> seastar::future<storage::compacted_index_writer>::then_impl_nrvo<storage::internal::rebuild_compaction_index(model::record_batch_reader, std::__1::__fs::filesystem::path, storage::compaction_config)::$_22, seastar::future<void> >(storage::internal::rebuild_compaction_index(model::record_batch_reader, std::__1::__fs::filesystem::path, storage::compaction_config)::$_22&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, storage::internal::rebuild_compaction_index(model::record_batch_reader, std::__1::__fs::filesystem::path, storage::compaction_config)::$_22&, seastar::future_state<storage::compacted_index_writer>&&), storage::compacted_index_writer>
INFO  2021-10-17 08:02:25,034 [shard 6] storage - segment.cc:628 - Creating new segment /var/lib/redpanda/data/kafka/first-topic/28_11/0-1-v1.log

ajfabbri commented 3 years ago

I'll grab this for now since I'm looking at #2166

dotnwat commented 3 years ago

Segment index reader file descriptors for all non-head-of-log segments could probably be reclaimed in LRU order based on some limit. For example, we could probably query the nofile limit on bootup to seed this value.
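
A rough sketch of that direction (hypothetical names, not Redpanda code): an LRU-bounded cache of per-segment fds whose budget is seeded from RLIMIT_NOFILE at startup, closing the least-recently-used descriptor when the budget is reached.

```cpp
// Hypothetical sketch: LRU-bounded fd cache seeded from the nofile limit.
#include <sys/resource.h>
#include <fcntl.h>
#include <unistd.h>
#include <list>
#include <string>
#include <unordered_map>

class fd_lru_cache {
public:
    // Seed the budget from RLIMIT_NOFILE, reserving half for sockets,
    // head-of-log segments, etc. (the fraction is an arbitrary placeholder).
    fd_lru_cache() {
        rlimit rl{};
        if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
            _max_open = static_cast<size_t>(rl.rlim_cur / 2);
        }
    }

    // Return an fd for 'path', opening it on demand and evicting the
    // least-recently-used entries to stay under the budget.
    int get(const std::string& path) {
        auto it = _index.find(path);
        if (it != _index.end()) {
            _lru.splice(_lru.begin(), _lru, it->second); // mark most-recently-used
            return it->second->fd;
        }
        while (!_lru.empty() && _lru.size() >= _max_open) {
            ::close(_lru.back().fd);
            _index.erase(_lru.back().path);
            _lru.pop_back();
        }
        int fd = ::open(path.c_str(), O_RDONLY);
        if (fd < 0) {
            return -1; // caller handles the error (e.g. EMFILE despite the budget)
        }
        _lru.push_front(entry{path, fd});
        _index[path] = _lru.begin();
        return fd;
    }

private:
    struct entry {
        std::string path;
        int fd;
    };
    std::list<entry> _lru;
    std::unordered_map<std::string, std::list<entry>::iterator> _index;
    size_t _max_open = 256; // fallback if getrlimit() fails
};
```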

ajfabbri commented 2 years ago

@VadimPlh ran into this doing some testing with 100 topics of 3000 partitions each. First started seeing RPC timeouts:

[shard 3] rpc - transport.cc:236 - Request timeout, correlation id: <id>
and
[group_id:32604, {kafka/random_test_3/520}] vote_stm.cc:262 - unable to replicate configuration as a leader, stepping down

then assert failure, as expected:


2021-12-17T16:07:21.502228977Z stderr F ERROR 2021-12-17 16:07:21,501 [shard 7] assert - Assert failure: (../../../src/v/cluster/persisted_stm.cc:272) 'false' Can't load snapshot from '"/var/lib/redpanda/data/kafka/random_test_85/617_537/tx.snapshot"'. Got error: std::__1::__fs::filesystem::filesystem_error (error system:24, filesystem error: open failed: Too many open files [/var/lib/redpanda/data/kafka/random_test_85/617_537/tx.snapshot])
dotnwat commented 2 years ago

> @VadimPlh ran into this doing some testing with 100 topics of 3000 partitions each. First started seeing RPC timeouts [...] then assert failure, as expected: [...] Too many open files [/var/lib/redpanda/data/kafka/random_test_85/617_537/tx.snapshot]

was nofile limit set high enough?

ajfabbri commented 2 years ago

@dotnwat I should have mentioned that; the likely cause was the nofile limit being too low. I'm just adding the log messages here for documentation.

VadimPlh commented 2 years ago

> was nofile limit set high enough?

The main issue (in my case) is that I hit this on my BYOC cluster in our cloud, so I am not sure I can change the limit myself. BTW, we have a thread about that in the cloud channel.

jcsp commented 2 years ago

Internal ticket for the cloud config is https://app.shortcut.com/vectorized/story/6531/byoc-nodes-have-insufficient-file-handle-limit-for-redpanda-s-max-partition-count

travisdowns commented 2 years ago

Some of the work to remove the 3x files open per segment is being done in #4022 and that specific issue is the subject of #4057.
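
For a rough sense of scale in the test above (replication factor and broker count here are assumptions, not stated in this issue): 100 topics x 3000 partitions = 300,000 partitions; with replication factor 3 that is 900,000 replicas cluster-wide. Spread across, say, 9 brokers, that is 100,000 replicas per broker, and at roughly 3 fds per open segment a broker would want on the order of 300,000 descriptors for storage alone -- far above common nofile defaults like 1024 or even 65536.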