Closed nvol closed 2 years ago
@nvol Agree on changes to AllOptions (will have a PR for that). But why do you think the test should be done only for the last message in the channel? If you set a record size limit, it will apply for all records, not just the last one. A corruption could occur anywhere in the file, and on recovery/load, the server may try to allocate an excessive amount of memory. This new config was added so that admin could set an upper limit that they know is more than the normal record size, but prevent the server from allocating too much memory.
we have a test dataset with the broken channel (corrupted filestore, will send the archive with data/ and logs/ via email) and when we run it with RecordSizeLimit equal to 20M, it does the recover (as if the channel was repaired but there are still some problems with subscriptions - will talk about it later) and server goes on working
but if the RecordSizeLimit is set to value 2941 (the channel has some big messages with recSize up to 2942 bytes), the server crashes with a message dump
is that how it was intended?
or maybe such messages should be just skipped?
but anyway it's very good that the server didn't crash on the damaged record in the end of file!
@nvol The intent of RecordSizeLimit is from the original poster of the issue that led to adding this new option (https://github.com/nats-io/nats-streaming-server/issues/1255): to prevent a corrupted record to cause the server to allocate memory based on the invalid record size (that is, a corruption that cause the bytes representing the record size to have a value such that it tells the server that the record is, say 2GB big, or something like that). This record size limit is applied to ALL records, not just the last, so that is expected behavior.
If you set the value lower than some valid records, you are going to cause the server to fail recovering this valid record, so again, that is expected. The value should be set to the biggest "valid" record size, plus some margin.
limit param is always 0 when passed to readRecord func it seems that RecordSizeLimit(opts.RecordSizeLimit)(o) is needed in func AllOptions(opts *FileStoreOptions) FileStoreOption {...}
but another problem (when this param passed) is that server crashes when there is a record of greater size found during the recovery process (note: it's not the last message in the channel):
{"level":"fatal","time":"2022-07-22T12:56:41+03:00","message":"STREAM: Failed to start: unable to recover message store for [SomeChannel.Log](file: \"SomeChannel.Log\\msgs.1.idx\"): record size 2942 is greater than limit of 2941 bytes"}