nats-io / nats-streaming-server

NATS Streaming System Server
https://nats.io
Apache License 2.0

strange subscription behavior after the channel recovery #1264

Closed nvol closed 2 years ago

nvol commented 2 years ago

having the corrupted dataset (mentioned in issue #1261), we noticed some weird behavior after the channel recovery process
the channel contains a lot of messages, but after recovery we can see only 24 messages (subscribing from the very first message)
when we try to publish to that channel, we receive the messages we just published only if the subscription is "on air" at that moment
the new messages cannot be retrieved after the subscriber restarts - it still sees only the first 24 messages from the channel
the "start from the last message" subscription does not show the last published message either
if we use the option MaxAge = 144h, we cannot get any messages at all, unless we publish messages while the subscriber is alive (which is ok, because those 24 messages are stale)
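For reference, a minimal sketch of the two subscription start positions described above, using the stan.go client (cluster ID, client ID and channel name are placeholders; MaxAge = 144h is a server-side channel limit and is not part of the snippet):

```go
// Minimal subscriber sketch with the two start positions mentioned above:
// "from the very first message" and "start from the last message".
// Cluster ID, client ID and channel name are placeholders.
package main

import (
	"fmt"
	"time"

	stan "github.com/nats-io/stan.go"
)

func main() {
	sc, err := stan.Connect("test-cluster", "sub-check")
	if err != nil {
		panic(err)
	}
	defer sc.Close()

	handler := func(m *stan.Msg) {
		fmt.Printf("seq=%d data=%s\n", m.Sequence, string(m.Data))
	}

	// Replay the channel from the very first stored message.
	subAll, err := sc.Subscribe("my-channel", handler, stan.DeliverAllAvailable())
	if err != nil {
		panic(err)
	}
	defer subAll.Unsubscribe()

	// Start with the last stored message only.
	subLast, err := sc.Subscribe("my-channel", handler, stan.StartWithLastReceived())
	if err != nil {
		panic(err)
	}
	defer subLast.Unsubscribe()

	time.Sleep(2 * time.Second) // give the deliveries a moment to arrive
}
```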

kozlovic commented 2 years ago

I need to run some tests, but my guess is that the use of RecordSizeLimit actually corrupted your files because you set it to a low value and that truncated the file. If that is the case, I will probably revert adding this new option.

The original poster of the issue that triggered adding this flag had legitimate corruption that caused the server to allocate - at one point - about 1GB of RAM to try to read a record (that was not actually that big). The memory in use was then much lower, but until the memory was reclaimed, it looked like the process was using too much memory.

nvol commented 2 years ago

we set up several experiments with that dataset and found a strange thing...
the sequence numbering in that channel resets many times!
it goes from 1 to 24 and then resets to 1 again, and then it resets again and again throughout the whole dat-file
finally it stops at seq 61
and that very last 1-61 sequence range has corrupted offsets in the idx-file!
we still don't know the reason, but we would really like to reproduce both of these situations!

kozlovic commented 2 years ago

Again, I think that it is the recordSizeLimit that messed up the state, since the server tried to truncate the file (based on the log files that you sent, with the 2491 and 20M limits). I need to run some tests, but it looks like it was a really bad idea to add the record_size_limit config. This needs more investigation, but if that turns out to be the reason, then I will pull it (good thing I did not release with this new flag yet).

nvol commented 2 years ago

[screenshot of the corrupted idx file]
the problem with the huge memory allocation could be avoided if we checked that the offsets in the idx-file are always increasing
on this screenshot we can see that the idx-file is corrupted and needs to be rebuilt! but we go on, try to read the recSize at the corrupted offset, and get a huge recSize as the result
by the way, the dat-file is normal - all the packets with messages are in their places (not taking into account the seq resetting mentioned above)
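A minimal sketch of the monotonicity check being suggested (this is not the server's actual code; the idx entry layout assumed here - an 8-byte offset followed by a 4-byte record size, little-endian - should be verified against stores/filestore.go):

```go
// Hypothetical validator: walk the idx entries and reject the file as soon
// as an offset is not strictly greater than the previous one. The entry
// layout (uint64 offset + uint32 record size, little-endian) is an
// assumption for this sketch.
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

func validateIdx(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	var prevOffset uint64
	entry := make([]byte, 12) // 8-byte offset + 4-byte record size (assumed layout)
	for i := 0; ; i++ {
		if _, err := io.ReadFull(f, entry); err != nil {
			if err == io.EOF {
				return nil // clean end of file
			}
			return fmt.Errorf("entry %d: truncated idx file: %v", i, err)
		}
		offset := binary.LittleEndian.Uint64(entry[0:8]) // entry[8:12] holds the record size (unused here)
		if i > 0 && offset <= prevOffset {
			return fmt.Errorf("entry %d: offset 0x%08x is not increasing (previous 0x%08x): idx needs a rebuild", i, offset, prevOffset)
		}
		prevOffset = offset
	}
}

func main() {
	if err := validateIdx("msgs.1.idx"); err != nil {
		fmt.Println("idx check failed:", err)
	}
}
```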

kozlovic commented 2 years ago

If you know that the idx file is corrupted, but the data is fine, then delete the idx file, and it will be reconstructed by the server based on the dat file (it is slower since the server has to read each record in the dat file, but that will re-create the idx file).

kozlovic commented 2 years ago

Ok, the .dat file likely has many messages, but as you have seen, the sequence reset many times. I suspect this happened when you were removing the server.dat, because after a restart without the server.dat, the channel is not recovered (as you know), so when new messages are published, they are appended to the existing file but with a new sequence. I can reproduce with this example:

So at this point, I think your data file is bad and I don't see a way to fix that. To prevent that from happening in the future, we would need a resolution for #1263.
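A rough sketch of the scenario described above (publish, restart without server.dat, publish again), assuming the embedded server API (server.RunServerWithOpts) and the stan.go client; this is only an illustration of the described steps, not the original reproduction:

```go
// Rough sketch: run an embedded file-store server, publish to a channel,
// then restart after removing server.dat and publish again. The second run
// starts the channel sequence from 1 while appending to the same msgs.*.dat.
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"github.com/nats-io/nats-streaming-server/server"
	"github.com/nats-io/nats-streaming-server/stores"
	stan "github.com/nats-io/stan.go"
)

const dir = "datastore"

func runAndPublish(label string) {
	opts := server.GetDefaultOptions()
	opts.StoreType = stores.TypeFile
	opts.FilestoreDir = dir
	s, err := server.RunServerWithOpts(opts, nil)
	if err != nil {
		panic(err)
	}
	defer s.Shutdown()

	sc, err := stan.Connect("test-cluster", "pub-"+label)
	if err != nil {
		panic(err)
	}
	defer sc.Close()
	for i := 0; i < 3; i++ {
		if err := sc.Publish("foo", []byte(fmt.Sprintf("%s-%d", label, i))); err != nil {
			panic(err)
		}
	}
}

func main() {
	runAndPublish("first")
	// Simulate the situation described above: remove server.dat so the
	// channel is not recovered on the next start.
	os.Remove(filepath.Join(dir, "server.dat"))
	runAndPublish("second") // channel "foo" sequence restarts at 1
}
```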

nvol commented 2 years ago

i'm totally sure that the seq resetting wasn't caused by removing the server.dat file
besides that corrupted dataset, i have another one (and neither of them was created artificially), and there are also a lot of seq jumps... it seems that those jumps occur on hard machine reboots (it's just my guess, but this hypothesis is quite plausible!)
a lot of jumps are observed around the day boundary

some seq-jump examples:

got header at offset 0x002c65c1: recSz=0x000002ab seq:830 dt:2022-06-10_21:51:14
got header at offset 0x002c6874: recSz=0x0000029d seq:1 dt:2022-06-11_05:37:09

got header at offset 0x00393083: recSz=0x0000022e seq:564 dt:2022-06-12_05:31:17
got header at offset 0x003932b9: recSz=0x0000029d seq:1 dt:2022-06-12_05:35:37

got header at offset 0x003ac855: recSz=0x0000022d seq:156 dt:2022-06-13_05:30:52
got header at offset 0x003aca8a: recSz=0x0000029e seq:1 dt:2022-06-13_05:36:05

got header at offset 0x003c7f4d: recSz=0x0000022e seq:162 dt:2022-06-14_05:30:50
got header at offset 0x003c8183: recSz=0x0000029e seq:1 dt:2022-06-14_05:36:03

got header at offset 0x005c61f9: recSz=0x000002ab seq:1275 dt:2022-06-14_21:50:49
got header at offset 0x005c64ac: recSz=0x0000029e seq:1 dt:2022-06-15_05:36:51
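For context, a rough sketch of the kind of scan that produces dumps like the ones above; the record layout assumed here (4-byte little-endian size, 4-byte CRC, then a marshaled pb.MsgProto) is a guess and should be verified against stores/filestore.go before relying on it:

```go
// Rough dat-file scanner in the spirit of the dumps above. The on-disk
// record layout used here is an assumption for this sketch.
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
	"time"

	"github.com/nats-io/stan.go/pb"
)

func main() {
	f, err := os.Open("msgs.1.dat")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var offset int64
	var prevSeq uint64
	hdr := make([]byte, 8)
	for {
		if _, err := io.ReadFull(f, hdr); err != nil {
			break // EOF or truncated record: stop scanning
		}
		recSize := binary.LittleEndian.Uint32(hdr[0:4]) // CRC in hdr[4:8] ignored here
		if recSize == 0 || recSize > 64*1024*1024 {
			// A sanity cap avoids the huge allocation discussed earlier.
			fmt.Printf("offset 0x%08x: suspicious recSz=0x%08x, stopping\n", offset, recSize)
			break
		}
		buf := make([]byte, recSize)
		if _, err := io.ReadFull(f, buf); err != nil {
			break
		}
		var msg pb.MsgProto
		if err := msg.Unmarshal(buf); err != nil {
			fmt.Printf("offset 0x%08x: cannot decode record: %v\n", offset, err)
			break
		}
		reset := ""
		if prevSeq != 0 && msg.Sequence <= prevSeq {
			reset = "  <-- sequence reset"
		}
		fmt.Printf("got header at offset 0x%08x: recSz=0x%08x seq:%d dt:%s%s\n",
			offset, recSize, msg.Sequence,
			time.Unix(0, msg.Timestamp).Format("2006-01-02_15:04:05"), reset)
		prevSeq = msg.Sequence
		offset += int64(8 + recSize)
	}
}
```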

nvol commented 2 years ago

we reproduced the situation where seq starts from 1 on every server restart (on Windows):

maybe it would be better (at least on Windows) not to distinguish the so-called similar channels (like 'beard' and 'BEARD') from each other?
for example, all of them could be pre-converted to their lower-case variants
i'm almost sure that in most cases where such situations occur, it is nothing but a human factor!

the problem in #1265 is nearly the same
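A tiny illustration of the lower-casing idea (normalizeChannel is a hypothetical helper for this sketch, not something the server does today):

```go
// Hypothetical normalization: treat channel names that differ only by case
// as the same channel, so 'beard' and 'BEARD' map to one on-disk directory.
package main

import (
	"fmt"
	"strings"
)

// normalizeChannel is a hypothetical helper for this sketch.
func normalizeChannel(name string) string {
	return strings.ToLower(name)
}

func main() {
	for _, name := range []string{"beard", "BEARD", "Beard"} {
		fmt.Printf("%-6s -> %s\n", name, normalizeChannel(name))
	}
}
```

On a case-insensitive filesystem both spellings end up in the same on-disk directory while the server still treats them as separate channels, which appears to be the mismatch discussed in #1265.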

nvol commented 2 years ago

moreover, now i think that the recovery process mentioned in the title of the issue does not actually cause the problem (the sequence jumps)
i think we can leave just one general issue about Windows filepath naming, similar channel names and sequence jumps

kozlovic commented 2 years ago

If you have a file that shows the seq resets, you can add a debug statement to see the recovered message subject (I do send to differently spelled channels, as you suggest in the other issue), then stop the server and remove the msgs.1.idx file (to cause the recovery from the .dat file), and I can see that the messages have different subjects. At this line: https://github.com/nats-io/nats-streaming-server/blob/2f0c58210356bff4640890b630596af1a62d6503/stores/filestore.go#L2429, add something like:

fmt.Printf("msg: seq=%v subject=%s\n", msg.Sequence, msg.Subject)
nvol commented 2 years ago

not exactly, i'm talking about the case where the directory is named "boo" but all the messages in the dat-file belong to channel "Boo" -- the seq resets for them on every server restart

kozlovic commented 2 years ago

Not at every restart, every time a message is produced on the different spelling. My point was that on restart, if you remove the ".idx" file, and you print the info about the message, you should see that they have different subjects.

But I don't think this is the case with the one that had the idx file corruption and use of record_size_limit=2491, is it?

nvol commented 2 years ago

Not at every restart, every time a message is produced on the different spelling. My point was that on restart, if you remove the ".idx" file, and you print the info about the message, you should see that they have different subjects.

yes, i see - when i published differently-spelled messages, i saw that each individual channel has its own seq numbering, and it looks fine until you try to subscribe to these channels (with the 'all' flag, for example) -- the subscription behaves really strangely in those cases

But I don't think this is the case with the one that had the idx file corruption and use of record_size_limit=2491, is it?

the case where we got the corruption is related to the idx-file checking on server start
the corruption occurred when the offsets (into the dat-file) in the idx-file jumped back to addr 0x00000004 (ok, let's call it 'offset resetting')
please take a look at the screenshot of the idx-file above (https://github.com/nats-io/nats-streaming-server/issues/1264#issuecomment-1195716220) - i really have no idea why this situation took place!

nvol commented 2 years ago

in every idx packet we have the offset into the dat-file and the record size
if we could be sure that ONE dat-file contains information about ONLY ONE channel, we could calculate and check every next offset, but we cannot do this now on Windows, because one dat-file can contain information about several channels simultaneously... and it's kind of awkward...

so we check the offsets by extracting the header from the dat-file at the given address, obtain the rec size (as we think), and then try to allocate memory of that size, having no suspicion that it is corrupted
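Whatever the size comes from (an idx entry or a dat-file header read at its offset), a cheap guard is to bound it by the dat-file's actual length before allocating; a sketch under the same assumed 12-byte idx entry layout as above:

```go
// Hypothetical guard: never allocate more than the dat-file can contain.
// The idx entry layout (uint64 offset + uint32 size, little-endian) is an
// assumption for this sketch.
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// checkEntry returns an error instead of letting a corrupted recSize drive
// a huge allocation.
func checkEntry(entry []byte, datSize int64) error {
	offset := binary.LittleEndian.Uint64(entry[0:8])
	recSize := binary.LittleEndian.Uint32(entry[8:12])
	if int64(offset)+int64(recSize) > datSize {
		return fmt.Errorf("offset 0x%08x + recSz 0x%08x exceeds dat-file size %d: idx corrupted", offset, recSize, datSize)
	}
	return nil
}

func main() {
	fi, err := os.Stat("msgs.1.dat")
	if err != nil {
		panic(err)
	}
	// Illustrate the pathological case: the offset jumped back to 0x00000004
	// and the size read from there is garbage.
	bad := make([]byte, 12)
	binary.LittleEndian.PutUint64(bad[0:8], 0x00000004)
	binary.LittleEndian.PutUint32(bad[8:12], 0x40000000) // garbage size read at a wrong offset
	if err := checkEntry(bad, fi.Size()); err != nil {
		fmt.Println(err)
	}
}
```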

kozlovic commented 2 years ago

Closing this issue since the root cause was identified as issue #1265