Thanks for the tips @kjnilsson !
I've got a feeling the only correct / practical way to resolve `next` is by resolving `last` and then immediately reading and throwing away the last chunk. That should position `next` as expected. I'm really not sure how it works at all atm.

Also remember that this error has been observed with `last` as well.
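As a rough illustration of that idea, here is an Erlang-flavoured sketch (not osiris code; `locate_last_chunk/1` is a hypothetical helper):

```erlang
%% Hypothetical sketch, not osiris code: resolve `next` by resolving `last`
%% first, then reading and discarding the last chunk so that the file
%% descriptor ends up exactly where the next chunk will start.
resolve_next(Fd) ->
    %% locate_last_chunk/1 is an assumed helper returning the byte offset
    %% and the on-disk size of the last chunk currently in the segment.
    {ok, {LastPos, LastSize}} = locate_last_chunk(Fd),
    {ok, LastPos} = file:position(Fd, {bof, LastPos}),
    %% read the whole last chunk and throw it away; the fd is now at the
    %% position where the next chunk will be appended
    {ok, _Discarded} = file:read(Fd, LastSize),
    {ok, Fd}.
```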
I believe this is what happens:

- `next` is requested; `scan_idx` returns the expected next chunk ID (last existing + 1) and `eof` as its (expected) location.
- `eof` is not idempotent: if you seek to the `eof` position, you may end up in a different place every time, because the file is changing (which it is when publishers are running).
- After `scan_idx` returns, the file descriptor is positioned at the location returned by `scan_idx`. If a new chunk is appended to the segment in between, the `eof` location returned by `scan_idx` is not the same location the file descriptor ends up at. That's why we read an unexpected chunk.
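To make the non-idempotence of `eof` concrete, here is a small standalone Erlang sketch (not osiris code; the file name is made up) showing that seeking to `eof` yields a different position once the file has grown, while a previously captured numeric offset stays stable:

```erlang
%% Standalone sketch (not osiris code): seeking to eof is not idempotent
%% while the file is growing, whereas a previously captured byte offset is.
demo() ->
    {ok, Fd} = file:open("segment.tmp", [read, write, raw, binary]),
    ok = file:truncate(Fd),                    %% start from an empty file
    ok = file:write(Fd, <<"chunk-1">>),
    {ok, Eof1} = file:position(Fd, eof),       %% 7
    %% a publisher appends another chunk in the meantime...
    ok = file:write(Fd, <<"chunk-2">>),
    {ok, Eof2} = file:position(Fd, eof),       %% now 14 - eof has moved
    %% ...but the captured numeric offset still points at the same place,
    %% no matter how much has been appended since it was taken
    {ok, Eof1} = file:position(Fd, {bof, Eof1}),
    ok = file:close(Fd),
    {Eof1, Eof2}.                              %% {7, 14}
```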
I see a few options:

1. Instead of `eof`, `scan_idx` could (probably) return the current size of the segment file as a numeric value. That would make the result idempotent: it is a specific location in the segment file.
2. `read_header0` (which is ultimately the function throwing the error) could ignore the expected ChunkId if the location is `eof` (the client requested `next`, so there is no user expectation about what ID will be read first).

I personally like the second option more, because it solves another problem that I think we have: I don't think ChunkIds are guaranteed to be perfectly consecutive (if they are, ignore this point). `scan_idx` currently assumes that the next ChunkId will be exactly the last one + 1 (last existing chunk start + number of entries in that chunk). Even if `scan_idx` returned the location in the segment as an integer, I believe it could happen that the ChunkId at that location is not the expected ChunkId. That would lead to the same error, although much less often than what we see right now.
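Purely to illustrate the second option (the names and shapes below are made up, not the real `read_header0` code), the ChunkId check could be relaxed only when the offset specification resolved to `eof`:

```erlang
%% Illustrative sketch only - not the real read_header0. The idea: when the
%% caller asked for `next` (location resolved to eof), accept whatever
%% ChunkId is found at the read position instead of erroring on a mismatch.
check_chunk_id(ChunkIdRead, _Expected, eof) ->
    %% `next` was requested: no user expectation about the first ChunkId
    {ok, ChunkIdRead};
check_chunk_id(ChunkId, ChunkId, _Location) ->
    %% any other offset specification: require an exact match
    {ok, ChunkId};
check_chunk_id(ChunkIdRead, Expected, _Location) ->
    {error, {unexpected_chunk_id, ChunkIdRead, Expected}}.
```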
Lastly, I haven't tested this, but I wonder what happens if the last existing chunk in the segment is actually the last one that will ever be in that segment. There could be another race condition where `scan_idx` suggests the `next` chunk will be in the same segment file, just after the currently last chunk, but in reality the next chunk will be written to a new segment file.
Ok so this PR is merged https://github.com/rabbitmq/osiris/pull/50
@acogoluegnes has done some local testing and could not repro this issue
@Gsantomaggio and I have been testing this on a 3-node cluster in the opportunity-99 env and cannot generate any errors, including this one, even with very targeted consumer apps that periodically crash and re-attach using a random offset specification. We're happy that this issue is fixed / substantially improved.
Has https://github.com/rabbitmq/osiris/pull/50 been backported to v3.9.x? If yes, we should add it to the release notes of the version that we expect it to ship in, e.g. https://github.com/rabbitmq/rabbitmq-server/blob/master/release-notes/3.9.6.md

That's a good point. I'm guessing we need a new tag for it and to update 3.9.x accordingly?
Originally posted by @acogoluegnes in https://github.com/rabbitmq/opportunities/issues/99#issuecomment-910000375 (private repository)
I reproduced a couple of issues on a local node (not a cluster this time). It takes many more attempts to reproduce than with a cluster.
I also increased the rate of the publishers:
Consumer:
I got the same `unexpected_chunk_id` error. I also got another one, `invalid_chunk_header`.

There is more detail here: https://github.com/rabbitmq/opportunities/issues/99#issuecomment-910043659