Closed by fazalmajid 3 years ago
I think this makes sense, great catch. Do you want to open a pull request to submit the change?
Sorry for the slow response, I checked in the PR https://github.com/nsqio/go-diskqueue/pull/31 with the fix
thanks, let's continue the discussion there
We found a data loss/corruption issue in nsqd 0.3.7, before diskqueue was split into its own module, but AFAIK it is still present in diskqueue.
If a diskqueue's metadata is out of sync with the data file (e.g. nsqd was terminated abruptly before it had a chance to sync the metadata), then on restart the diskqueue starts reading messages from d.readPos and writing at d.writePos as recorded in the metadata, and either or both may not be at the end of the file. At some point incoming messages overwrite messages that are still being read: the read buffer holds part of an old message and then reads the new one. This race corrupts the stream (the next message size is read from what is essentially random data inside the new message), which causes lost messages and the file being marked as bad.
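To make the failure mode concrete, here is a minimal, deterministic sketch of what happens when a writer resumes at a stale position inside a length-prefixed file. It is not nsqd or go-diskqueue code: the helpers `frame`, `readAt` and `simulate` are hypothetical, the 4-byte big-endian size prefix is an assumption about the on-disk framing, and the real bug involves a buffered reader racing the writer rather than this simplified sequential replay.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// frame returns msg prefixed with its length as a 4-byte big-endian integer
// (assumed framing, for illustration only).
func frame(msg []byte) []byte {
	out := make([]byte, 4+len(msg))
	binary.BigEndian.PutUint32(out, uint32(len(msg)))
	copy(out[4:], msg)
	return out
}

// readAt decodes one frame starting at pos. ok is false when the size prefix
// is implausible, which is roughly when a queue would mark the file as bad.
func readAt(file []byte, pos int) (body []byte, next int, ok bool) {
	if pos+4 > len(file) {
		return nil, pos, false
	}
	size := int(binary.BigEndian.Uint32(file[pos:]))
	if size <= 0 || pos+4+size > len(file) {
		return nil, pos, false
	}
	return file[pos+4 : pos+4+size], pos + 4 + size, true
}

// simulate builds a file with three messages, then overwrites it from a stale
// write position (just after the first message) and replays the reader.
func simulate() (first, second string, thirdOK bool) {
	var file []byte
	for _, m := range []string{"first-msg", "second-msg", "third-message"} {
		file = append(file, frame([]byte(m))...)
	}

	// Stale metadata: writePos points after message 1, not at end of file.
	staleWritePos := 4 + len("first-msg")
	copy(file[staleWritePos:], frame([]byte("new")))

	body, pos, _ := readAt(file, 0)
	first = string(body)
	body, pos, _ = readAt(file, pos)
	second = string(body)
	// The next size prefix is now read from leftover bytes of "second-msg".
	_, _, thirdOK = readAt(file, pos)
	return first, second, thirdOK
}

func main() {
	first, second, thirdOK := simulate()
	fmt.Println(first, second, thirdOK)
}
```

In this toy run the first message survives, the second is replaced by the new write, and the third read fails because its size prefix is decoded from the middle of the old second message, so both "second-msg" and "third-message" are lost.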
Our fix was to check the file size against the metadata: if d.writePos is less than the file size, we force rotation to a new file, while still allowing messages to be read to the end of the old file (possibly causing duplicate messages, but that's better than losing messages, and NSQ makes no guarantee of once-only delivery anyway):
This diff is against the 0.3.7 diskqueue.go, but should be readily transposable to go-diskqueue.
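For readers without the diff at hand, the check described above can be sketched roughly as follows. This is not the actual patch: `needsRotation` and `checkWriteFile` are hypothetical names, and the sketch only covers the size comparison, not how diskqueue itself advances to the next file number or lets the reader drain the old file.

```go
package main

import (
	"fmt"
	"os"
)

// needsRotation reports whether the persisted write position lags behind the
// actual size of the data file, i.e. the metadata is stale and writing at
// metaWritePos would overwrite data already on disk.
func needsRotation(dataFileSize, metaWritePos int64) bool {
	return metaWritePos < dataFileSize
}

// checkWriteFile stats the current write file and applies needsRotation.
// A missing file is treated as "nothing to protect", so no rotation.
func checkWriteFile(path string, metaWritePos int64) (bool, error) {
	fi, err := os.Stat(path)
	if os.IsNotExist(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return needsRotation(fi.Size(), metaWritePos), nil
}

func main() {
	// Simulate a 100-byte data file whose metadata only recorded 60 bytes.
	f, err := os.CreateTemp("", "diskqueue-*.dat")
	if err != nil {
		fmt.Println("create temp file:", err)
		return
	}
	defer os.Remove(f.Name())
	if _, err := f.Write(make([]byte, 100)); err != nil {
		fmt.Println("write:", err)
		return
	}
	f.Close()

	rotate, err := checkWriteFile(f.Name(), 60)
	fmt.Println("rotate:", rotate, "err:", err)
}
```

When the check returns true, the writer would move on to a fresh file instead of seeking to the stale position, and the reader keeps going to the real end of the old file, which is where the possible duplicates come from.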