nats-io / nats-streaming-server

NATS Streaming System Server
https://nats.io
Apache License 2.0

Startup fails if the store is corrupted #557

Closed dominic-pockit closed 6 years ago

dominic-pockit commented 6 years ago

We use a NATS Streaming server instance in our production environment, hosted on an Azure VM, to receive batches of logs from the rest of the platform. Over the weekend, Microsoft had some hardware issues at one of their data centers, which suddenly took our VM offline (we aren't yet running it in a high-availability configuration).

We also run the server with file sync disabled to improve performance, so writes to the store are deferred and done in batches. I believe the hardware failure mentioned above happened to coincide with a batch of messages being written to disk, and as a result the store was corrupted.

When the VM came online again, the service failed to start because of an "Unexpected EOF" error when recovering state. Personally, I feel the service should be able to start, regardless of whether it is able to recover the state from the store.

Is there a configuration option I'm missing here? Ideally I would have liked it to take a backup of the store it was unable to read, and create a new store instead (naturally, with appropriate logging).
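The behavior asked for here (set the unreadable store aside, then start with an empty one) can be sketched in a few lines of Go. This is only an illustration of the idea, not something the server provides: `quarantineStore`, the `.corrupt-<timestamp>` suffix, and the `msgs.dat` file name are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// quarantineStore renames an unreadable store directory aside and creates an
// empty directory in its place. Hypothetical helper, not part of the server.
func quarantineStore(dir string, now time.Time) (string, error) {
	backup := fmt.Sprintf("%s.corrupt-%s", dir, now.UTC().Format("20060102T150405Z"))
	if err := os.Rename(dir, backup); err != nil {
		return "", fmt.Errorf("backing up store: %w", err)
	}
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return backup, fmt.Errorf("creating fresh store: %w", err)
	}
	return backup, nil
}

// run builds a throwaway store, quarantines it, and checks the result.
func run() error {
	root, err := os.MkdirTemp("", "store-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(root)

	dir := filepath.Join(root, "datastore")
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	// Stand-in for a store file that could not be recovered (made-up name).
	if err := os.WriteFile(filepath.Join(dir, "msgs.dat"), []byte("truncated"), 0o644); err != nil {
		return err
	}

	backup, err := quarantineStore(dir, time.Now())
	if err != nil {
		return err
	}
	// The old file must survive in the backup, and the new store must be empty.
	if _, err := os.Stat(filepath.Join(backup, "msgs.dat")); err != nil {
		return err
	}
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	if len(entries) != 0 {
		return fmt.Errorf("fresh store is not empty")
	}
	fmt.Println("old store kept at", filepath.Base(backup))
	return nil
}

func main() {
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Renaming rather than copying keeps the operation cheap and leaves the damaged files untouched for later inspection, at the cost of requiring the backup to live on the same filesystem.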

kozlovic commented 6 years ago

@dominic-pockit Apologies for the unpleasant experience. We do need a way to recover from corrupted stores. I am not sure which way to go: a separate tool, or a flag to force the server to start.

Different types of corrupted records have different consequences for the server. For instance, a subscription or connection record could simply be skipped. A message record cannot, since we need to preserve message sequence ordering. That is, the store could not be "compacted" by removing the record; the record would somehow have to be marked as invalid and skipped, and I am not sure how that would be presented to the user.
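A minimal Go sketch of that distinction, under stated assumptions: the `Record`, `Msg`, and `recoverRecords` names are hypothetical, each record is assumed to carry a CRC-32 of its payload, and the record kind is assumed to remain readable even when the payload is damaged. None of this reflects the server's actual file format.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// Kind identifies what a stored record describes (hypothetical).
type Kind int

const (
	KindConn Kind = iota
	KindSub
	KindMsg
)

// Record is a hypothetical stored record protected by a CRC-32 of its payload.
type Record struct {
	Kind    Kind
	Payload []byte
	CRC     uint32
}

func newRecord(k Kind, p []byte) Record {
	return Record{Kind: k, Payload: p, CRC: crc32.ChecksumIEEE(p)}
}

// intact reports whether the payload still matches its stored checksum.
func (r Record) intact() bool { return crc32.ChecksumIEEE(r.Payload) == r.CRC }

// Msg is a recovered message slot. Invalid marks a slot whose payload was lost.
type Msg struct {
	Seq     uint64
	Data    []byte
	Invalid bool
}

// recoverRecords walks the records in order. Corrupt connection and
// subscription records are dropped. Corrupt message records become
// placeholders, so later messages keep their sequence numbers.
func recoverRecords(recs []Record, firstSeq uint64) (msgs []Msg, dropped int) {
	seq := firstSeq
	for _, r := range recs {
		switch {
		case r.Kind == KindMsg && r.intact():
			msgs = append(msgs, Msg{Seq: seq, Data: r.Payload})
			seq++
		case r.Kind == KindMsg:
			msgs = append(msgs, Msg{Seq: seq, Invalid: true})
			seq++
		case !r.intact():
			dropped++
		}
	}
	return msgs, dropped
}

func main() {
	recs := []Record{
		newRecord(KindSub, []byte("sub-1")),
		newRecord(KindMsg, []byte("m1")),
		newRecord(KindMsg, []byte("m2")),
		newRecord(KindMsg, []byte("m3")),
	}
	recs[0].CRC ^= 1 // damage the subscription record
	recs[2].CRC ^= 1 // damage the second message

	msgs, dropped := recoverRecords(recs, 1)
	fmt.Println("dropped non-message records:", dropped)
	for _, m := range msgs {
		fmt.Printf("seq=%d invalid=%v\n", m.Seq, m.Invalid)
	}
}
```

The open question from the comment remains: the `Invalid` flag only records that a slot was lost, and how a subscriber should see such a gap is a separate design decision.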

Another issue: if a record is corrupted to the point where the file store cannot tell whether the next one is fine, what do we do? Each record has a size; if that value is wrong, the file store will jump into the middle of another record and fail to recover that one too, and so on. Fortunately, we have the index file, which also contains the size of each record and is made of fixed-size records, so by combining the two we may be better able to detect whether only one record or several are corrupted.
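The cross-check between the two files could look roughly like the Go sketch below. The layout is an assumption made up for illustration (a 4-byte little-endian size header per data record, and 12-byte index entries holding an offset and a size); it is not the server's real on-disk format, and `appendRecord` and `findCorrupt` are hypothetical names. Because each index entry has a fixed size and stores its own offset, a bad size header in one data record does not prevent locating the next one.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Hypothetical layout, for illustration only:
//   data file:  [4-byte little-endian size][payload], repeated
//   index file: fixed 12-byte entries, [8-byte offset][4-byte size]
const (
	hdrSize   = 4
	entrySize = 12
)

// appendRecord appends a record to data and its matching entry to index.
func appendRecord(data, index, payload []byte) ([]byte, []byte) {
	var e [entrySize]byte
	binary.LittleEndian.PutUint64(e[0:8], uint64(len(data)))
	binary.LittleEndian.PutUint32(e[8:12], uint32(len(payload)))
	index = append(index, e[:]...)

	var h [hdrSize]byte
	binary.LittleEndian.PutUint32(h[:], uint32(len(payload)))
	data = append(data, h[:]...)
	data = append(data, payload...)
	return data, index
}

// findCorrupt returns the 0-based positions of records whose size header in
// the data file disagrees with the index, or that run past the end of data.
func findCorrupt(data, index []byte) []int {
	var bad []int
	for i := 0; (i+1)*entrySize <= len(index); i++ {
		e := index[i*entrySize : (i+1)*entrySize]
		off := binary.LittleEndian.Uint64(e[0:8])
		size := binary.LittleEndian.Uint32(e[8:12])

		// Bounds check first, written to avoid overflow on a bad offset.
		if off > uint64(len(data)) || uint64(len(data))-off < hdrSize+uint64(size) {
			bad = append(bad, i)
			continue
		}
		if binary.LittleEndian.Uint32(data[off:off+hdrSize]) != size {
			bad = append(bad, i)
		}
	}
	return bad
}

func main() {
	var data, index []byte
	for _, p := range []string{"alpha", "bravo", "charlie"} {
		data, index = appendRecord(data, index, []byte(p))
	}

	// Damage the size header of the second record (it starts at offset 9).
	data[9] = 0xFF

	// Only the second record is reported; the third is still located
	// through its index entry.
	fmt.Println(findCorrupt(data, index)) // prints [1]
}
```

This only detects a size mismatch. It says nothing about a payload that was damaged while its header stayed intact, and it assumes the index file itself can be trusted.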

I will leave this issue open and try to come up with a viable solution. Any feedback you may have is welcome.