@vladsud @tanviraumi @marcmasmsft
crc32, I guess :) Too bad I don't see browsers providing one. sha1 seems to be a bit of an overkill.
One thing to consider - someone (Gary or Tanvir) mentioned that ops may get overwritten (from the client's POV) with the same content but a different timestamp. We should double click on that (and on whether the timestamp should be there at all, and whether it should be part of the calculation). I find the timestamp very useful for debugging, but wasteful from a storage perspective (especially taking compression into account).
looping in more folks @GaryWilber @curtisman
Right now we have similar but different concepts with term and epoch, and this will add a third with the rolling hash. I don't have a deep understanding of them all, but I think with epoch + rolling hash we may be able to drop term.
Worth pointing out how we attempt to do a similar thing today in the loader layer without any extra help from the service. This mechanism provides certain protections similar to a rolling hash, without adding the cost of bigger ops or additional complexity on the service side. It's possible that it could be improved to give similar guarantees at a lower cost to the system.
If we have processed any ops, then we remember the last processed op. When fetching ops from storage we always fetch one extra op to overlap with the last processed op. Then any time we have such an overlap (including fetching ops from storage, or receiving duplicate ops from the socket), we validate that the key properties of the op being processed match the previous op.
Op signature extraction:
// returns parts of message (in string format) that should never change for a given message.
// Used for message comparison. It attempts to avoid comparing fields that potentially may differ.
// for example, it's not clear if serverMetadata or timestamp property is a property of message or server state.
// We only extract the most obvious fields that are sufficient (with high probability) to detect sequence number
// reuse.
// Also payload goes to telemetry, so no PII, including content!!
// Note: It's possible for a duplicate op to be broadcasted and have everything the same except the timestamp.
private comparableMessagePayload(m: ISequencedDocumentMessage) {
    return `${m.clientId}-${m.type}-${m.minimumSequenceNumber}-${m.referenceSequenceNumber}-${m.timestamp}`;
}
Op injection & comparison:
// Validate that we do not have data loss, i.e. sequencing is reset and started again
// with numbers that this client already observed before.
if (this.previouslyProcessedMessage?.sequenceNumber === message.sequenceNumber) {
    const message1 = this.comparableMessagePayload(this.previouslyProcessedMessage);
    const message2 = this.comparableMessagePayload(message);
    if (message1 !== message2) {
        const error = new NonRetryableError(
            "Two messages with same seq# and different payload!",
            DriverErrorType.fileOverwrittenInStorage,
            {
                clientId: this.connection?.clientId,
                sequenceNumber: message.sequenceNumber,
                message1,
                message2,
            },
        );
        this.close(error);
    }
}
Fetching ops - fetchMissingDeltasCore():
assert(lastKnowOp === this.lastQueuedSequenceNumber, 0x0f1 /* "from arg" */);
let from = lastKnowOp + 1;

const n = this.previouslyProcessedMessage?.sequenceNumber;
if (n !== undefined) {
    // If we already processed at least one op, then we have this.previouslyProcessedMessage populated
    // and can use it to validate that we are operating on same file, i.e. it was not overwritten.
    // Knowing about this mechanism, we could ask for op we already observed to increase validation.
    // This is especially useful when coming out of offline mode or loading from
    // very old cached (by client / driver) snapshot.
    assert(n === lastKnowOp, 0x0f2 /* "previouslyProcessedMessage" */);
    assert(from > 1, 0x0f3 /* "not positive" */);
    from--;
}
Places where it has gaps relative to the structural approach (of using a rolling hash):
What the rolling hash solution and the solution described above have in common (assuming the gaps are fixed) is that we need the service to communicate to a client some extra data about the previous op (or ops) as part of the current op payload. The minimum bar here: PUSH has that info (and communicates it to both client & storage) whenever it returns ops to a client or pushes ops to storage. The rest can be calculated / validated by storage & client to verify that the op stream has never been altered / overwritten.
That all said, a rolling hash will catch more cases, including a file being overwritten in a way where some (last) op happened to be exactly the same as the corresponding op (same sequence number) of the prior file content. The chances of that happening are extremely low (especially taking into account things like timestamp and clientId that most ops have), but the probability is non-zero. Some aspects of that could be improved (e.g., server messages that have no clientId could include a generated guid somewhere to increase entropy).
Based on our prior discussion and Vlad's description above, it's safe to say that a rolling hash will be a more reliable measure for detecting data corruption. In addition, the rolling hash has the potential to replace "SPO Epoch" (@marcmasmsft will verify), so there are fewer variables to track for data reliability.
From a service POV, the development cost is not that much (@GaryWilber can give a better estimate for PUSH). We will have to calculate the rolling hash (most probably a CRC) in deli and include that in the op payload. When ops are uploaded to storage, the upload will include the hash that was sent during the previous upload. We will run one single well-established algorithm, which means clients/SPO can also calculate and verify on their side.
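To make the discussion concrete, here is a minimal sketch of chaining a CRC32 over the sequenced op stream; this is not the actual deli implementation, and the ISequencedOpLike shape, the stampRollingHash helper, and the choice of fields fed into the hash are all assumptions for illustration:

interface ISequencedOpLike {
    sequenceNumber: number;
    clientId: string | null;
    type: string;
    contents: unknown;
    rollingHash?: number; // hash of the stream up to and including this op
}

// Table-based CRC32 (IEEE polynomial).
const crcTable = (() => {
    const table = new Uint32Array(256);
    for (let i = 0; i < 256; i++) {
        let c = i;
        for (let k = 0; k < 8; k++) {
            c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
        }
        table[i] = c >>> 0;
    }
    return table;
})();

// Continues a CRC32 from a previous value (zlib-style), so hashes chain across ops.
// Note: byte-accurate only for ASCII payloads; a real implementation would hash UTF-8 bytes.
function crc32(data: string, previous: number = 0): number {
    let crc = ~previous >>> 0;
    for (let i = 0; i < data.length; i++) {
        crc = (crc >>> 8) ^ crcTable[(crc ^ data.charCodeAt(i)) & 0xff];
    }
    return ~crc >>> 0;
}

// What the sequencing service could do when stamping an op: hash the op's fields,
// seeded with the previous op's rolling hash, and attach the result to the payload.
function stampRollingHash(op: ISequencedOpLike, previousHash: number): number {
    const input = `${op.sequenceNumber}:${op.clientId}:${op.type}:${JSON.stringify(op.contents)}`;
    op.rollingHash = crc32(input, previousHash);
    return op.rollingHash;
}

Because the CRC is seeded with the previous op's hash, the value on op N depends on every op up to N, so any overwrite or reset of the stream changes all subsequent hashes.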
The bigger question is COGS. The industry standard is 8 bytes (although we might settle on 4 bytes since the corpus is limited), which means each op's size will increase by 8 bytes. We also discussed storing the hash for both the previous and the current op, which would double that size. We need to think about this storage amplification carefully.
On the CPU side, we now have to parse the content to hash it, which means CPU/memory cost will increase as well. Note that we do parse the content today, but that's a side effect of using a simple JSON.parse(). The server never looks into the content, which means we have the ability to optimize this flow and only parse partially (using a custom parser or binary schemas such as protobuf/avro). If we start calculating the hash based on content, the door for this optimization will be closed forever, especially if SPO gets rid of epoch.
We may decide not to include the content field and hash everything else, but I am not sure how useful that hash would be. Keeping the question open to get others' feedback.
It should replace epoch for ODSP; my thinking is that op uploads should include the sequence number of the last persisted op plus its rolling hash, while summaries should include the hash for the ref sequence number (plus the handle of the latest summary being replaced). Create-new calls should generate a (random?) starting hash for the file.
I'm unsure about the need to store both the previous and the current hash in the ops; it feels like we should only need the previous hash when uploading a range of ops, e.g. if uploading ops [101, 200], then the call should include the hash of op 100 for the OCC check; the ops themselves could expose the rolling hash up to and including their own payload. Similar to the sequence number today, the hash should be promoted when sending the payload to ODSP.
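As a rough sketch of what those two calls might carry under this proposal; all interface and field names here are illustrative, not an existing ODSP/PUSH contract:

interface IOpRangeUpload {
    // Inclusive range of ops in this upload, e.g. [101, 200].
    firstSequenceNumber: number;
    lastSequenceNumber: number;
    // Rolling hash of the last op already persisted (op 100 in the example),
    // used by storage as the optimistic-concurrency (OCC) check.
    precedingRollingHash: number;
    // Each op exposes the rolling hash up to and including its own payload.
    ops: { sequenceNumber: number; rollingHash: number; contents: unknown }[];
}

interface ISummaryUpload {
    // Sequence number the summary was generated at, and the rolling hash at that point.
    referenceSequenceNumber: number;
    referenceRollingHash: number;
    // Handle of the latest summary being replaced.
    parentHandle: string | undefined;
}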
Some questions to answer:
It would be great to start a (design) document where these and other questions could be answered.
Per earlier discussion, we would love this to be driven from the server side. Once we have a design and maybe a reference implementation, we can have the appropriate implementation on the client side.
At the moment, nobody is driving this work, so closing. Let's reopen when we have more clarity on when to restart this work.
Feature
Right now neither the server nor the client can validate the integrity of the op stream. The client does attempt some minimal integrity checks, by ensuring we don't see the same sequence number with different data, but this isn't sufficient for all cases. For instance, there are hard-to-detect cases on the server, such as rollover or failback, where it is hard to determine whether any uncommitted operations were lost.
Describe the solution you'd like
Add a rolling hash to the operation stream, such that it is applied when operations are stamped. This gives both the server and the client the ability to detect out-of-band changes to the content of the op stream, as the rolling hash will not be consistent across changes in the stream.
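As a rough illustration of the client side of this feature (the op shape and names are hypothetical, and hashOp stands in for whatever algorithm the service settles on), a client could verify the chain as ops arrive and treat any mismatch the same way the fileOverwrittenInStorage path is handled today:

interface IHashedOp {
    sequenceNumber: number;
    rollingHash: number; // hash of the stream up to and including this op
    contents: unknown;
}

class OpStreamVerifier {
    constructor(
        // Hash of the last op this client already verified (e.g. restored from a snapshot).
        private lastVerifiedHash: number,
        // The hash function agreed on with the service, chained from the previous hash.
        private readonly hashOp: (op: IHashedOp, previousHash: number) => number,
    ) {}

    // Returns true if the op extends the verified stream; false indicates the
    // op stream was altered/overwritten (or the client is looking at a different file).
    public process(op: IHashedOp): boolean {
        if (this.hashOp(op, this.lastVerifiedHash) !== op.rollingHash) {
            return false;
        }
        this.lastVerifiedHash = op.rollingHash;
        return true;
    }
}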