microsoft / FluidFramework

Library for building distributed, real-time collaborative web applications
https://fluidframework.com
MIT License

Rolling Hash for Op Stream Integrity and Consistency #5689

Closed anthony-murphy closed 2 years ago

anthony-murphy commented 3 years ago

Feature

Right now neither the server nor the client can validate the integrity of the op stream. The client does attempt some minimal integrity checks, by ensuring we don't see the same sequence number with different data, but this isn't sufficient for all cases. For instance, during rollover or failback it is hard for the server to determine whether any uncommitted operations were lost.

Describe the solution you'd like: Add a rolling hash to the operation stream, applied when operations are stamped. This gives both the server and the client the ability to detect out-of-band changes to the content of the op stream, as the rolling hash will not be consistent across changes in the stream.
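
Roughly, the idea is a hash chain: each op's hash covers the previous op's hash plus the op's own stable fields. Below is a minimal sketch; the `StampedOp` shape, the `stampOp`/`verifyStream` names, and the `hash` placeholder are all illustrative, not a proposed API:

    // Placeholder for whatever digest is chosen (crc32, sha-1, etc.).
    declare function hash(s: string): string;

    // Hypothetical shape: each sequenced op carries the rolling hash of the
    // stream up to and including itself.
    interface StampedOp {
        sequenceNumber: number;
        contents: string;
        rollingHash: string;
    }

    // Server side (at stamping time): chain the new op off the previous hash.
    function stampOp(prevHash: string, op: { sequenceNumber: number; contents: string }): StampedOp {
        return { ...op, rollingHash: hash(`${prevHash}:${op.sequenceNumber}:${op.contents}`) };
    }

    // Client or storage side: recompute the chain; an out-of-band change to
    // any one op breaks the hash of every subsequent op.
    function verifyStream(ops: StampedOp[], initialHash: string): boolean {
        let prev = initialHash;
        for (const op of ops) {
            if (hash(`${prev}:${op.sequenceNumber}:${op.contents}`) !== op.rollingHash) {
                return false;
            }
            prev = op.rollingHash;
        }
        return true;
    }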

anthony-murphy commented 3 years ago

@vladsud @tanviraumi @marcmasmsft

vladsud commented 3 years ago

crc32, I guess :) Too bad I do not see browsers providing one. sha1 seems to be a bit of overkill.
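
(A table-less CRC-32 is small enough to inline if we did go that route; a sketch of the standard bitwise IEEE variant, just to show the cost:)

    // Bitwise CRC-32 (IEEE polynomial, reflected). Browsers don't provide one
    // natively, but it's only a few lines; a table-based variant would be
    // faster at the cost of ~1KB of lookup table.
    function crc32(bytes: Uint8Array): number {
        let crc = 0xffffffff;
        for (const byte of bytes) {
            crc ^= byte;
            for (let i = 0; i < 8; i++) {
                // Shift right; xor in the polynomial iff the low bit was set.
                crc = (crc >>> 1) ^ (0xedb88320 & -(crc & 1));
            }
        }
        return (crc ^ 0xffffffff) >>> 0; // final inversion, kept unsigned
    }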

One thing to consider - someone (Gary or Tanvir) mentioned that ops may get overwritten (from the client's POV) with the same content but a different timestamp. We should double-click on that (and on whether the timestamp should be there at all, and whether it should be part of the hash calculation). I find the timestamp very useful for debugging, but wasteful from a storage perspective (especially taking compression into account).

anthony-murphy commented 3 years ago

looping in more folks @GaryWilber @curtisman

anthony-murphy commented 3 years ago

Right now we have similar but different concepts with term and epoch, and this will add a third with the rolling hash. I don't have a deep understanding of them all, but I think with epoch + rolling hash we may be able to drop term.

vladsud commented 3 years ago

Worth pointing out how we attempt to do a similar thing today in the loader layer without any extra help from the service. This mechanism provides certain protections similar to a rolling hash, without adding the cost of bigger ops or additional complexity on the service side. It's possible that it could be improved to have similar guarantees at a lower cost to the system.

If we have processed any ops, then we remember the last processed op. When fetching ops from storage we always fetch one extra op to overlap with the last processed op. Then any time we have such an overlap (including fetching ops from storage, or receiving duplicate ops from the socket), we validate that the key properties of the op being processed match the previous op.

Op signature extraction:

    // returns parts of message (in string format) that should never change for a given message.
    // Used for message comparison. It attempts to avoid comparing fields that potentially may differ.
    // for example, it's not clear if serverMetadata or timestamp property is a property of message or server state.
    // We only extract the most obvious fields that are sufficient (with high probability) to detect sequence number
    // reuse.
    // Also payload goes to telemetry, so no PII, including content!!
    // Note: It's possible for a duplicate op to be broadcasted and have everything the same except the timestamp.
    private comparableMessagePayload(m: ISequencedDocumentMessage) {
        return `${m.clientId}-${m.type}-${m.minimumSequenceNumber}-${m.referenceSequenceNumber}-${m.timestamp}`;
    }

Op injection & comparison:

    // Validate that we do not have data loss, i.e. sequencing is reset and started again
    // with numbers that this client already observed before.
    if (this.previouslyProcessedMessage?.sequenceNumber === message.sequenceNumber) {
        const message1 = this.comparableMessagePayload(this.previouslyProcessedMessage);
        const message2 = this.comparableMessagePayload(message);
        if (message1 !== message2) {
            const error = new NonRetryableError(
                "Two messages with same seq# and different payload!",
                DriverErrorType.fileOverwrittenInStorage,
                {
                    clientId: this.connection?.clientId,
                    sequenceNumber: message.sequenceNumber,
                    message1,
                    message2,
                },
            );
            this.close(error);
        }
    }

Fetching ops - fetchMissingDeltasCore():

    assert(lastKnowOp === this.lastQueuedSequenceNumber, 0x0f1 /* "from arg" */);
    let from = lastKnowOp + 1;

    const n = this.previouslyProcessedMessage?.sequenceNumber;
    if (n !== undefined) {
        // If we already processed at least one op, then we have this.previouslyProcessedMessage populated
        // and can use it to validate that we are operating on same file, i.e. it was not overwritten.
        // Knowing about this mechanism, we could ask for op we already observed to increase validation.
        // This is especially useful when coming out of offline mode or loading from
        // very old cached (by client / driver) snapshot.
        assert(n === lastKnowOp, 0x0f2 /* "previouslyProcessedMessage" */);
        assert(from > 1, 0x0f3 /* "not positive" */);
        from--;
    }

Places where it has gaps relative to the structural approach (of using a rolling hash):

  1. If the snapshot does not include any trailing ops (ODSP), or the service does not use snapshots at all (FRS), then we have no previous ops to compare against the current state of the service. Snapshot caching may make this problem worse, as the time between fetching the snapshot and fetching the next op might be counted in days.
    • We could store the last processed op as part of the snapshot payload (see the sketch after this list). Fetching ops prior to the snapshot might fail, as the service might be deleting ops (though that should not happen today with 1-2 caching of snapshots and 30 days of ops always preserved in ODSP, it is something to consider). That all said, from another angle it's not really an issue for ODSP (or rather, the risk of not detecting a file overwrite is very small): we do cache ops in a session, so a prior session that cached the snapshot also cached some number of ops, and the last of those ops would be compared to the state of the service, per the above, at a later time.
  2. There might be no op gap between the current state of the Container and the first op delivered by the socket. While we do preemptively fetch ops from storage as part of the connection flow, the storage response might come after the first ops from the socket, and that storage response will be ignored.
    • While this is definitely a gap, it's worth pointing out that the majority of cases where a file is overwritten / restored will result in a substantial change to the current sequence number. The chance of matching the state of some client is rather low.
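
A sketch of the mitigation in item 1 (storing the last processed op as part of the snapshot payload); the `SnapshotWithAnchor` shape and function names here are hypothetical:

    // Hypothetical: the summarizing client persists the last op it processed
    // next to the snapshot, so a loading client always has an overlap op to
    // validate against, even when the service returns no trailing ops or the
    // snapshot sat in a cache for days.
    interface SnapshotWithAnchor {
        snapshot: unknown; // the usual summary payload
        lastProcessedOp: ISequencedDocumentMessage;
    }

    // On load, fetch ops starting at the anchor's sequence number and compare
    // the overlap op with what was stored, using the same comparable-payload
    // logic shown above.
    function validateAnchor(
        anchor: ISequencedDocumentMessage,
        fetchedOverlapOp: ISequencedDocumentMessage,
        comparable: (m: ISequencedDocumentMessage) => string,
    ): boolean {
        return anchor.sequenceNumber !== fetchedOverlapOp.sequenceNumber
            || comparable(anchor) === comparable(fetchedOverlapOp);
    }
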
vladsud commented 3 years ago

And the part that is common to the rolling hash solution and the solution described above (assuming we fix the gaps) is that we need the service to communicate to the client some extra data about the previous op (or ops) as part of the current op payload. The minimum bar here: PUSH has that info (and communicates it to both client & storage) whenever it returns ops to a client or pushes ops to storage. The rest can be calculated / validated by storage and clients to verify that the op stream has never been altered / overwritten.
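
As a sketch, that minimal extra data might look like a fingerprint of the predecessor carried on each op (field and function names hypothetical):

    // Hypothetical wire shape: each sequenced op carries a fingerprint of its
    // predecessor, so both client and storage can check continuity locally.
    interface IOpWithPrevFingerprint {
        sequenceNumber: number;
        contents: unknown;
        prevOpFingerprint: string;
    }

    function checkContinuity(
        prev: IOpWithPrevFingerprint,
        next: IOpWithPrevFingerprint,
        fingerprint: (op: IOpWithPrevFingerprint) => string,
    ): boolean {
        // Sequence numbers must be contiguous, and the fingerprint must match
        // what this client computed for the op it already processed.
        return next.sequenceNumber === prev.sequenceNumber + 1
            && next.prevOpFingerprint === fingerprint(prev);
    }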

That all said, a rolling hash will catch more cases, including a file being overwritten in a way where some (last) op happened to be exactly the same as the same op (same sequence number) in the prior file content. The chances of that happening are extremely low (especially taking into account such things as the timestamp and clientId that most ops have), but it's a non-zero probability. Some aspects of that could be improved (like having server messages that lack a clientId include a generated guid somewhere to increase entropy).

tanviraumi commented 3 years ago

Based on our initial discussion and Vlad's description above, it's safe to say that a rolling hash will be a more reliable measure for detecting data corruption. In addition, the rolling hash has the potential to replace "SPO Epoch" (@marcmasmsft will verify), leaving fewer variables to track for data reliability.

From a service POV, the development cost is not that much (@GaryWilber can give a better estimate for PUSH). We will have to calculate the rolling hash (most probably a CRC) in deli and include it in the op payload. When ops are uploaded to storage, the upload will include the hash that was sent during the previous upload. We will run one single well-established algorithm, which means clients/SPO can also calculate and verify it on their side.

The bigger question is COGS. The industry standard is 8 bytes (although we might settle on 4 bytes since the corpus is limited), which means each op's size will increase by 8 bytes. We also discussed storing the hash for both the previous and current op, which would double that size. We need to think about this storage amplification carefully.

On the CPU side, we now have to parse the content to hash it, which means CPU/memory cost will increase as well. Note that we do parse the content today, but that's a side effect of using a simple JSON.parse(). The server never looks into the content, which means we have the ability to optimize this flow and only parse partially (using a custom parser or binary schemas such as protobuf/avro). If we start calculating the hash based on content, the door for this optimization will be closed forever, especially if SPO gets rid of epoch.

We may decide to not include the content field and hash everything else, but I am not sure how useful that hash would be. Keeping the question open to get others' feedback.
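
One way to make that option concrete (hypothetical, mirroring comparableMessagePayload above): hash only the envelope fields and leave contents opaque. That preserves the option of never fully parsing content on the server, but would not detect corruption inside the content field itself.

    // Hypothetical envelope-only fingerprint: the content field is excluded,
    // so the server need not parse it, but content corruption goes undetected.
    function envelopeFingerprint(m: ISequencedDocumentMessage): string {
        return `${m.clientId}-${m.type}-${m.sequenceNumber}` +
            `-${m.referenceSequenceNumber}-${m.minimumSequenceNumber}`;
    }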

marcmasmsft commented 3 years ago

It should replace epoch for ODSP; my thinking is that op uploads should include the sequence number of the last persisted op plus its rolling hash, while summaries should include the hash for the ref sequence number (plus the handle of the latest summary being replaced). "Create new" calls should generate a (random?) starting hash for the file.

I'm unsure about the need to store both the previous and current hash in the ops; it feels like we should only need the previous hash when uploading a range of ops. E.g., if uploading ops [101,200], the call should include the hash of op 100 for the OCC check; the ops themselves could expose the rolling hash up to and including their own payload. Similar to the sequence number today, the hash should be promoted when sending the payload to ODSP.
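
A sketch of that upload shape (names hypothetical):

    // Hypothetical request for uploading ops [101, 200]: the caller proves it
    // observed the current tail of the stream by including the rolling hash of
    // op 100; storage rejects the write if its own hash for op 100 differs
    // (an optimistic concurrency check). Each uploaded op still exposes the
    // rolling hash up to and including its own payload.
    interface IOpRangeUpload {
        fromSequenceNumber: number; // 101
        toSequenceNumber: number;   // 200
        precedingOpHash: string;    // rolling hash of op 100, used for OCC
        ops: ISequencedDocumentMessage[];
    }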

vladsud commented 3 years ago

Some questions to answer:

  1. Is this an optional feature? I.e., can a storage provider choose not to implement it?
    • Can / should it be a per-file choice? (Maybe some specific modes, like ODSP versions or something else, decide to strip that data.)
  2. What is the impact on space / COGS?
    • It would be great to take a sample file and estimate it, including the impact on the compressed outcome, as this affects both storage size and download payload size.
  3. Any impact on import/export of files from storage? Similarly, there was a (non-implemented) idea of a summary overwriting a file (i.e., forcing all clients to reload from it; i.e., the summary is not really a function of the ops preceding it).
  4. Is the algorithm documented? Is it the same for all storages?
    • Canonical representation of content - how do we define it, and is it a requirement?

It would be great to start a (design) document where these and other questions could be answered.

vladsud commented 3 years ago

Per earlier discussion, we would love this to be driven from the server side. Once we have a design and maybe a reference implementation, we can have an appropriate implementation on the client side.

vladsud commented 2 years ago

At the moment, nobody is driving this work, so closing. Let's reopen when we have more clarity on when to restart it.