rocicorp / repc

The canonical Replicache client, implemented in Rust.

DBReadError(MapLoadError(CorruptChunk(Corrupt("missing key")))) #351

Open phritz opened 3 years ago

phritz commented 3 years ago

https://rocicorp.slack.com/archives/C01JJGGS6CU/p1621426062298400

UnhandledRejection
Non-Error promise rejection captured with value: DBReadError(MapLoadError(CorruptChunk(Corrupt("missing key"))))
Pull returned: PullFailed(FetchFailed(RequestTimeout(TimeoutError { _private: () })))
logger: console
arguments: ["Pull returned: PullFailed(FetchFailed(RequestTimeout(TimeoutError { _private: () })))"]

phritz commented 3 years ago

image

phritz commented 3 years ago

fyi 11 occurrences over 8 users

phritz commented 3 years ago

In debugging this I discovered a separate annoyance: https://github.com/rocicorp/repc/issues/354

phritz commented 3 years ago

The line throwing the error is here: https://github.com/rocicorp/repc/blob/0319a6844c1abeede23763c87dd45a9083bf400f/src/prolly/leaf.rs#L42. The key in the LeafEntry proto is None. This happens when we open a transaction and read the main head: the main head chunk is corrupt in this way. However, here's where we create the proto, and it does not look possible for it to write None: https://github.com/rocicorp/repc/blob/0319a6844c1abeede23763c87dd45a9083bf400f/src/prolly/leaf.rs#L59. I can't find anywhere else where we construct this proto (other than tests). I also don't see how there could be a Replicache-level bug in how we read the proto, which is here: https://github.com/rocicorp/repc/blob/0319a6844c1abeede23763c87dd45a9083bf400f/src/prolly/leaf.rs#L27. We're just iterating the entries in the proto; there's literally nothing else going on.
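
For anyone following along, the shape of the check that fires can be sketched roughly as below. This is not the repc code: `RawEntry`, `LoadError`, and `check_entries` are hypothetical stand-ins, and the only assumption taken from the links above is that a deserialized entry exposes its key as an optional field and the loader rejects the chunk when it is absent.

```rust
// Hypothetical stand-ins for illustration; these are NOT the actual repc types.
#[derive(Debug, PartialEq)]
enum LoadError {
    Corrupt(&'static str),
}

/// Stand-in for a deserialized leaf entry, where each field may be absent on the wire.
struct RawEntry<'a> {
    key: Option<&'a [u8]>,
    val: Option<&'a [u8]>,
}

/// Walks the entries and rejects the whole chunk as soon as a required field is absent.
/// Returns the number of entries when every entry has both a key and a value.
fn check_entries(entries: &[RawEntry<'_>]) -> Result<usize, LoadError> {
    for e in entries {
        e.key.ok_or(LoadError::Corrupt("missing key"))?;
        e.val.ok_or(LoadError::Corrupt("missing val"))?;
    }
    Ok(entries.len())
}
```

The point of the sketch is that the reader side has only one way to produce "missing key": the bytes it was handed decode to an entry with no key. That pushes suspicion toward what was stored or read back rather than toward the iteration itself.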

I don't see a pattern with what happens in the logs just before it hits this error, other than pushes and pulls completing just before. The 18 occurrences of the error were not limited to one user, they were spread across 14 users.

I'm wondering if it really is the chunk's bytes being corrupted somehow. But that's a bit of a stretch: the data would have to be corrupted in such a way that it still parses correctly as a proto. There are no map load or corrupt chunk errors other than this one. If it were being corrupted with random data, I would expect it to fail to parse at all at least some of the time. But we don't see that. Perhaps the data is being partially written? Or partially overwritten?

Something that I did notice is that 18 out of 18 occurrences of this error are on version 91.0.4472 of Chrome on mobile, which I think is a newish version. (They are 89% Chrome Mobile 91.0.4472, with the rest Chrome Mobile WebView 91.0.4472.) @arv @aboodman is there a clue in that maybe? Seems a pretty clear indicator of... something.

As for what to do next I'm open to suggestions but thinking:

  1. Improve the logging/error so that we get the chunk hash and bytes when this happens, and then get it into users' hands if we can.
  2. Go through the flatbuffers bug reports and see if anything jumps out.
  3. Carefully read the memstore and prolly map code to see if anything looks suspicious. For example, I can imagine that if a map entry gets aliased and is accessed without synchronization, then we could read a partially written value. (But Rust should make this hard, so...)
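
As a rough sketch of item 1, the error could carry the chunk hash and a hex dump of the raw bytes so that a single log line is enough to reproduce the failure offline. Everything here is hypothetical (`CorruptChunk`, its fields, and `to_hex` are illustrative names, not existing repc APIs), and it assumes the hash is available as a string at the point where the chunk is read.

```rust
use std::fmt;

/// Hypothetical error type (not an existing repc type) that keeps enough
/// context to reproduce a corrupt-chunk failure from a log line alone.
#[derive(Debug)]
struct CorruptChunk {
    reason: &'static str,
    chunk_hash: String,
    bytes: Vec<u8>,
}

/// Renders bytes as lowercase hex, two digits per byte, with no separators.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

impl fmt::Display for CorruptChunk {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "corrupt chunk {}: {} ({} bytes: {})",
            self.chunk_hash,
            self.reason,
            self.bytes.len(),
            to_hex(&self.bytes)
        )
    }
}
```

Large chunks would make for long log lines, so in practice the dump might need a size cap or a separate upload path; that tradeoff is left open here.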

phritz commented 3 years ago

Suggestion from Aaron, which I think is good: try to craft the minimal byte array that yields this error.