Open phritz opened 3 years ago
fyi 11 occurrences over 8 users
In debugging this I discovered a separate annoyance: https://github.com/rocicorp/repc/issues/354
The line throwing the error is here: https://github.com/rocicorp/repc/blob/0319a6844c1abeede23763c87dd45a9083bf400f/src/prolly/leaf.rs#L42. The key in the leafentry proto is None. This happens when we do an opentransaction and read the main head: the main head chunk is corrupt in this way. However, here is where we create the proto, and it does not look possible for it to write None: https://github.com/rocicorp/repc/blob/0319a6844c1abeede23763c87dd45a9083bf400f/src/prolly/leaf.rs#L59. I can't find anywhere else where we construct this proto (other than tests). I also don't see how there could be a replicache-level bug in how we read the proto, which is here: https://github.com/rocicorp/repc/blob/0319a6844c1abeede23763c87dd45a9083bf400f/src/prolly/leaf.rs#L27. We're just iterating the entries in the proto; there's literally nothing else going on.
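For anyone reading along without the code open, here is a rough sketch of the shape of that read path. This is not the actual leaf.rs code: `RawEntry`, `LoadError` and `load_entries` are hypothetical names, and it assumes only that the decoded entry exposes its key and value as optional fields.

```rust
/// Hypothetical stand-in for a decoded leaf entry. Both fields are optional,
/// as they would be in a wire format that allows a field to be absent.
#[derive(Debug, Clone, PartialEq)]
pub struct RawEntry {
    pub key: Option<Vec<u8>>,
    pub val: Option<Vec<u8>>,
}

/// Hypothetical error type; the real error names in repc may differ.
#[derive(Debug, PartialEq)]
pub enum LoadError {
    MissingKey { index: usize },
    MissingValue { index: usize },
}

/// Walks the entries in order and rejects the first one with an absent field.
pub fn load_entries(raw: &[RawEntry]) -> Result<Vec<(Vec<u8>, Vec<u8>)>, LoadError> {
    let mut out = Vec::with_capacity(raw.len());
    for (index, entry) in raw.iter().enumerate() {
        // A `None` key here is the failure mode described above.
        let key = entry.key.clone().ok_or(LoadError::MissingKey { index })?;
        let val = entry.val.clone().ok_or(LoadError::MissingValue { index })?;
        out.push((key, val));
    }
    Ok(out)
}
```

With a loop this simple, the only way to hit the missing-key branch is for the decoded entry itself to come back without a key, which is why the stored bytes are the main suspect.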
I don't see a pattern with what happens in the logs just before it hits this error, other than pushes and pulls completing just before. The 18 occurrences of the error were not limited to one user, they were spread across 14 users.
I'm wondering if it really is the chunk's bytes being corrupted somehow. But that's a bit of a stretch: the data would have to be corrupted in such a way that it still parses correctly as a proto. There are no map load or corrupt chunk errors other than this one. If it were being corrupted with random data, I would expect it to fail to parse at all at least some of the time. But we don't see that. Perhaps the data is being partially written? Or partially overwritten?
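To make the "partially overwritten but still parses" idea concrete, here is a toy illustration. The format below is invented for this example and is not repc's actual encoding: an entry starts with two one-byte offsets (key, then value), an offset of 0 means "field absent", and each field is a one-byte length followed by that many bytes. `ParseError`, `read_field` and `parse_entry` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub enum ParseError {
    OutOfBounds,
}

/// Reads a length-prefixed field at `off`, treating offset 0 as "absent".
fn read_field(buf: &[u8], off: u8) -> Result<Option<&[u8]>, ParseError> {
    if off == 0 {
        return Ok(None);
    }
    let start = off as usize;
    let len = *buf.get(start).ok_or(ParseError::OutOfBounds)? as usize;
    let bytes = buf
        .get(start + 1..start + 1 + len)
        .ok_or(ParseError::OutOfBounds)?;
    Ok(Some(bytes))
}

/// Parses one toy entry: byte 0 is the key offset, byte 1 is the value offset.
pub fn parse_entry(buf: &[u8]) -> Result<(Option<&[u8]>, Option<&[u8]>), ParseError> {
    if buf.len() < 2 {
        return Err(ParseError::OutOfBounds);
    }
    let key = read_field(buf, buf[0])?;
    let val = read_field(buf, buf[1])?;
    Ok((key, val))
}
```

In this toy format, `[2, 4, 1, b'k', 1, b'v']` parses to key `k` and value `v`. Zeroing only the first byte still parses cleanly but yields a `None` key, while setting it to a random large value such as 200 fails with `OutOfBounds`. If the real encoding behaves similarly, zeroed or unwritten bytes would fit the observed errors better than random garbage.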
Something that I did notice is that 18 out of 18 occurrences of this error are on Chrome Mobile 91.0.4472, which I think is a newish version. (They are 89% Chrome Mobile 91.0.4472 and Chrome Mobile WebView 91.0.4472). @arv @aboodman is there a clue in that maybe? Seems a pretty clear indicator of... something.
As for what to do next I'm open to suggestions but thinking:
Suggestion from Aaron, which I think is good: try to craft the minimal byte array that yields this error.
https://rocicorp.slack.com/archives/C01JJGGS6CU/p1621426062298400