sandstorm-io / blackrock

Cluster management
Apache License 2.0

FilesystemStorage hits concurrent assignment #18

Open zarvox opened 8 years ago

zarvox commented 8 years ago

I left a tab with Oasis open, with a single Wekan grain open, and didn't look at it for a while. I suspended and resumed my laptop, biked home, and changed networks. When I looked back at the tab in question, I saw this exception:

Error: Error: remote exception: remote exception: remote exception: remote exception: remote exception: Assignable modified concurrently
C++ location:(javascript):??
type: disconnected

A little digging suggests this is from https://github.com/sandstorm-io/blackrock/blob/master/src/blackrock/fs-storage.c%2B%2B#L1559..L1575 :

      if (expectedVersion > 0) {
        if (object.version != expectedVersion) {
          return KJ_EXCEPTION(DISCONNECTED, "Assignable modified concurrently");
        }
        ++expectedVersion;
      }


kentonv commented 8 years ago

By any chance did you have the grain open in multiple browser windows?

zarvox commented 8 years ago

I think I may have had it open at some point in there on a second (different profile) browser.

kentonv commented 8 years ago

OK, my guess is that both browsers asked the server to re-establish the connection at the same time (when the network came back), causing two different front-ends to try to start up the grain at the same time, and one of them lost the optimistic concurrency race.

To fix this, we need to insert a retry somewhere.
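For illustration, here is a minimal sketch of what such a retry could look like. It is not Blackrock's actual code: `VersionedCell`, `trySet`, `updateWithRetry`, and the `beforeWrite` hook are hypothetical names, and the store is a single-threaded in-memory stand-in that only mimics the version check quoted above. The idea is that a caller who loses the race re-reads the current version, recomputes its change, and tries again instead of surfacing the error.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical in-memory stand-in for a versioned storage object.
// This is NOT the Blackrock storage API, only an illustration.
struct VersionedCell {
  std::uint64_t version = 1;
  std::string value;

  // Compare-and-set: succeeds only if the caller saw the latest version.
  bool trySet(std::uint64_t expectedVersion, const std::string& newValue) {
    if (version != expectedVersion) {
      return false;  // Someone else wrote first ("modified concurrently").
    }
    value = newValue;
    ++version;
    return true;
  }
};

// Retry loop around the optimistic write. `update` computes the new value
// from the current one. `beforeWrite` is a hypothetical hook that lets a
// demo or test simulate a competing writer between the read and the write.
bool updateWithRetry(VersionedCell& cell,
                     const std::function<std::string(const std::string&)>& update,
                     int maxAttempts,
                     std::function<void()> beforeWrite = nullptr) {
  assert(maxAttempts > 0);
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    const std::uint64_t seen = cell.version;      // Snapshot the version.
    const std::string next = update(cell.value);  // Compute from the snapshot.
    if (beforeWrite) beforeWrite();               // Window for a racing writer.
    if (cell.trySet(seen, next)) {
      return true;                                // Won the race.
    }
    // Lost the race: loop to re-read the new version and try again.
  }
  return false;  // Gave up after maxAttempts conflicts.
}
```

In this sketch a conflict is reported as a `false` return, and only repeated conflicts reach the caller. Where the retry belongs in the real call chain, and how many attempts are reasonable, is left open here.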

paulproteus commented 8 years ago

FWIW I think I ran into this just now without having my https://oasis.sandstorm.io/grain/SHcGmtX3NaDHmisZqrNpsd/ grain open in multiple browsers.

paulproteus commented 8 years ago

Thanks hugely to @zarvox for doing the grep to find this string. I was wondering about it but didn't think to grep blackrock as well as sandstorm.

paulproteus commented 8 years ago

My laptop did go through a suspend-resume cycle, fwiw, so that's similar to @zarvox's story.

zarvox commented 8 years ago

This happened again, except this time I only had the grain open in one location. It's possible that a different user who also has access to the grain had it open when this was triggered, but they did not appear to be connected to the grain when I refreshed my browser window (Wekan shows currently connected users).

How likely do you think it is that something else is going on here?

zarvox commented 8 years ago

The suspend-resume cycle seems to be a constant in this buggy setup.

kentonv commented 8 years ago

I'm 90% confident this is optimistic concurrency working as intended -- except that the caller is failing to retry as it should.