%ames reload on ~zod didn't sync to network

urbit / arvo

https://github.com/urbit/urbit

110 stars 58 forks source link

%ames reload on ~zod didn't sync to network #501

Closed belisarius222 closed 6 years ago

belisarius222 commented 6 years ago

Trying a |reset now to force the issue, but this feels broken. Not sure what went wrong here.

ohAitch commented 6 years ago

Talk seems to be crashing in response to an update hall is sending in response to the merge(?), thereby preventing the merge from occurring. @fang- help

On Wed, Dec 13, 2017 at 6:32 PM, Ted Blackman notifications@github.com wrote:

Trying a |reset now to force the issue, but this feels broken. Not sure what went wrong here.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/501, or mute the thread https://github.com/notifications/unsubscribe-auth/ABxXhgrvwvJee65BmP3WYcJZFobWCv9-ks5tAIjFgaJpZM4RBded .

belisarius222 commented 6 years ago

Also, it should be noted that performing a |reset on ~zod will not actually cause anything to sync to other ships. Only Clay changes will cause that to happen, and |reset doesn't modify Clay.

Fang- commented 6 years ago

If talk's just printing a stacktrace pointing to line 461, then that is a known issue. Shouldn't actually break anything though.

Didn't see this stack trace in ~zod's scrollback, and its talk seems to be working fine (it can send messages to its own inbox and have those show up), so I'm not sure what's causing the sync to not go through. Will investigate further.

Fang- commented 6 years ago

It seems ~marzod and ~samzod got the update. (I can type ;sources %something into their talks.)
|mergeing their %kids desk into an old comet applies the update just fine. I'm assuming that their planets got the update as well.

~wanzod and ~binzod have not gotten the update, and are both showing the mentioned stacktrace along with the following:

  [%swim-take-vane %g %unto ~]
  [%gall-take /~wanzod/use/talk/~wanzod/out/hall/server/inbox]
[%fail-ack 0]
crude: all delivery failed!
  [%swim-take-vane %g %unto ~]
  [%gall-take /~wanzod/use/talk/~wanzod/out/hall/server/inbox]
crude: punted

I'll get to work on finding a reproduction case and fixing that talk error.

Fang- commented 6 years ago

Breaking syncs in four easy steps:

Boot up a fakezod and accompanying star. Latest arvo is fine.
On the star, send a message. Default audience (inbox) is fine.
On the star, run :talk 'screw' in dojo. This should print 'screwing things up...'. This will cause the below to runtime error because of an out-of-bounds snag.
On the fakezod, make the following change to /app/talk.hoon:

  [~ ..prep(+<+ u.old)]

in ++prep, on or around line 123, needs to become:

  :_  ..prep(+<+ u.old)
  :~  [0 %pull /server/client server ~]
      [0 %pull /server/inbox server ~]
      peer-client
      peer-inbox
  ==

Then watch as the fakezod pushes the change to the star, and the star spits some errors. Note how changing a file on fakezod now no longer pushes it down to the star.

For completeness, the errors it spits out (excluding the talk runtime error):

  [%swim-take-vane %g %unto ~]
  [%gall-take /~palzod/use/talk/~palzod/out/hall/server/inbox]
[%fail-ack 0]
crude: all delivery failed!
  [%swim-take-vane %g %unto ~]
  [%gall-take /~palzod/use/talk/~palzod/out/hall/server/inbox]
crude: punted
sync succeeded from %base on ~palzod to %kids

This is not the exact same thing as what happened on the stars. The error messages were slightly different there (%c %writ instead of %g %unto, also saying %not-actually-merging) and they were left in a state where |syncs didn't give the usual 3-item list output.
Since this also stops syncs from working properly though, I figured this might be a case worth investigating further.

Maybe interesting to note that doing an |merge %base ~zod %kids on the star causes an %already-merging-from-somewhere error. Doing an |unsync %base ~zod %kids, and then doing a |sync for that again causes an endless stream of them.

Fang- commented 6 years ago

And to get the printfs that also appeared on the stars:

Do the above
:talk 'rebuild' from dojo to resolve the talk stacktrace
|cancel %base to cancel the merge that apparently got stuck
Wait for syncing to happen. Should print:

sync succeeded from %base on ~palzod to %kids
sync succeeded from %base on ~palzod to %home
sync succeeded from %kids on ~zod to %base
  [%swim-take-vane %c %writ ~]
  %not-actually-merging
[%fail-ack 0]
crude: all delivery failed!
  [%swim-take-vane %c %writ ~]
  %not-actually-merging
crude: punted
sync succeeded from %base on ~palzod to %kids
sync succeeded from %base on ~palzod to %home
sync succeeded from %kids on ~zod to %base
  [%swim-take-vane %c %writ ~]
  %not-actually-merging
[%fail-ack 0]

belisarius222 commented 6 years ago

Should we separate the notification of file change to a new event? That way the sync would still succeed; it would just cause the next event to fail, without leaving Clay in a weird state. Conceptually, I'm not sure a filesystem merge should fail just because the result caused an error in a different part of the system.

ohAitch commented 6 years ago

imo the merge from kids to base should succeed and the merge from base to home should fail, but I guess we already crossed this bridge upon disabling the similar userspace behavior of "non-compiling app source code changes are not accepted"

On Mon, Dec 18, 2017 at 4:15 PM, Ted Blackman notifications@github.com wrote:

Should we separate the notification of file change to a new event? That way the sync would still succeed; it would just cause the next event to fail, without leaving Clay in a weird state. Conceptually, I'm not sure a filesystem merge should fail just because the result caused an error in a different part of the system.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/501#issuecomment-352599112, or mute the thread https://github.com/notifications/unsubscribe-auth/ABxXhq1nJ1PjpccmASRxLzkf4Je1RO_pks5tBwAIgaJpZM4RBded .

belisarius222 commented 6 years ago

Ok, let's see if I understand you correctly:

In an older version of the system, if you tried to merge in app source code that failed to compile, the merge would failed. That was changed, and now if you merge in app source code that fails to compile, the merge still succeeds. In keeping with this more permissive attitude, it would be consistent to also allow the parent->child syncing to succeed even if it causes userspace code to fail.

Did I get that right? If so, could you explain your reasoning behind wanting the merge from base to home to fail?

I also have a slight worry about potential race conditions if the callbacks for a filesystem sync could happen after some other events. What if those callbacks need to update some other part of the Arvo state? If some other event runs between the merge and those callbacks, then it could potentially read from both Clay and the other part of the system that was supposed to match, but instead read from them in an inconsistent state.

For example, let's say there was an app that kept as part of its state a tally of the number of files in a Clay desk. It's subscribed to notifications on this desk so it can rerun this tally whenever the desk changes. If a sync succeeds that changes the number of files in a desk, then before the app gets notified of the sync, some other event could run that accesses the app's tally and also accesses Clay. In that case, the tally wouldn't match the number of files in Clay.

It's a contrived example, so I'm not sure if this is actually something to worry about.

cgyarvin commented 6 years ago

Ted that’s exactly right.

Sent from my iPhone

On Dec 18, 2017, at 4:15 PM, Ted Blackman notifications@github.com wrote:

Should we separate the notification of file change to a new event? That way the sync would still succeed; it would just cause the next event to fail, without leaving Clay in a weird state. Conceptually, I'm not sure a filesystem merge should fail just because the result caused an error in a different part of the system.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

cgyarvin commented 6 years ago

The inserted event is a transaction. It can fail. If it fails, the C layer must generate a fail event. So we at least know %gall is out of sync.

Sent from my iPhone

On Dec 18, 2017, at 4:37 PM, Ted Blackman notifications@github.com wrote:

Ok, let's see if I understand you correctly:

In an older version of the system, if you tried to merge in app source code that failed to compile, the merge would failed. That was changed, and now if you merge in app source code that fails to compile, the merge still succeeds. In keeping with this more permissive attitude, it would be consistent to also allow the parent->child syncing to succeed even if it causes userspace code to fail.

Did I get that right? If so, could you explain your reasoning behind wanting the merge from base to home to fail?

I also have a slight worry about potential race conditions if the callbacks for a filesystem sync could happen after some other events. What if those callbacks need to update some other part of the Arvo state? If some other event runs between the merge and those callbacks, then it could potentially read from both Clay and the other part of the system that was supposed to match, but instead read from them in an inconsistent state.

For example, let's say there was an app that kept as part of its state a tally of the number of files in a Clay desk. It's subscribed to notifications on this desk so it can rerun this tally whenever the desk changes. If a sync succeeds that changes the number of files in a desk, then before the app gets notified of the sync, some other event could run that accesses the app's tally and also accesses Clay. In that case, the tally wouldn't match the number of files in Clay.

It's a contrived example, so I'm not sure if this is actually something to worry about.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

ohAitch commented 6 years ago

yeah, and the actual correct behavior here would be to abort the transaction for the merge, bouncing all the way back to kiln which is trying to sync things, and getting printed there; but over the past few years this capability of the system has lapsed considerably, and the best remaining choice is permissiveness

On Mon, Dec 18, 2017 at 4:40 PM, cgyarvin notifications@github.com wrote:

The inserted event is a transaction. It can fail. If it fails, the C layer must generate a fail event. So we at least know %gall is out of sync.

Sent from my iPhone

On Dec 18, 2017, at 4:37 PM, Ted Blackman notifications@github.com wrote:

Ok, let's see if I understand you correctly:

In an older version of the system, if you tried to merge in app source code that failed to compile, the merge would failed. That was changed, and now if you merge in app source code that fails to compile, the merge still succeeds. In keeping with this more permissive attitude, it would be consistent to also allow the parent->child syncing to succeed even if it causes userspace code to fail.

Did I get that right? If so, could you explain your reasoning behind wanting the merge from base to home to fail?

I also have a slight worry about potential race conditions if the callbacks for a filesystem sync could happen after some other events. What if those callbacks need to update some other part of the Arvo state? If some other event runs between the merge and those callbacks, then it could potentially read from both Clay and the other part of the system that was supposed to match, but instead read from them in an inconsistent state.

For example, let's say there was an app that kept as part of its state a tally of the number of files in a Clay desk. It's subscribed to notifications on this desk so it can rerun this tally whenever the desk changes. If a sync succeeds that changes the number of files in a desk, then before the app gets notified of the sync, some other event could run that accesses the app's tally and also accesses Clay. In that case, the tally wouldn't match the number of files in Clay.

It's a contrived example, so I'm not sure if this is actually something to worry about.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/501#issuecomment-352603178, or mute the thread https://github.com/notifications/unsubscribe-auth/ABxXhkCo1o0wEXoewmgJeIOOp5gTR9ncks5tBwYEgaJpZM4RBded .

belisarius222 commented 6 years ago

My worry is about the case where the sync succeeds, but another (previously scheduled) event gets run before the event that runs the notifications of the Clay change. At that point, the Clay change itself has been persisted, but not the effects of the Clay change. That wouldn't necessarily cause a nock error in the intruding event; it might just do something subtly wrong because it expected the Gall state to be up to date with the Clay state.

Arguably, ensuring all callbacks have been run at any time for a given Clay revision is merely a guarantee the system doesn't provide, so we could mention it in the docs as a gotcha. Unless that would break something ...

Anton says because of this potentially out-of-sync state, we should consider the callbacks to be part of the merge transaction; if they fail, the merge fails. This is a coherent argument, but I'm not so sure it's the best approach in practice, because it's possible for a sync to cause an error that's only tenuously related to the initial file merge. It could be thought of as a matter of priority: if we have to pick either the app updating successfully, or the merge succeeding, which do we choose? In the case of stars, it's definitely annoying that some pesky userspace code could clog up network-wide updates.

The reason to have transactional semantics is to avoid inconsistent states. For the past several days, @Fang- and I have been struggling to push updates to a network that's in an inconsistent state -- it's not about to corrupt data, but different machines are running different code, even though ~zod thinks its syncing job is done. The network is stuck mid-sync, in such a way that ~zod is now incapable of saving its children by pushing down new code. Since we use filesystem syncing to rescue children, this suggests to me that the success of filesystem syncs should take precedence over app code running successfully, or that we need an extra layer of transaction around syncing so that ~zod's push to its %kids desk fails when that tries to sync to another machine. The latter feels brittle to me, though, so I think we'd be better off limiting the filesystem syncing transaction so that userspace errors don't stop it halfway.

If there's a better way to think about this, I'm open to ideas.

cgyarvin commented 6 years ago

I agree that the sync should work no matter what. If a sync that pushes code pushes something that breaks, the old code should just keep running, NBD.

That means we do have to treat a “build broken” state as normal. Just try again on the next desk change.

Sent from my iPhone

On Dec 18, 2017, at 5:15 PM, Ted Blackman notifications@github.com wrote:

My worry is about the case where the sync succeeds, but another (previously scheduled) event gets run before the event that runs the notifications of the Clay change. At that point, the Clay change itself has been persisted, but not the effects of the Clay change. That wouldn't necessarily cause a nock error in the intruding event; it might just do something subtly wrong because it expected the Gall state to be up to date with the Clay state.

Arguably, ensuring all callbacks have been run at any time for a given Clay revision is merely a guarantee the system doesn't provide, so we could mention it in the docs as a gotcha. Unless that would break something ...

Anton says because of this potentially out-of-sync state, we should consider the callbacks to be part of the merge transaction; if they fail, the merge fails. This is a coherent argument, but I'm not so sure it's the best approach in practice, because it's possible for a sync to cause an error that's only tenuously related to the initial file merge. It could be thought of as a matter of priority: if we have to pick either the app updating successfully, or the merge succeeding, which do we choose? In the case of stars, it's definitely annoying that some pesky userspace code could clog up network-wide updates.

The reason to have transactional semantics is to avoid inconsistent states. For the past several days, @Fang- and I have been struggling to push updates to a network that's in an inconsistent state -- it's not about to corrupt data, but different machines are running different code, even though ~zod thinks its syncing job is done. The network is stuck mid-sync, in such a way that ~zod is now incapable of saving its children by pushing down new code. Since we use filesystem syncing to rescue children, this suggests to me that the success of filesystem syncs should take precedence over app code running successfully, or that we need an extra layer of transaction around syncing so that ~zod's push to its %kids desk fails when that tries to sync to another machine. The latter feels brittle to me, though, so I think we'd be better off limiting the filesystem syncing transaction so that userspace errors don't stop it halfway.

If there's a better way to think about this, I'm open to ideas.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

Fang- commented 6 years ago

I'm going to be picking this up. Just discussed with @belisarius222 to make sure I understand everything correctly. For the record, we settled on just making clay set a behn timer for itself 0 seconds in the future whenever it wants to send out file-change notifications, and then sending those notifications from the corresponding ++wake instead.

The file tally example @belisarius222 mentioned should probably be documented as a gotcha.

Fang- commented 6 years ago

See #611 for further discussion of related issue.