radicle-dev / radicle-link

The second iteration of the Radicle code collaboration protocol.

git: ignore error due to concurrent sigrefs update #777

Closed kim closed 2 years ago

kim commented 2 years ago

When two Refs::update calls race, only one of them can succeed, because the ref update only succeeds if the expected parent commit is what the ref currently points to (ref updates are atomic). While there is a chance that the winning update commits a slightly outdated view of the refs tree, this is tolerable: every modification of the namespace triggers a sigrefs update, and so the state converges.

Not so great, however, is propagating the commit failure due to a modified parent as an error, as this may interrupt the replication flow, leaving the namespace in a potentially half-arsed state.

Therefore, teach Refs::update to ignore this kind of error, and return the most recent committed state instead.
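The compare-and-swap semantics described above, and the proposed fix, can be modelled in a few lines of plain Rust. This is a minimal sketch: the types and names (`Ref`, `cas_update`, `update_ignoring_race`) are illustrative stand-ins, not radicle-link's actual API.

```rust
// A stand-in for a ref: `head` plays the role of the commit oid.
#[derive(Debug, Clone, PartialEq)]
struct Ref {
    head: u64,
}

#[derive(Debug, PartialEq)]
enum UpdateError {
    // The ref no longer points at the expected parent: a concurrent
    // update won the race. `actual` is what the ref points to now.
    ParentMoved { actual: u64 },
}

// Atomic compare-and-swap: the update only succeeds if the ref still
// points at `expected_parent`.
fn cas_update(r: &mut Ref, expected_parent: u64, new_head: u64) -> Result<u64, UpdateError> {
    if r.head != expected_parent {
        return Err(UpdateError::ParentMoved { actual: r.head });
    }
    r.head = new_head;
    Ok(new_head)
}

// The fix sketched by this PR: instead of propagating the
// concurrent-modification error, return the most recently committed
// state, since a later sigrefs update will converge anyway.
fn update_ignoring_race(r: &mut Ref, expected_parent: u64, new_head: u64) -> u64 {
    match cas_update(r, expected_parent, new_head) {
        Ok(head) => head,
        Err(UpdateError::ParentMoved { actual }) => actual,
    }
}
```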

Fixes #761 Signed-off-by: Kim Altintop kim@eagain.st

alexjg commented 2 years ago

When you say the "state converges" on every update, do you mean the next time the user does something which updates refs, like committing or cloning, or whenever some network action causes replicate_signed_refs to be called?

kim commented 2 years ago

Yes. I think it’s even that a no-op push would update, because the remote helper cannot currently see if any tips were updated. It would probably be useful to have a UI which shows you what the current signed refs are, because that’s what’s considered “published”, not the actual refs.

alexjg commented 2 years ago

Yeah that's probably the best we can do.

Or we could just have a subroutine that calls Refs::update every 30 seconds 😈

kim commented 2 years ago

Under fair^H intended use this should not be necessary: pushing to the monorepo is interactive (and single-user normally), so you get a warning printed to your terminal if the signed refs update failed. A fetch (unattended) is supposed to be exclusive per namespace, so that cannot race either.
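The "exclusive per namespace" property for unattended fetches could be enforced with a small in-flight guard. This is a sketch under assumed semantics, not radicle-link's implementation; `FetchGuard` and its methods are invented names (the behaviour matches the later observation that a competing fetch is rejected as already-in-flight):

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Tracks which namespaces (by URN) currently have a fetch in flight,
// so that concurrent fetches for the same namespace cannot race.
struct FetchGuard {
    in_flight: Mutex<HashSet<String>>,
}

impl FetchGuard {
    fn new() -> Self {
        Self {
            in_flight: Mutex::new(HashSet::new()),
        }
    }

    // Returns false if a fetch for `urn` is already in flight,
    // i.e. the caller should reject the new fetch.
    fn try_begin(&self, urn: &str) -> bool {
        self.in_flight.lock().unwrap().insert(urn.to_string())
    }

    // Releases the namespace once the fetch has completed.
    fn end(&self, urn: &str) {
        self.in_flight.lock().unwrap().remove(urn);
    }
}
```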

kim commented 2 years ago

Btw can you see how this whole thread shows how the categories “issue”, “pull request”, “discussion” are useless? I propose the terminology “project management theatre”.

geigerzaehler commented 2 years ago

> Under fair^H intended use this should not be necessary: pushing to the monorepo is interactive (and single-user normally), so you get a warning printed to your terminal if the signed refs update failed. A fetch (unattended) is supposed to be exclusive per namespace, so that cannot race either.

In our test case the error originates from the daemon doing a fetch, not from git push. As you pointed out, the race is probably caused by a concurrent git push, but we don't see the warning there. This means we end up in a situation where the daemon does not publish its latest state until something updates the refs. This may lead to confusing behavior for users and is also less than ideal for our test setups.

I understand that this is a stop-gap fix but I would prefer to raise the log level to a warning so that we know something’s amiss when the tests don’t work.

kim commented 2 years ago

> I understand that this is a stop-gap fix

No it's not. What made you think so?

geigerzaehler commented 2 years ago

> > I understand that this is a stop-gap fix
>
> No it's not. What made you think so?

Shouldn’t we try to achieve a situation where the signed refs represent the latest state without requiring something else to force an update? This change doesn’t accomplish that.

kim commented 2 years ago

@geigerzaehler I think you're not understanding https://github.com/radicle-dev/radicle-link/pull/777#issuecomment-907598021: the test situation is not the same as the usage situation. Whether or not the background replication fails to update the signed refs is irrelevant to the user -- the user is supposed to use git push. Like any other git push, this can fail because of concurrent modification; the only difference is that we cannot currently roll back the entire push. If upstream modifies the monorepo directly, that's your problem.

kim commented 2 years ago

> In our test case the error originates from the daemon that does a fetch and not from git push.

Which would mean that the push won.

> This means we do end up in a situation where the daemon does not publish its latest state until something updates the refs.

As reported in https://github.com/radicle-dev/radicle-link/issues/761#issuecomment-908399619, our global lock did not actually have the desired effect, so it could very well be that this is due to competing fetches.

When there are competing updates, I am not aware of a way to guarantee that the most recent one eventually wins -- the background process can attempt to retry, but that's not the same as a guarantee. With the fix in place, the only competition is between the background fetcher and a user-initiated push. If the push wins, there should eventually be a new fetcher, because of gossip, rere, etc. If the push loses, the user will receive a warning, and is supposed to retry. If programmatic pushes in test code don't handle this, they need to be fixed.
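The "retry, but no guarantee" point can be made concrete with a bounded retry loop around a compare-and-swap commit. This is a hypothetical sketch, not code from the repository: `try_commit` stands in for the update attempt, returning `Err` with the head actually found when a concurrent writer won.

```rust
// Retries a compare-and-swap commit up to `max_attempts` times,
// re-reading the expected head from the Err of each failed attempt.
// Returns None if every attempt lost the race -- illustrating that
// retrying is best-effort, not a guarantee the latest writer wins.
fn update_with_retry<F>(initial_expected: u64, mut try_commit: F, max_attempts: usize) -> Option<u64>
where
    F: FnMut(u64) -> Result<u64, u64>, // Ok(new head) or Err(head actually found)
{
    let mut expected = initial_expected;
    for _ in 0..max_attempts {
        match try_commit(expected) {
            Ok(new_head) => return Some(new_head),
            Err(actual) => expected = actual, // re-read and try again
        }
    }
    None
}
```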

geigerzaehler commented 2 years ago

> If the push wins, there should eventually be a new fetcher, because of gossip, rere, etc. If the push loses, the user will receive a warning, and is supposed to retry.

I was missing that. Together with fixing the global lock this should be sufficient to ensure proper publication or inform the user a push failed.

kim commented 2 years ago

> If the push loses, the user will receive a warning, and is supposed to retry.

Wait no, we are not propagating the warning anymore with this change :facepalm:

We need to return an Either or something to be able to distinguish the possible outcomes.
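One way such a two-variant return type could look, so callers can tell whether their update was committed or superseded by a concurrent one. The names (`Applied`, `Committed`, `Superseded`) are invented here for illustration, not taken from the codebase:

```rust
// Distinguishes the two possible outcomes of an update while still
// handing back the refs actually on disk afterwards in both cases.
#[derive(Debug, PartialEq)]
enum Applied<T> {
    // This update's tree was committed.
    Committed(T),
    // A concurrent update won; `T` is the state it left behind.
    Superseded(T),
}

impl<T> Applied<T> {
    // The push path can warn (and ask the user to retry) when this
    // is false; the unattended fetch path can ignore it.
    fn was_committed(&self) -> bool {
        matches!(self, Applied::Committed(_))
    }

    // Either way, the most recently committed state is available.
    fn into_inner(self) -> T {
        match self {
            Applied::Committed(t) | Applied::Superseded(t) => t,
        }
    }
}
```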

kim commented 2 years ago

Unsure if CI now fails because of the mutex actually working :joy_cat:

kim commented 2 years ago

> if CI now fails because of the mutex actually working

Yes it did, because the test encoded an explicit rere, which was rejected as already-in-flight. I am leaving the assertion in place (even though it conflates test concerns), because it exercises that concurrency mechanism. I am also too lazy right now to come up with a dedicated test.

kim commented 2 years ago

> I am leaving the assertion in place

Well, scrap that. It is timing-dependent.