Sync one repo at a time

amberin commented 1 year ago

✅ I have searched for existing issues that may be the same as or related to mine.

I have started to use multiple Git repos. It works well, but syncing is pretty slow, especially when there are multiple changed notebooks.

I haven't looked at the logic for the other repo types, but I suspect Git syncing could be made a lot faster if we would group notebooks by repo before syncing them. This would speed up Git syncing even when using only a single repo.

Today, sync means looping over all notebooks and syncing them one by one, in an order unrelated to their repo links. (I have moved git pushing out of this loop to save some time with git repos. Unfortunately, we still push once per changed notebook when there are only local changes, due to SyncRepo.storeBook being called both in this scenario and during force-saving. But that is a separate problem.)

Syncing notebooks grouped by repo would allow a much faster workflow for Git. We would probably not need to loop over each notebook, as Git already knows what has changed. And when there are changes, they could all go into a single commit. Having full control of the sync process of each repo would allow us to be more economical with fetch/push and possibly other time-consuming operations.

Looping over notebooks would obviously still be possible for all repo types; it would just be done one repo at a time.

Can anyone see challenges with this approach?

colonelpanic8 commented 1 year ago

Yeah, I mean in general, I think that the git sync flow is just really different than all the other types of sync flows.

amberin commented 1 year ago

I suppose what I am talking about is quite a major change, since it would mean that repos, not notebooks, becomes the main sync object. Perhaps it would require an unreasonable amount of re-writes. But it should be worth at least exploring, to make an assessment.

amberin commented 1 year ago

Yeah, I mean in general, I think that the git sync flow is just really different than all the other types of sync flows.

You are right.

Idea: Sync notebooks one-by-one same as now, but skipping those linked to Git repos. After that loop, add a separate method which loops over Git repos, syncing them with a different workflow. This way, we should only need to rewrite Git-related code.

Notebooks without links also need to be handled, of course.

colonelpanic8 commented 1 year ago

Isn't there already a separate interface/method for two way sync, which, at the moment, is only implemented by git sync anyway?

amberin commented 1 year ago

Yes. Although I'm not sure about the usefulness of that category. Is it evident that all "two way sync" repos should be synced per-repo, instead of per-notebook? Probably not. As you said, Git is just different from all the other types (so far).

colonelpanic8 commented 1 year ago

Yes. Although I'm not sure about the usefulness of that category. Is it evident that all "two way sync" repos should be synced per-repo, instead of per-notebook? Probably not. As you said, Git is just different from all the other types (so far).

Sure, my point is that at this point that interface was created solely to support the git use case. If our requirements are slightly different, I think its fine to simply change the interface as we need be, before there are other consumers.

Or perhaps create an entirely new one. The point is that I think that we agree that we should think about what we need in a syncing interface from first principles without thinking about how orgzly currently does things.

amberin commented 9 months ago

I have experimented with this a bit. I now have a branch with an IntegrallySyncedRepo interface, which allows syncing a repo as a whole. So far, I have achieved the following:

only one fetch per sync occasion
only one push per sync occasion
only one temp branch per sync occasion
If using SSH transport, the connection is held open after fetching until after (possible) push (using AutoCloseable).
Only create temp branch if there are remote changes; otherwise just commit everything on the current branch.
If there are remote changes, use git diff to decide which notebooks have changed and thus need to be reloaded.

Syncing is significantly faster, especially when there are remote changes or changes in multiple notebooks. I have learned that fetch and push actions are what takes the most time, which is why I implemented AutoCloseable for the SSH transport. Fetch now typically takes me around 2 seconds (of which most is setting up the SSH session), but the subsequent push typically takes less than 0,5 seconds.

I still need to solve the following (but don't see any difficulties):

force load/save currently does not perform fetch/push, since it uses the same utility methods for loading and saving local books as the sync logic. (Planning to add dedicated methods in the SyncRepo interface to allow different behavior here.)
I still make one commit per changed file; I believe there should only be one commit per sync occasion.
Should force load always load the book from the main branch, never from a temp branch? Not sure, could be confusing UX. But the status would show which branch was used.

chaoflow commented 5 months ago

@amberin This sounds great and I agree there should be only one commit for each sync. Is the added complexity of reloading only specific notebooks worth it or should we just reload all?

To further simplify, what do you think about the following?

A. Export and commit

Export all notebooks
Per repository commit everything on local branch - now everything is in git.
Try to push, no force, even if no changes (needed to recover from conflict, see below)
- SUCCESS: done
- FAILURE: continue with B

B. Rebase and reload

Fetch remote branch
Rebase local onto remote
- SUCCESS: try to push, no force
  - SUCCESS: reload all notebooks within repository
  - FAILURE: restart with B.1
- FAILURE:
  - force push local branch to configurable remote conflict branch
  - notify in orgzly-UI: user resolves in git, updates remote branch, and syncs again
  - done

This workflow should succeed, except if the remote branch gets rewritten. In that case we need a force-load that is manually triggered in orgzly:

C. Force load

Fetch remote branch
Reset local branch to remote branch
Reload all notebooks from local branch, discarding any local changes

amberin commented 5 months ago

@chaoflow Many thanks for your input! Just a few quick thoughts:

I like your general approach and I originally wanted to do more or less the same thing. I decided against it, but unfortunately I don't remember exactly why. It may have been ignorance preventing me from solving certain problems. I think I will take another look at it, as I like the idea of "commit and push" as the first step. At least for my habits, it would often suffice.
I also thought that "export, commit and reload all notebooks" would be the most simple solution, but I concluded that quite some looping is still required. We still need to set a sync status/action for each notebook, and for that we need to figure out what's happened to them. It then becomes pointless to perform actions on books which we know have not changed. But I would revisit this if we do implement the kind of "push first" workflow that you're describing.
I would not want to lose the ability to stay on a conflict branch and sync more changes from Orgzly, automatically returning to the main branch when it becomes possible. This requires attempting a merge with the main branch during every sync attempt if on a conflict branch. But it should be easy to add, and it doesn't need to be super efficient, since it's a more unusual scenario.

chaoflow commented 5 months ago

@amberin I think the approach I sketched achieves what you describe in your third point: in the case of a conflict, the local branch automatically turns into a conflict branch, without the need of having an extra branch for that locally, and for every sync a rebase is attempted.

The goals of the sketched approach are:

Push changes to remote main branch, if possible.
Push changes to remote conflict branch, otherwise.
Automatically recover, once user has resolved the conflict, i.e. rebasing to remote main is possible again.
Allow force-loading to discard any local state -- for force-saving there should be no need.
Produce linear history, i.e. do not enforce merges on the user.
Enable user to use merges to resolve a conflict or any other strategy.

An option is to reset the local main branch to the remote main branch before exporting. In case of conflict, orgzly would keep rewriting and force pushing its commit to the remote conflict branch, which might simplify recovery.

orgzly-revived / orgzly-android-revived

Sync one repo at a time #1