Make basis and original hash references strong

arv commented 4 years ago

Fixes #105

aboodman commented 4 years ago

This brings up the next point that of course we don't want all commits to strongly reference their basis. Because we want to actually delete history. But the local pending ones do need to ... because we need it during sync.

@phritz what should the actual policy be?

aboodman commented 4 years ago

Is it a true statement that for sync, we only need pending commits to strongly retain their basis?

arv commented 4 years ago

That seems right to me but I would like to get that confirmed.

phritz commented 4 years ago

Is it a true statement that for sync, we only need pending commits to strongly retain their basis?

I think it's true to say that we need:

all the local commits on the main chain to strongly retain their basis (this will include the base snapshot when pending commits exist)
all heads to strongly retain the thingy they point to (this will include the base snapshot when pending commits do not exist)
all local commits on the sync chain to strongly retain their basis (so we retain the sync chain)

Said another way, we need to retain all commits from the main and sync heads to the base snapshot.

We have to decide what to do about two cases:

Replayed commits: when we switch the main head to the sync head to complete sync, replayed commits will have a reference to local commits via original that no longer retained by the above rules. These original commits were previously on the main chain but no head is pointing to them anymore.
The basis of the base snapshot: the base snapshot is retained under the above rules but when we finalize sync we switch the main head to point to the sync head. The sync snapshot becomes the base snapshot and we need to decide what happens to the previous sync snapshot, the new base snapshot's (old sync snapshot's) basis?

phritz commented 4 years ago

I guess the question is how aggressively we want to prune old commits, specifically precious base snapshots we have synced on top of and old replay commits that we have replayed on a new snapshot. My sense is that we should not keep these old commits around.

aboodman commented 4 years ago

On Mon, Sep 28, 2020 at 12:07 PM Phritz notifications@github.com wrote:

I guess the question is how aggressively we want to prune old commits, specifically precious base snapshots we have synced on top of and old replay commits that we have replayed on a new snapshot. My sense is that we should not keep these old commits around.

I get the sense it will be easier to reason about if we remove everything that isn't strictly necessary. If you agree, let's do that.

phritz commented 4 years ago

If we remove everything that's not strictly necessary then we have to rewrite the sync chain in order to finalize sync. We need to update the sync snapshot to remove its basis and then each replayed commit on top has to have its basis pointed to its new parent. (If we make original strong, we also have to reset original in each of the replayed commits as we go. If we don't make it strong, it can just dangle. But preference for making it strong and cleaning it up.)

arv commented 4 years ago

The goal to remove everything that is not strictly necessary seems like the right thing to do. We do not want to use up more space than we need.

arv commented 4 years ago

We talked about this on VC and I did some more researching of the existing code.

Goals:

Only keep required data
Minimize fixups/rewrites

original_hash is currently read to compare the name so we need to keep it alive as long as we have the local commit. When the local commit goes away (replaced by a snapshot commit) the original local commit will also go away.

We have "GC roots" for head and the sync_head.

The only thing keeping a snapshot commit alive is a local commit, the head or the sync_head.

Things that keep a local commit alive is head, sync_head or it being the basis or original of another local commit.

Once we are fully synced there will be a single snapshot commit, referenced by head. Everything else can be collected.

This has no fixups. Things will go away as needed and once there is only a snapshot commit it will have a dangling reference to the basis which has now been collected.

phritz commented 4 years ago

Not sure what "no fixups" means, we need do to address the two things that I pointed out:

the sync snapshot currently has the base snapshot as its basis, so we either need to not set the sync snapshot basis or remove it when when finalizing sync and update the replayed commit bases (because removing a commit's basis changes its hash, and therefor its children need to be updated all the way to the head). i have some preference for keeping the sync snapshot basis as the base snapshot so that we can check invariants between the two just before we finalize sync, but we could do that in beginsync instead if we wanted -- just less of a tight guarantee.
replayed local commits on the sync chain have (presumably strong) original_hash pointers to pending commits. When sync is finalized we want the originals to go away. In order to make that happen we have to reset the original_hash of each replayed commit to not point to them. This necessarily changes the replayed commits' hashes and so we have to rewrite them.

phritz commented 4 years ago

Possibly one bit of missing info is that when we are done syncing head does not always point to the new base snapshot. Sometimes there are also local replayed commits on top of it. When the sync head becomes the new main head it can point to one of these local commits, which are still pending.

aboodman commented 4 years ago

I think I am in favor of making snapshot basis pointers always weak for simplicity because it avoids “fixing up” the commit chain after sync.

However I do see it weakens some invariants. Sync has to rely on the fact that the main head is keeping sync snapshots basis alive. I think this is correct trade off but fritz is closest to this code so his preference should override mine.

I think the only thing I’m firm on is that we shouldn’t ever retain any data not needed for sync.

phritz commented 4 years ago

Plan LGTM. Let me know when to review this PR.

arv commented 4 years ago

Maybe I'm missing something but the concept of weak vs strong is just an item in the refs entry. It does not change the chunk data. It is OK to have a string, with a hash that used to point to a chunk that is now removed.

We can therefore change original from strong to weak as we need without changing the data or the hash and we do not to rewrite (fixup) any chunks

Not sure what "no fixups" means

I mean that we do not want to rewrite chunks unless we have to.

Possibly one bit of missing info is that when we are done syncing head does not always point to the new base snapshot. Sometimes there are also local replayed commits on top of it. When the sync head becomes the new main head it can point to one of these local commits, which are still pending.

That is fine; the local commits keep the snapshot commit alive.

arv commented 4 years ago

PTAL

Maybe the enum Ref is a bit of overkill? I learned a lot by trying that approach though.

phritz commented 4 years ago

Maybe I'm missing something but the concept of weak vs strong is just an item in the refs entry. It does not change the chunk data. It is OK to have a string, with a hash that used to point to a chunk that is now removed.

Right. IF we wanted to remove the ref for cleanliness to not have dangling pointers THEN it would change the hash. But if we are comfortable with them dangling then no worries.

rocicorp / repc

Make basis and original hash references strong #202