tensorwerk / hangar-py

Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era.
Apache License 2.0

[FEATURE REQUEST] Ability to delete data from the history #80

Open hhsecond opened 5 years ago

hhsecond commented 5 years ago

Is your feature request related to a problem? Please describe. Currently, we can't delete data from history once it is committed. In cases where the end user triggers a data deletion request, organizations that use Hangar for data storage and versioning will have to delete it from everywhere, including history.

rlizzo commented 5 years ago

@lantiga, Can you comment on how regulations outlined in the GDPR may drive the need for such a feature?

TLDR: This isn't possible with the current architecture, it will likely never be fully enforceable, and we may not want to enable such a feature in the first place.

So there are a few issues that may not make this possible without some serious thought here:

1) Hangar is distributed. Even if we had the ability to delete data from Hangar (which we can't) and push the change to the central repository server, there's no way to force the changes to be fetched on a client. They could retain that data forever if they never pulled from the repo!

2) A Merkle-DAG-like data structure is used to generate verifiable commit history. The content is intrinsically tied to a branch's HEAD commit hash. Should some piece of data be fully deleted from some commit in a repository's history, we would never be able to successfully generate a Merkle proof. There would be no way to sync historical state, and every downstream commit's parent records would desync to non-valid pointers if such changes were propagated.

3) We use a content-addressable file store to deduplicate data. If we wanted to delete a sample which shared a hash with some other sample anywhere else in the repository's history, deleting the actual data would corrupt every other sample that points to the same data on disk. Though the current implementation leaves much room for optimization, scanning the history to determine the dependents of a single sample is a rather expensive operation which we want to avoid performing when possible. (A sketch of this deduplication follows this list.)

4) Data stored in Hangar is supposed to be immutable in every respect. Short of legal requirements, I'm struggling to find a use case which would require the ability to irretrievably delete data. Can you elaborate @hhsecond?
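To make point 3 concrete, here is a minimal sketch of how a content-addressable store deduplicates samples and why deletion requires a dependency scan (names and structure are illustrative, not Hangar's actual internals):

```python
import hashlib

import numpy as np


class ContentAddressableStore:
    """Toy content-addressable store: data is keyed by its digest, so two
    samples with identical contents share a single record on disk."""

    def __init__(self):
        self._data = {}      # digest -> raw bytes (stands in for the on-disk store)
        self._samples = {}   # (dataset, sample_name) -> digest

    def add(self, dataset, sample_name, arr: np.ndarray) -> str:
        digest = hashlib.blake2b(arr.tobytes(), digest_size=20).hexdigest()
        if digest not in self._data:
            # Deduplication: bytes are written only the first time a digest is seen.
            self._data[digest] = arr.tobytes()
        self._samples[(dataset, sample_name)] = digest
        return digest

    def dependents(self, digest):
        """Every sample, anywhere in the store, referencing this digest.
        Deleting the bytes is only safe when exactly one reference remains."""
        return [key for key, d in self._samples.items() if d == digest]
```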

Points 1 and 2 are by far the biggest holdups here. I really can't see a path forward here without a really insane amount of effort (if it's reasonably possible at all). Thoughts @lantiga or @elistevens?

elistevens commented 5 years ago

I think that the legal angle is going to become increasingly important. On one side you have GDPR, DMCA, etc., and on the other is the risk of accidentally vacuuming up something unsavory from the web. It doesn't really matter if some client somewhere still has the content; what matters is that the user who has been notified that they have some content to remove can comply without burning down their entire data store.

I think that the important part is that there is a path forward, not so much that it's clean and easy and transparent.

I think it's fine to flag a number of content hashes as "deleted due to legal" or w/e, and if that ends up breaking other, non-problematic samples, well, that's the cost. Keep the reference to the hash, but nuke the content.

Another option would be to select certain branch-heads and tags, and say "recreate the repo, but only with these branches/tags available." You'd lose a ton of history, but that's kind of the point (who knows how long the offending content has been around). To be clear, I mean only the heads of the branches would be present; no historical commits. Of course, only the human-readable labels would remain the same; all of the underlying hashes would be new.

rlizzo commented 5 years ago

Intro

I can see how the legal aspects alone make this almost a certain requirement in the future, but it's going to be quite a challenge to implement. We may want to talk to some experts to understand what the regulations actually require before spending what will likely be a significant chunk of development time on this.

It's actually a rather interesting problem to think about; it's definitely not impossible, but it'll need to be implemented with care. The biggest challenge is to keep user-facing impact to an absolute minimum; ideally we wouldn't even have to implement a new top-level "porcelain" command to enable this to work.

Problem

The biggest challenge in my mind is related to my point #2 above:

2) A Merkle-DAG-like data structure is used to generate verifiable commit history. The content is intrinsically tied to a branch's HEAD commit hash. Should some piece of data be fully deleted from some commit in a repository's history, we would never be able to successfully generate a Merkle proof. There would be no way to sync historical state, and every downstream commit's parent records would desync to non-valid pointers if such changes were propagated.

Rewriting history in this way is almost entirely equivalent to performing a git rebase on a public (i.e. master) branch of a repository. The fallout for a Hangar repo would be essentially the same as on a large Git project.

A few of the implications/scenarios we want to avoid:

I can see many scenarios where getting this wrong significantly handicaps the project in the long term. In my current view, losing history is the worst possible thing that we could do (opening up many more problems than we would solve).

Idea?

I'm thinking something slightly different: keeping more history.

I think our problems would be solved if we had a system which not only keeps a record of what events occurred and how we, as observers at this moment in time, believe they relate to each other historically, but which also keeps a (nearly) perfect recounting of how all of our long-lost ancestors actually experienced those same events at their own points in time.

Think of a repository's history as events written in a diary passed down through generations of a family. Sure, things would probably look very different from our ancestors' point of view, and sure, some details or precious artifacts may have been lost over the centuries, but by keeping a log which survives the march of time (or an attempt to rewrite some of our history, maybe?) we would have a record which would allow us to guide someone at any point in time to the state in which it exists now.

In more technical terms

history:

 commit graph     contents

    a ------> s1 | s2 |    |
    |\           |    |    |
    b-|-----> s1 | s2 | s3 |
    | |          |    |    |
    | d ----> s1 | s2 |    | s4
    | |          |    |    |
    c-|-----> s1 |    | s3 |
    |/           |    |    |
    e ------> s1 |    | s3 | s4

Say we then had to go back to commit "b" and remove the data sample "s3" we added (removal marked by "**"). We remove the data and modify only the "s3" field, preserving this version of history exactly. In order to generate a new working history which is verifiable, we can use some binary XOR maths to recalculate what the hashes of the commits affected by this change ("b", "c", "e") would have been if that sample hash had never existed in the first place.

    old history graph
------------------------------
commit graph     contents
   a ------> s1 | s2 |    |
   |\           |    |    |
   b-|-----> s1 | s2 | ** |
   | |          |    |    |
   | d ----> s1 | s2 |    | s4
   | |          |    |    |
   c-|-----> s1 |    | ** |
   |/           |    |    |
   e ------> s1 |    | ** | s4

 digest map
------------
old -->  new
------------
 a  -->  a
 b  -->  bb
 c  -->  cc
 d  -->  d
 e  -->  ee

    new history graph
------------------------------
commit graph     contents
   a -------> s1 | s2 |    |
   |\            |    |    |
   | \           |    |    |
   bb-|-----> s1 | s2 | ** |
   |  |          |    |    |
   |  d ----> s1 | s2 |    | s4
   |  |          |    |    |
   cc-|-----> s1 |    | ** |
   | /           |    |    |
   |/            |    |    |
   ee ------> s1 |    | ** | s4
   ⋮
   ⋮
[NEW HISTORY APPENDED HERE]

Both records of history are kept, so even though the data ("s3", now `""`) is no longer accessible in either view, the integrity of the new history can still be verified. A massive oversimplification of such a process would be to:

1) Calculate the digests of the samples which still exist in the new history (a standard verification proof).
2) Reverse the XOR operation with the digest of the sample whose underlying data was deleted (essentially calculating f_xor("bb") -> "b").
3) Proceed down the chain of historical views, verifying that each transformation results in the same digest which was recorded at that time.
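To illustrate the XOR trick with a toy scheme (this is a model of the idea, not Hangar's real hashing): if sample digests are folded into the commit digest with XOR, removing one sample digest is itself reversible, so a verifier who holds only the removed sample's digest (never its data) can map the new commit digest back to the old one:

```python
import hashlib


def sample_digest(data: bytes) -> int:
    return int.from_bytes(hashlib.blake2b(data, digest_size=20).digest(), "big")


def commit_digest(parent_digest, sample_digests):
    # Toy commit digest: the parent digest XOR'd with every sample digest.
    acc = parent_digest
    for d in sample_digests:
        acc ^= d
    return acc


# Original commit "b" contained samples s1, s2, s3.
s1, s2, s3 = (sample_digest(x) for x in (b"s1-data", b"s2-data", b"s3-data"))
parent_a = sample_digest(b"commit-a")
b_old = commit_digest(parent_a, [s1, s2, s3])

# After deleting s3's data, "bb" is recomputed as though s3 never existed...
b_new = commit_digest(parent_a, [s1, s2])

# ...and a verifier who knows only s3's digest (kept in the digest map,
# without the underlying data) can confirm the two histories line up.
assert b_new ^ s3 == b_old
```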

This would solve essentially every problem a Git style rebase would introduce for us:

The only caveat is that during a "jump" into a new history, we do not allow squash operations. It's important to have a mapping of each old commit to its new state in the updated history.

I'm ignoring a whole ton of complexity which surrounds this entire venture, but I'm interested in your thoughts @lantiga and @elistevens.

elistevens commented 5 years ago

I think that you're trying to solve a much larger problem than you actually need to. I don't know enough about the internal implementation to propose a concrete solution, but I think that this should be close enough to inspire something that actually works.

Let's say you've got 4 tensors (Samples, right now). They contain the following:

W&P has the following chunk SHAs: 111 222 333

HP has: aaa bbb ccc

Combined WP-HP: 111 222 399 b99 c99

Combined HP-WP: aaa bbb cff 2ff 3ff

The combined texts have changed SHAs for the second work because the chunks have different split boundaries.

You get a legal nastygram saying that HP, WP-HP, and HP-WP need to be removed. That translates into the SHAs:

aaa bbb ccc 111 222 399 b99 c99 cff 2ff 3ff

That's every chunk present except 333. Note that the 111 and 222 chunks are false positives. How those get handled is, IMO, not super relevant to this issue (you'd need more robust tools to examine samples that have chunk overlap with problematic chunks, and aside from a chunk that consists of only 0.0s, I suspect that will be very rare in practice). Let's say that we don't do anything fancy, and just nuke them all.

At your chunk storage layer, you keep a list of all purged chunk SHAs. When a user tries to get a sample that has a purged SHA, they instead get an exception PurgedDataUnavailableException or w/e. When trying to re-add a SHA that's been purged (say, once someone notices that WP is missing, and verifies that it's fine), maybe there's a flag that you have to pass in to override the purge flag or something.
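A rough sketch of that storage-layer behaviour, with the exception name and override flag taken from the comment above (everything else is hypothetical):

```python
class PurgedDataUnavailableException(Exception):
    """Raised when a sample's content has been purged from the data store."""


class ChunkStore:
    def __init__(self):
        self._chunks = {}      # sha -> bytes
        self._purged = set()   # SHAs whose content has been nuked

    def purge(self, sha):
        self._chunks.pop(sha, None)   # the content is gone for good
        self._purged.add(sha)         # ...but the ghost of the SHA remains

    def get(self, sha):
        if sha in self._purged:
            raise PurgedDataUnavailableException(sha)
        return self._chunks[sha]

    def add(self, sha, data, override_purge=False):
        # Re-adding a previously purged SHA requires an explicit override.
        if sha in self._purged and not override_purge:
            raise PurgedDataUnavailableException(
                f"{sha} was purged; pass override_purge=True to re-add it")
        self._purged.discard(sha)
        self._chunks[sha] = data
```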

It might be nice to have the purge also make a commit that removes the samples from some list of branches, so that going forward users won't be getting exceptions when they iterate over the data. Note that this is the only part of any of this that touches history or the DAG. Nothing existing changes, except that the purged data is now kinda in this phantom state; the ghost remains (SHA, size, etc.) but the actual body is gone.

rlizzo commented 5 years ago

@elistevens, I should actually clarify a few important points about the internals:

1) the full content of each sample is used to generate the hash (there is no internal chunking or splitting of data/digests at this level)

2) if a sample has the same digest as some other tensor (sample) added to the repository (at any point throughout history), the data is not actually written to disk again; we just reference that digest for lookup later. Therefore it is possible (and common) for multiple samples in the same dataset, or for samples across disparate datasets, to refer to the same actual digest on disk. Retrieval info for the tensor data is stored as the value for a digest key.

As a result of point 1, the example you gave would actually result in the following situation:

W&P digest            : 111
HP digest             : 222
Combined WP-HP digest : 333
Combined HP-WP digest : 444

A request to remove the three HP related samples would not affect W&P in any way. As such I don't think this is a necessity at all?

At your chunk storage layer, you keep a list of all purged chunk SHAs. When a user tries to get a sample that has a purged SHA, they instead get an exception PurgedDataUnavailableException or w/e. When trying to re-add a SHA that's been purged (say, once someone notices that WP is missing, and verifies that it's fine), maybe there's a flag that you have to pass in to override the purge flag or something.

A much more difficult case arrives from point 2:

Say we have a very simple dataset containing samples named by some user_id, each holding a one-element array of user_age. Many of these values will overlap, and as such we won't replicate the data on disk.

user_id | age | digest
----------------------
user1   | 24  |   a
user2   | 56  |   b
user3   | 39  |   c
user4   | 24  |   a

If we received a request to remove the data for the account of user1, simply deleting digest a's data would also remove the data for user4. In order to ensure it is safe to actually delete data from disk, we have to check the entire history to ensure that no other sample refers to the same digest. (We actually have the required scan algorithms implemented in the core for use during merge conflict checking and client/server negotiations.)

To handle this case, we would instead have to overwrite user1's sample digest pointer with a flag indicating the data was removed and cannot be accessed (effectively deleting the key which is needed to actually locate where the data is on disk); this would leave digest a on disk so it can still be accessed via its other reference (user4). HOWEVER, if we received the same request for user2, we would mark the sample digest pointer in the same way, but since that digest is unique we can delete the data from disk without worry.
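A minimal sketch of that decision logic (illustrative names, not Hangar's API): the sample's pointer is always tombstoned, but the bytes only leave the disk when no other sample anywhere in history still references the digest:

```python
TOMBSTONE = "<data-removed>"


def purge_sample(sample_key, samples, data_store):
    """samples: sample key -> digest, across all of history.
    data_store: digest -> retrieval info / raw data on disk."""
    digest = samples[sample_key]

    # Always sever this sample's pointer to the data (user1 and user2 alike).
    samples[sample_key] = TOMBSTONE

    # Only delete from disk if no remaining sample references the digest.
    still_referenced = any(d == digest for d in samples.values())
    if not still_referenced:
        data_store.pop(digest, None)   # unique digest (user2): the bytes can go
    # shared digest (user1/user4): bytes stay on disk for the other reference
```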

This is the point where things break down

In order to generate the commit hash digest, we basically hash a concatenation of the parent commit's hash with every sample digest recorded in the commit. If we were to do this after marking sample digests as removed, a totally different digest would result. If there is only one authoritative version of history, it would be impossible to verify the repo's integrity (is the different hash a result of disk corruption, malicious use, or the removal operation?). At this point there's no way to avoid the problems I've mentioned before...
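For concreteness, a simplified model of that commit-digest computation (not the actual Hangar code) shows why swapping a sample digest for a "removed" flag shifts every downstream hash:

```python
import hashlib


def commit_hash(parent_hash, sample_digests):
    # Simplified: hash the parent commit's hash concatenated with every
    # sample digest recorded in the commit, in a canonical order.
    h = hashlib.blake2b(digest_size=20)
    h.update(parent_hash.encode())
    for d in sorted(sample_digests):
        h.update(d.encode())
    return h.hexdigest()


original = commit_hash("parent", ["a", "b", "c"])
after_removal = commit_hash("parent", ["a", "b", "<removed>"])
assert original != after_removal   # history no longer verifies against old records
```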

Since we can't keep the digest attached to the sample for legal reasons (someone could just modify the hangar code to ignore the flag variable and retrieve content they shouldn't be able to), nor can we just remove it from history entirely without causing huge issues, I can't see any way around the need for a way to deterministically map the transformation of one recollection of history to a new (verifiable) one.

In response

It might be nice to have the purge also make a commit that removes the samples from some list of branches, so that going forward users won't be getting exceptions when they iterate over the data.

This is a great Idea! I'll remember this when the time comes!

Note that this is the only part of any of this that touches history or the DAG. Nothing existing changes, except that the purged data is now kinda in this phantom state; the ghost remains (SHA, size, etc.) but the actual body is gone.

Hopefully I'm not missing something obvious here, but from my understanding the phantom state is the root cause of all the problems I've mentioned previously. Does that make sense?

elistevens commented 5 years ago

Separate out the "remove user1's data" use case, since that's not the same thing. Maybe one mechanism can solve both, maybe not. There, the problem isn't "24", it's the association of the tensors that make up user1.

I have been assuming that there's a separation between the objects that represent the state of the repo, and the data storage that contains the actual tensor bytes, similar to how git has commits, trees, and content.

I'm saying you can keep the commits and trees, and replace the purged object with a tombstone that says "sorry bro, this got purged. trust me, it totally had the hash 1234 when it existed." That happens outside of your commit/tree structure, and the data is truly gone. You'd have to also propagate tombstones when syncing, but that shouldn't be too difficult (I know, famous last words).

hhsecond commented 5 years ago

@elistevens The problem I see with replacing a data piece with a tombstone comes back to the same thing @rlizzo mentioned: what if two commits (data of two people) point to the same data piece in the storage layer? As in the above example, if user1 deletes his/her data and user4 doesn't, we need to keep the data piece intact but somehow remove the pointer from user1's hash to the actual data piece. Correct me if I have missed something from your suggestion?

But what if we just remove the retrieval info value for the digest key? So we have two cases

  1. If the data piece is not redundant -> we can then delete the data piece from the storage and replace the value field for the digest key with predefined InvalidReference.
  2. If the data piece is a duplicate -> We'll keep the data piece in the storage layer but replace the value field for the digest key with a predefined PurgedData so that the retrieval info is unknown for that particular hash.

For users who sync the data whenever they want, we could look at the value of the hash keys and mark it invalid or delete the actual data as needed. And whenever we find the purged-data value in the value field, we could raise PurgedDataUnavailableException. I can see how inefficient this could get with very large data collections, and it becomes slightly more complicated if somebody wants to add a data piece whose hash matches one in the history which has been purged/invalidated. But we could solve all of these with relatively easy fixes. @rlizzo, I am thinking aloud here. This being an obvious solution, I would assume you have gone through it already, but I'd love to hear your rationale for not choosing it if you have.
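A sketch of that two-case scheme as described (InvalidReference, PurgedData, and the mappings here are placeholders for illustration, not existing Hangar objects):

```python
INVALID_REFERENCE = "<InvalidReference>"   # unique data: bytes deleted from storage
PURGED_DATA = "<PurgedData>"               # duplicate data: bytes kept, retrieval info hidden


class PurgedDataUnavailableException(Exception):
    pass


def purge_digest(digest, digest_to_retrieval, storage, all_sample_digests):
    """digest_to_retrieval: digest -> retrieval info; storage: retrieval info -> bytes."""
    is_duplicate = sum(1 for d in all_sample_digests if d == digest) > 1
    if is_duplicate:
        digest_to_retrieval[digest] = PURGED_DATA        # case 2: keep bytes, hide lookup
    else:
        storage.pop(digest_to_retrieval[digest], None)   # case 1: delete the bytes
        digest_to_retrieval[digest] = INVALID_REFERENCE


def fetch(digest, digest_to_retrieval, storage):
    ref = digest_to_retrieval[digest]
    if ref in (PURGED_DATA, INVALID_REFERENCE):
        raise PurgedDataUnavailableException(digest)
    return storage[ref]
```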

elistevens commented 5 years ago

Again, that's a different use case, because "age: 24" isn't the same thing as the complete text of harry potter, and fixing it will require a different approach.

For the case where you need to delete (username: user1, fname: james, lname: smith, age: 24) it's the association that's the problem, not the data itself. I don't have any clever ideas on how to solve that without blowing up all of the problematic historical data. Until someone comes up with something better, the best I can think of is an easy way to remove a given key from all datasets on all branches (so lots of new commits, or one commit cherry-picked over and over) so that at least nobody can get at the data from a current checkout.

But I think that tombstones can solve the harry potter problem.

rlizzo commented 5 years ago

Again, that's a different use case, because "age: 24" isn't the same thing as the complete text of harry potter, and fixing it will require a different approach.

@elistevens I think this is the fundamental difference in assumptions that we are making which is leading us to different conclusions. From the perspective of Hangar, I don't think there should be a distinction between types of deleted data. It'll get very, very messy if that's the case. While the tombstone ("trust us, this existed and worked with this hash at one point in time") is fine in theory, it's not ideal. I'd rather leave it up to the user to know (in the HP case) what their specific legal requirements are for removing the data. If it's referenced by multiple Samples, then they should know that and mark them as deleted accordingly.

@hhsecond, the problem with:

But what if we just remove the retrieval info value for the digest key? So we have two cases

Leads exactly to the points which I've outlined in:

This is the point where things break down

In order to generate the commit hash digest, we basically hash a concatenation of the parent commit's hash with every sample digest recorded in the commit. If we were to do this after marking sample digests as removed, a totally different digest would result. If there is only one authoritative version of history, it would be impossible to verify the repo's integrity (is the different hash a result of disk corruption, malicious use, or the removal operation?). At this point there's no way to avoid the problems I've mentioned before...

Since we can't keep the digest attached to the sample for legal reasons (someone could just modify the hangar code to ignore the flag variable and retrieve content they shouldn't be able to), nor can we just remove it from history entirely without causing huge issues, I can't see any way around the need for a way to deterministically map the transformation of one recollection of history to a new (verifiable) one.

Does that make sense? or am I misreading the question?

elistevens commented 5 years ago

I think that what you're calling "deleted data" there is just "not available on HEAD of this branch" right? You could still check out an older commit and the "deleted" data would still be there, right?

My desire is to have a mechanism where the data is 100% unavailable. If you're shipping a repo that has a copyright violation, it doesn't really matter if you have to check out an old commit to get it or not.

rlizzo commented 5 years ago

Not quite. I'm saying that if you delete data in HEAD, the changes will propagate through history from the initial point that the data was added to the repo. That's why there's so many issues with history.

elistevens commented 5 years ago

Oh, yeah, rebasing everything from the moment the problematic data was added forward is going to be awful. The whole point of tombstones is to avoid that in the cases where it's only the data bytes that are problematic.

rlizzo commented 5 years ago

Yeah... considering there's not even a rebase function included in Hangar yet, I think we have a ways to go until we can implement this feature cleanly.

At the very least, we should take this thread and archive the discussion and potential solutions on our docs page for future reference. (though I'm still interested in @lantiga's opinion)

@elistevens, Would you be available for a con-call one of these days to brainstorm on potential options or workarounds?

lantiga commented 5 years ago

Hey! Late to the party it seems, but great discussion. Reading through the whole thread, a few comments come to mind.

For the time being, tombstones seem like the only viable solution to me. Also (it might be obvious, but it's useful to keep in mind), we are just storing tensors, not records like (username: user1, fname: james, lname: smith, age: 24).

The issue with privacy might arise in different spots, but all of them relate to individual samples:

We need to be able to purge all these from history, and each case has to deal with a different place.

The tensor: Tombstones + wiping the tensor out of storage are the way to go here. The info has to disappear no matter where the sample appears throughout history. An indirection mechanism where hashes appearing in the commit logs go through a tombstone mapping before landing on storage could be devised; I'm not sure what's more convenient.

The association: Similar to the above, but you wipe the association rather than the tensor. This reiterates the need for the hashes appearing in the commit log to be a) unique and b) associated with the actual content-addressed data hashes through an indirect mechanism (a rough sketch of one such indirection follows these points). Let's think about how this could be done, maybe using hashing magic.

The metadata: We will eventually have a dedicated LMDB store for metadata, which will have to hold metadata for a certain key throughout history, since metadata is at the key (i.e. sample name) level (assuming this is the correct thing to do). We would wipe the metadata, or the relevant part thereof, but this wouldn't have an effect on hashing, so it should be easier (unless we end up hashing it for some reason).

The sample name: AFAIK, the name does not take part in hashing. This is convenient for our purposes (we could retrospectively rename the sample), but I'm wondering whether the fact that naming does not take part in hashing is secure at all. We need to be able to verify the history of a named dataset, since this has implications for stability. If we do, then we can't rename a sample and we need to tombstone names.
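One possible shape for the indirection floated under "The association" above (purely speculative naming): commit logs record a unique per-sample identity hash, which resolves to the content-addressed storage hash through a mutable mapping that can be tombstoned without touching the commit log:

```python
class TombstonedAssociation(Exception):
    pass


class HashIndirection:
    """Maps the unique hash recorded in commit logs to the content-addressed
    storage hash; the association can be severed without rewriting history."""

    def __init__(self):
        self._identity_to_content = {}   # commit-log hash -> content hash
        self._tombstoned = set()

    def bind(self, identity_hash, content_hash):
        self._identity_to_content[identity_hash] = content_hash

    def tombstone(self, identity_hash):
        # Sever only the association; commit log and stored content are untouched.
        self._identity_to_content.pop(identity_hash, None)
        self._tombstoned.add(identity_hash)

    def resolve(self, identity_hash):
        if identity_hash in self._tombstoned:
            raise TombstonedAssociation(identity_hash)
        return self._identity_to_content[identity_hash]
```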

As a general consideration, we absolutely need to encode the recipe for the changes in some way, so that whoever cloned the dataset has a chance to keep working with it by applying the same changes: maybe just overwriting the metadata, possibly preserving local changes, or applying locally the same "recipe" that was applied remotely. In general I really liked the super-DAG idea that @rlizzo had; let's think about the implications.

One other sticky point is blocking pushes to a remote when data that has been tombstoned there hasn't yet been tombstoned locally. We need to avoid undoing the changes due to a stale local repo pushing data back.
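A sketch of that guard (a hypothetical server-side hook, not part of Hangar's actual push negotiation): the remote rejects a push that would re-introduce content for a digest it has already tombstoned:

```python
class StaleTombstonePushError(Exception):
    pass


def check_push(pushed_digests, remote_tombstones):
    """Run on the remote before accepting data from a (possibly stale) client."""
    reintroduced = set(pushed_digests) & set(remote_tombstones)
    if reintroduced:
        raise StaleTombstonePushError(
            f"push rejected: {sorted(reintroduced)} were purged on the remote; "
            "fetch the tombstone updates and retry")
```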

Great discussion, it would be ideal to have a minimal plan for implementation that we might or might not enact right away.