iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

Alternative hashings for data #1676

Open dunstantom opened 5 years ago

dunstantom commented 5 years ago

Since md5 is sensitive to the order and format of the data, simple changes to the schema (e.g. swapping two columns) or changing the type of a column (e.g. integer to float) lead to new hash values and duplicated datasets. There are some alternatives that attempt to address this, such as UNF (http://guides.dataverse.org/en/latest/developers/unf/index.html).

It would be great to be able to specify an alternative hash function in DVC, and in particular to provide a user-defined function.
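To make the idea concrete, here is a toy sketch of the kind of canonicalization I have in mind: hash a CSV after sorting its columns, so that swapping two columns leaves the digest unchanged. Real UNF normalizes types, encodings and rounding much more carefully; the function name here is purely illustrative.

import csv
import hashlib

def schema_insensitive_md5(path):
    # Hash a CSV in a canonical form: columns sorted by name,
    # so reordering columns does not change the digest.
    digest = hashlib.md5()
    with open(path, newline="") as fobj:
        reader = csv.DictReader(fobj)
        fields = sorted(reader.fieldnames)
        digest.update(",".join(fields).encode())
        for row in reader:
            digest.update(",".join(row[k] for k in fields).encode())
    return digest.hexdigest()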

efiop commented 5 years ago

Hi @dunstantom !

That is a very interesting idea! So is cache duplication the main motivation for this request? Or do you also want to use something other than md5 for checking whether your dependencies have changed in a pipeline scenario?

Thanks, ruslan

dunstantom commented 5 years ago

I would say mainly it is having an alternative for md5. On one hand, one could argue about how SHA1 compares to md5, but on the other hand (and what I had in mind) we may have our own preferences on what kind of changes necessitate recalculating/updating a dataset. This ties in to limiting cache duplication as well, but more from the perspective of avoiding unnecessary computation than from a concern about storage for the cache. Hope that answers your question.

efiop commented 5 years ago

@dunstantom Thank you for your answer! 🙂 We do think about moving to SHA too, but other users also requested some faster hashing as well, so it looks like it is a general problem of supporting different hashing algos. Related to https://github.com/iterative/dvc/issues/1568 . This is something we'll take a closer look at later. Thanks for the feedback! 🙂

rasmuse commented 5 years ago

@efiop I asked earlier this week in #1568 about the possibility of using other hash algorithms. I'm shifting that part of the discussion here since this issue seems to be a better match for my request (#1568 could be solved in principle by changing to one other hash algorithm, i.e., just switching the md5 function for another one).

Short report on my progress: I made a local git checkout -b not-only-md5 yesterday. I notice that the code assumes in many places that each Remote implementation is tied to a single hash algorithm. This is most evident from the class-level constant PARAM_CHECKSUM (the name of the hash algorithm).

My first idea about how to implement this is to move the info about the hash algorithm to the Output and to let each instance of an Output have its own hash algorithm. At present the OutputBase class has a property checksum that depends on its self.remote.PARAM_CHECKSUM, and my idea would be to invert this relationship so that the Remote would get Output-specific information on the hash algorithm from the Output.

But I only know a very small part of the code base, so I'm hoping you can help me judge if this idea makes any sense at all.
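Very roughly, the shape I have in mind is something like this (class and attribute names are only illustrative, not the actual DVC code):

class Output:
    def __init__(self, path_info, remote, hash_name="md5"):
        self.path_info = path_info
        self.remote = remote
        self.hash_name = hash_name  # each output carries its own hash algorithm

    @property
    def checksum(self):
        # Instead of reading remote.PARAM_CHECKSUM, the output tells the
        # remote which algorithm to use for this particular path.
        return self.remote.hash(self.hash_name, self.path_info)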

It would then instead be up to the specific cache implementation to check the hash function on the Output instance and decide on a cache path. In principle one could imagine something like this in the cache directory:

cache
├── md5
│   └── ab
│       └── cdef
└── sha256
    ├── 12
    │   └── 3456
    └── 89
        └── 0123

But for backwards compatibility any md5 hashed files could be stored under their old paths, like so:

cache
├── ab
│   └── cdef
└── sha256
    ├── 12
    │   └── 3456
    └── 89
        └── 0123

efiop commented 5 years ago

Hi @rasmuse !

We used to have an assumption that you might have a few checksums, but the code evolved in such a way that we started to stick with a single-checksum assumption. That being said, the underlying logic is still partially there. For example, if you take a look at checksum_info (also called info sometimes), you'll see that it is a dict, so you could have a few checksums there. Also, each remote has its own checksums (mostly), but even if you use the same remote for an external cache (e.g. where the cache is organized by etags) and push your local cache there (organized by md5s), you'll get the same dir structure (say we have a cache file with md5 123456 and a cache file with etag ABCDEF):

cache
├── 12
│   └── 3456
└── AB
    └── CDEF

So far, the likelihood of those two colliding was pretty slim, so we didn't worry about it too much. But if we are moving to a more general approach, it totally makes sense to separate them somehow. I like your idea of creating subdirs for each hash type, and I agree that for backward compatibility we should stick with the current cache structure for md5 caches, meaning that it won't be put into cache/md5/ for now.

We actually have a few problems with the current hashing (for example, see https://github.com/iterative/dvc/issues/992) and I think we could switch to using cache/md5 then, so we could distinguish between the old md5 cache and the new true one. Btw, our ssh remote in the external outputs scenario is using true md5, so we could make it use the new cache dir structure right away :nail_care:

This is how I see it (approximately):

1) Introduce a hash config option in the core section that would tell dvc which default hash algo to use;

OPTIONAL: make it also accept a number of hashes in a given order of priority. E.g. if the config looks like

[core]
hash = sha256, md5

it would try to compute sha256 first, but if the remote doesn't support sha256 for some reason, it would fall back to md5.
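Roughly, picking the algorithm could look like this (just a sketch with made-up names, not actual dvc code):

def pick_hash(remote, configured=("sha256", "md5")):
    # Return the first configured hash algorithm the remote supports,
    # walking down the priority list.
    for name in configured:
        if hasattr(remote, name):  # e.g. remote.sha256() or remote.md5()
            return name
    raise NotImplementedError("remote supports none of the configured hashes")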

NOTE: if we wanted to, we could easily support per-remote and even per-output/dep hash type configuration later.

2) Make OutputBase by default use the hash type that is specified in the config (this is effectively going to replace the current PARAM_CHECKSUM constant). Note that if we are simply checking whether anything has changed (e.g. with dvc status), the output should be smart enough to fall back to md5 checks if it doesn't have a sha recorded in the dvcfile yet. E.g.

$ dvc config core.hash md5 # default
$ echo foo > foo
$ dvc add foo
$ dvc status # should report that everything is up to date
$ dvc config core.hash sha256
$ dvc status # should still report that everything is fine

Or, maybe, if you specify strictly hash = sha256, you would actually want to forbid using any other hashes (such as md5s) at all. This is where that optional part from 1) would come in handy, because the behaviour described above would only be possible with

dvc config core.hash sha256, md5

:slightly_smiling_face: But we could totally start with a simple fallback approach for now.
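For the status check itself, the fallback could look roughly like this (illustrative only, not the actual dvc internals):

def changed(output, saved_info, priority=("sha256", "md5")):
    # saved_info is whatever hashes the dvcfile recorded, e.g. {"md5": "..."}.
    # Use the first recorded hash type; if nothing was recorded yet,
    # consider the output changed.
    for name in priority:
        if name in saved_info:
            current = output.remote.hash(name, output.path_info)
            return current != saved_info[name]
    return True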

3) Make OutputBase use a self.remote.hash(self.hash_name, self.path_info) method instead of the self.remote.save_info that is used right now;

4) Implement RemoteBase.hash(hash_name, path_info) that would call a specific hash method like self.md5(path_info) or self.sha256(path_info). Obviously, for now, only dvc/remote/local.py would support sha and all others would have to raise something like a NotImplementedError. But we will be able to generalize that for all remotes later (related to https://github.com/iterative/dvc/issues/1410).

5) Implement RemoteLOCAL.md5(path_info) and RemoteLOCAL.sha256(path_info);
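Roughly, 4) and 5) together might look like this (sketch only, ignoring the state db for now; names are illustrative):

import hashlib

class RemoteBase:
    def hash(self, hash_name, path_info):
        # Dispatch to a specific hash method like self.md5 or self.sha256.
        method = getattr(self, hash_name, None)
        if method is None:
            raise NotImplementedError(
                "{} is not supported by this remote".format(hash_name)
            )
        return method(path_info)

class RemoteLOCAL(RemoteBase):
    def sha256(self, path_info):
        # Stream the file so we don't load it into memory all at once.
        digest = hashlib.sha256()
        with open(path_info, "rb") as fobj:
            for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()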

NOTE: this is where it might get tricky, because our RemoteLOCAL uses our state db for caching md5s, so we don't have to recompute them each time. For now, you could simply not cache sha values at all, to make it simpler for you. We have some state generalization logic coming in https://github.com/iterative/dvc/issues/1654 that will help us easily cache sha (and other hash types) the same way as md5, so we will be able to help you with that later. :slightly_smiling_face:

Hope this helps :) Let us know if you have any questions and/or need help.

Thanks, Ruslan

xkortex commented 4 years ago

It's great that this is being worked on. Solving for the general use case of a configurable hash is excellent, even though I'm sure there is some extra complexity there. There are a lot of advantages to different hash schemes:

- blake3 - fast as heck, also available as a python package with no other deps. (This would be ideal for DVC since it's so fast, but the package is new-ish and old-and-stable is always valuable. So having that configurable based on a user's risk tolerance is 💯 great)
- sha1 - same hash as git, bittorrent (historically)
- sha256 - same hash as git, bittorrent (v2)

mbraakhekke commented 4 years ago

It seems to me that @dunstantom's issue is not really addressed by the proposed solution. The problem is that trivial permutations of the data result in different hashes. If I understand it correctly, alternative hash algorithms will not solve this. Only custom hash functions, or possibly wrappers around the built-in hash functions can. My problem is that most of my data formats have a creation time stamp embedded so each run will result in outputs with different hashes. Would be great if I could define my own hash function, e.g. in the form of a python function.

mbraakhekke commented 4 years ago

I've forked DVC and implemented some changes to allow the user to supply hashes in separate checksum files. If a hash is needed for a file in the dataset, DVC searches for a checksum file with the same name but suffixed with ".md5". If this is found, the hash is retrieved from the checksum file instead of running the algorithm on the dataset file. I've run all the tests--some fail, but these also fail for master on my system. I propose this as an option for allowing user-supplied hashes, but I'm not sure if submitting a pull request makes sense at this stage. I imagine my changes might have some consequences I can't foresee. Also, I guess you'd want this controlled with some configuration setting or option passed to DVC. But I'd be happy to develop this further if the DVC developers let me know what needs to be changed.
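In simplified form, the lookup I added behaves roughly like this (not the exact diff, just the idea):

import hashlib
import os

def file_md5(path):
    # If the user supplied a sidecar checksum file (e.g. data.csv.md5),
    # trust it; otherwise compute the md5 of the file as usual.
    sidecar = path + ".md5"
    if os.path.exists(sidecar):
        with open(sidecar) as fobj:
            return fobj.read().split()[0]  # tolerate "hash  filename" lines
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()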

efiop commented 4 years ago

@mbraakhekke Timestamps are a classic problem of reproducibility: e.g. https://wiki.debian.org/ReproducibleBuilds/Howto#Files_in_data.tar_contain_timestamps . Usually the way to solve it is to just strip the timestamps off your files, since if you can use a custom md5, likely you just don't care about those stamps at all. Would that be a possible solution for you?

Your approach works too, but it is error-prone, as you are pretty much hacking dvc and dvc won't be able to verify the cache files with the md5 that you provide. We could use it only for dependencies though, as those just need a checksum and not necessarily a full blown hash. That's why we were discussing alternative hashes, so that you could make dvc aware that you want to use some custom hash and whether or not dvc could actually treat it as hash for caching purposes.

To my understanding, in the scenario that you've mentioned, I think stripping the timestamps is the better solution, as it is explicit and doesn't rely on dvc hacks. If this solution is acceptable for you, of course.

xkortex commented 4 years ago

Rephrasing the original problem statement a bit, here's what we are looking for:

Given a set of binary blobs x_n ∈ X, where each x is a vector of bytes, and some transformation with an outcome F : x ↦ y (our "data science function"), what we want is a hash function H : x ↦ h such that H(x_n) = h for all n with F(x_n) = y. In other words, we want some H such that the preimage of H matches the preimage of F. Ignoring adversarial preimages for a moment, for "true" cryptographic hash functions there is only one x_n which satisfies this criterion, since any perturbation to x perturbs H(x); that's the point of crypto hashes.

If I understand correctly, this would put our desired functions in the class of Locality-sensitive hash functions (LSHF), and they are effectively orthogonal to cryptographic hash functions (CHF). md5 is effectively a crypto hash, it's just bad at it (all the more reason not to use it as the default dvc hash). UNF would be a LSHF.

The "locality" in this space is domain-specific to the problem at hand and the given function F varies across projects - i.e.: does the order of columns in a dataframe matter? If you only use column name indexing, no. If you use any kind of integer indexing, yes.

The "cool" solution would be to allow the .dvc metadata for a given file to specify the hashing algorithm, as a runnable command or python entrypoint. You could specify your own code that would compute the hash any way you pleased. I've been working on a caching/memoization wrapper library and this is exactly what I do. This is great from a UX point of view, and frankly challenging from the Ops/deployment point of view, and likely a nontrivial technical challenge for the DVC team as well. Now you have to somehow inject the python hashing code, which I guess since you already have the dvc metadata structure, the path specify isn't horrible. Environment will be a mess so I'm just going to assume it'll use the environ DVC itself uses.

I can see a route to this solution which involves:

At least, that's how I'd personally do it.

shcheklein commented 4 years ago

@xkortex @mbraakhekke @efiop how about this one https://github.com/iterative/dvc/issues/3439 (and we have PR by @casperdcl that implements it) ? Especially the discussion with a specific proposal, second part of the ticket. It seems to me that that mechanism covers some problems discussed here. It might not cover the part that is related to replacing md5 with a better hash for the sake of performance, cryptographic strength, etc.

casperdcl commented 4 years ago

yes #4363 is still work in progress; will probably find time to continue in a few weeks unless someone wants to take it over.

mbraakhekke commented 3 years ago

Late reply. Nice to see everyone's input!

> @mbraakhekke Timestamps are a classic problem of reproducibility: e.g. https://wiki.debian.org/ReproducibleBuilds/Howto#Files_in_data.tar_contain_timestamps . Usually the way to solve it is to just strip the timestamps off your files, since if you can use a custom md5, likely you just don't care about those stamps at all. Would that be a possible solution for you?

@efiop: Yes, the time stamps are not relevant.

> Your approach works too, but it is error-prone, as you are pretty much hacking dvc and dvc won't be able to verify the cache files with the md5 that you provide. We could use it only for dependencies though, as those just need a checksum and not necessarily a full blown hash. That's why we were discussing alternative hashes, so that you could make dvc aware that you want to use some custom hash and whether or not dvc could actually treat it as hash for caching purposes.

OK, this would be one of those unforeseen consequences I mentioned. Yes, I see that for verification purposes this would cause problems. Indeed, my main motivation is avoiding unnecessary reruns of pipeline stages. So you're saying that the custom hash would complement the standard hash, rather than replace it?

> To my understanding, in the scenario that you've mentioned, I think stripping the timestamps is the better solution, as it is explicit and doesn't rely on dvc hacks. If this solution is acceptable for you, of course.

I see your point. However, this is easier said than done, particularly for binary files. Some examples from a project I'm currently working on:

For the first two I could switch to other formats. However, in general this is not always possible--some systems need to be able to work with certain formats. Regarding the last point (binaries), this is trickier. I guess it is possible to patch a binary to remove timestamps, but I've found there are also other sources of seemingly random variation.

I want to convince my team to start using DVC, but most of them have little experience with this sort of stuff and little time/patience to figure it out. I really like DVC, but I think this could be an obstacle to widespread adoption, particularly for Windows users.

mbraakhekke commented 3 years ago

> @xkortex @mbraakhekke @efiop how about this one #3439 (and we have PR by @casperdcl that implements it) ? Especially the discussion with a specific proposal, second part of the ticket. It seems to me that that mechanism covers some problems discussed here. It might not cover the part that is related to replacing md5 with a better hash for the sake of performance, cryptographic strength, etc.

@shcheklein, this discussion is quite long. Could you tell me with which post the relevant part begins :wink:?

shcheklein commented 3 years ago

@mbraakhekke sure, I guess the last comment summarizes it well enough - https://github.com/iterative/dvc/issues/3439#issuecomment-663873682

The idea is that for every dependency and/or output in DVC files we could specify a custom script, command, whatever to execute on that dependency/output before we calculate md5. To be precise: cmd <dep-file/out/file> | md5 instead of md5 <dep-file/out/file>.
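Conceptually, something like this (the script and file names below are just placeholders):

import hashlib
import subprocess

def md5_of_transformed(cmd, path):
    # Run the user's command on the dependency/output and md5 its stdout,
    # i.e. roughly `cmd <file> | md5` instead of `md5 <file>`.
    out = subprocess.run(cmd + [path], check=True, capture_output=True).stdout
    return hashlib.md5(out).hexdigest()

# placeholder usage: strip timestamps with a user-provided script before hashing
#   md5_of_transformed(["python", "strip_timestamps.py"], "data/output.nc")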

casperdcl commented 3 years ago

Yes I'd say #3439 and #1676 are duplicates - should probably close one to streamline the discussion. I think #3439 has more recent implementation discussion but #1676 has a more appropriate title.

respatialized commented 1 year ago

I'd like to echo some of these concerns and add a complementary perspective on why user-defined hashes might be valuable for pipeline stages. In addition to having value for custom analysis of data dependencies, a UDF for hashing also introduces the potential for more fine-grained analysis of code dependencies.

The motivating example for me, coming from the Clojure world, is the fact that changes to a source file don't always change the program's syntax tree. Some examples:

whitespace

The simplest example: two forms that differ only in their whitespace placement will naturally yield equivalent programs.

(map inc (range 25))
;; is the same as
(map inc
     (range 25))

Changes like this will trigger a re-run of a pipeline stage, even though no definitions in the program have changed. Granted, this can be avoided by a "format on save" workflow, so it's not as though a solution doesn't currently exist for this particular problem. But there are other considerations:

regular comments

Adding a comment to some code after you've run it will also change the hash, a situation likely familiar to many DVC users regardless of their preferred programming language.

(mapv dec (range 89 23 -2)) ; a change to this comment changes the hash

(comment ...) forms

In Clojure, a specific type of form is used for example code that is evaluated from the REPL or console: the rich comment form. It yields nil when evaluated as a whole; generally sub-expressions within it are sent to a running Clojure process for interactive and iterative development. If these inline test expressions change, the pipeline will similarly need to be re-run, even though the form doesn't affect the program's behavior.

UDFs and static analysis

Static analysis tools exist for Clojure - like tools.analyzer or clj-kondo - which can detect which changes are significant, like function definitions, and which aren't, like whitespace, regular comments, and rich comment forms.

Adding the possibility of UDFs to dvc enables other programming languages to use their ecosystems' own static analysis tools to similarly detect relevant and irrelevant changes to files. I imagine many people would like the ability to add more comments to a long-running piece of code without needing to remember to manually dvc commit those changes each time they document a stage in their pipeline. So for that reason, I hope that user-defined hash functions make their way into DVC!

rasmuse commented 1 year ago

@respatialized

Interesting perspective! I have had similar thoughts but always thought of it a little bit differently. My idea would be to use such tools as a separate processing step which treats the code as data and transforms the human-readable (commented, idiosyncratically formatted, etc) code into a normalized form (free of comments, normalized formatting, etc). This normalized code would then be input into the actual processing step. As long as the normalized code has the same hash digest, the output of the (potentially expensive) calculation can be fetched from cache.

The benefit of this approach is that nothing "new" is really introduced into the workflow system. The code normalization step becomes a processing step like any other.
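As a concrete (if language-shifted) illustration, a normalization stage for Python code could be as small as this, using the standard library's ast module to drop comments and formatting before hashing. Just a sketch; a Clojure project would use its own analyzer instead.

import ast
import hashlib
import sys

def normalized_source(path):
    # Parse and re-emit the code so comments and formatting disappear;
    # only structural changes to the program alter the result.
    with open(path, "r", encoding="utf-8") as fobj:
        tree = ast.parse(fobj.read())
    return ast.unparse(tree)  # requires Python 3.9+

if __name__ == "__main__":
    src = normalized_source(sys.argv[1])
    print(hashlib.md5(src.encode("utf-8")).hexdigest())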

What are your thoughts on this idea?

majidaldo commented 1 week ago

> @respatialized
>
> Interesting perspective! I have had similar thoughts but always thought of it a little bit differently. My idea would be to use such tools as a separate processing step which treats the code as data and transforms the human-readable (commented, idiosyncratically formatted, etc) code into a normalized form (free of comments, normalized formatting, etc). This normalized code would then be input into the actual processing step. As long as the normalized code has the same hash digest, the output of the (potentially expensive) calculation can be fetched from cache.
>
> The benefit of this approach is that nothing "new" is really introduced into the workflow system. The code normalization step becomes a processing step like any other.
>
> What are your thoughts on this idea?

I lean towards thinking this can be solved outside of dvc. The only 'value' I might see in integrating it with dvc is that it would save typing the normalizing code.