iterative / dvc

πŸ¦‰ ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

Support hashes other than MD5 #3069

Open adrinjalali opened 4 years ago

adrinjalali commented 4 years ago

The docs seem to imply that md5 is the only hash users can use. The vulnerabilities of md5 have made it unusable in many organizations that require a certain level of security.

Is there a way to use SHA1, SHA256, or other hashes instead?

I saw some PRs (https://github.com/iterative/dvc/pull/2022, for instance), but they were closed without being merged. What's the progress on that front?

efiop commented 4 years ago

Hi @adrinjalali !

In our case md5 is not used for cryptography; it is simply used for file hashing, where it is still suitable (at least I'm not aware of any issues with it in such scenarios), so it is safe to use for that purpose. The contents of the cache files are not encrypted; you can see that for yourself by opening any file under .dvc/cache. If you need to encrypt your data, you could do that pretty easily on the remote itself (e.g. the sse option for the s3 remote) or encrypt in your pipeline along the way and only track the encrypted files. We also have https://github.com/iterative/dvc/issues/1317 for in-house encryption; please feel free to leave a comment there, so it is easier for us to estimate the priority. Would cloud-side encryption be suitable for your scenario?
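For the "encrypt in your pipeline" route, a minimal sketch (assuming the cryptography PyPI package; the file names and key handling are placeholders, not DVC API):

from pathlib import Path

from cryptography.fernet import Fernet  # pip install cryptography

# Key management is up to you; we just generate a throwaway key here.
key = Fernet.generate_key()
fernet = Fernet(key)

plaintext = Path("model.pkl").read_bytes()
Path("model.pkl.enc").write_bytes(fernet.encrypt(plaintext))

# Then track only the encrypted artifact, e.g. `dvc add model.pkl.enc`.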

adrinjalali commented 4 years ago

I understand these hashes shouldn't be the only defense layer against a malicious player, but they are still a very useful one. Imagine a malicious player who has write access to the storage (an s3 bucket in this case). They can take a model file, modify it maliciously in a way that keeps the md5 unchanged, and upload the new one. The rest of the pipeline won't detect a change in the file and will serve the new, wrong model.

Again, I understand this should not be seen as the only way to secure the models, but in places where software goes through a security audit before being used, md5 is one of the first things auditors will use to reject it.

Another point is that having a SHA is not expensive or complicated at all, so why not include it?

JavierLuna commented 4 years ago

I agree with @adrinjalali, using MD5 to check for integrity is far from ideal.

As SHA-1 is starting to show collisions as well, DVC could join git in replacing the hashing function with SHA-256.

efiop commented 4 years ago

@adrinjalali @JavierLuna Thanks for the info, guys! We will support other hash functions eventually; it just wasn't the focus for us, but we will reconsider it. We've had an attempt from an outside contributor (https://github.com/iterative/dvc/pull/1920), but he wasn't able to finalise it :slightly_frowning_face: If anyone is interested in contributing this feature (similar to #1920, where sha is an opt-in config option for now), we will be happy to help with everything we can.

I'll leave this issue open in addition to https://github.com/iterative/dvc/issues/1676, as this current one is about simply replacing md5 with something else (possibly as an option), while #1676 is originally about custom hash functions.

JavierLuna commented 4 years ago

I'd like to help with both issues! Maybe I'll try working on #1676 and set the default hashing function to sha-256?

efiop commented 4 years ago

@JavierLuna That would be great! :pray:

Maybe I'll try working on #1676

Let's start with the current one, similar to how #1920 does it. Custom hashes could be implemented later, especially since we don't really know what type of hash the creator of #1676 wants.

set the default hashing function to sha-256

Please don't set it as the default one; it should be configurable. MD5 should stay the default for now. If the feature is implemented correctly, we will be able to switch to sha-256 later with a one-line PR :slightly_smiling_face:

Please let us know if you have any questions :) FYI: we also have a #dev-general channel in our discord, feel free to join :slightly_smiling_face:

pared commented 4 years ago

Also, we need to remember that changing the default checksum will force a full cache recalculation for users upgrading their DVC version, and this could bring a lot of issues. If keeping md5 is a severe problem, the change of the default checksum should probably be left for a major version release.

Erotemic commented 3 years ago

It might be interesting to use xxhash: https://cyan4973.github.io/xxHash/ https://pypi.org/project/xxhash/

which is quite a bit faster than md5 (although like md5 it is not cryptographic, but that's fine for the dvc use case).

As a reference, I have a file util_hash in my ubelt library that supports generalized hashing. I may take a stab at integrating xxhash in a future PR.

Erotemic commented 3 years ago

I'm currently running a VERY long dvc add, and I wanted to ping again to add some steam to getting xxhash incorporated into dvc.

XXhash is quite a bit faster. I've done some benchmarks:

https://github.com/Erotemic/ubelt/blob/master/dev/bench_hash_file.py

You can run these yourself to verify on your hardware, but this is what I got on mine:

    xxHash is much faster than sha1, bringing computation time down from 0.57
    seconds to 0.12 seconds for a 387M file.

If the pattern holds, that means switching to the xxh64 hasher would give a 4.75x speedup when adding files.
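If you want to reproduce this without the script above, here is a minimal timing sketch (assuming the xxhash PyPI package; data.bin stands in for any large local file):

import hashlib
import time

import xxhash  # pip install xxhash

def hash_file(path, hasher, blocksize=2 ** 20):
    # Stream the file in 1 MiB chunks so memory use stays flat.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(blocksize), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

for name, make_hasher in [("md5", hashlib.md5), ("sha1", hashlib.sha1), ("xxh64", xxhash.xxh64)]:
    start = time.perf_counter()
    hash_file("data.bin", make_hasher())
    print(name, round(time.perf_counter() - start, 3), "s")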

jtrakk commented 3 years ago

DVC should only rely on cryptographic hash functions that are collision resistant, which would exclude md5 and xxhash. Otherwise it's easy to get the wrong file from a hash.

Erotemic commented 3 years ago

Don't confuse collision resistance with cryptographic security. DVC does not need a cryptographic hash function, but it does need a collision resistant one.

Many if not most datasets are small enough that I would expect xxh64 to be collision resistant enough. Either way, from a design perspective the hashing function should be configurable, if/when we find better hashing algorithms.

AFAIK it is not necessary for a hash function to be cryptographic for it to be collision resistant.

https://en.wikipedia.org/wiki/Collision_resistance

For instance, just because it is easy to find data that collide doesn't necessarily mean it's likely for those collisions to be valid files.

How collision resistant is xxhash? This thread has some insights:

https://github.com/Cyan4973/xxHash/issues/165

jtrakk commented 3 years ago

If dvc tells me I got "data.csv with checksum 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4", I want to be confident that I have the right file, not one that's forged or unlucky.

Wikipedia lists some features to consider.

The ideal cryptographic hash function has the following main properties:

  • it is deterministic, meaning that the same message always results in the same hash
  • it is quick to compute the hash value for any given message
  • it is infeasible to generate a message that yields a given hash value (i.e. to reverse the process that generated the given hash value)
  • it is infeasible to find two different messages with the same hash value
  • a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value (avalanche effect)

I think all of those properties are desirable in DVC. There are fast and strong hash functions available for that purpose and DVC can use them. Some are even in the Python standard library.
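For example, a verification sketch using only the standard library (reusing the checksum quoted above; data.csv is a placeholder name):

import hashlib
import hmac

expected = "98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4"

h = hashlib.sha256()
with open("data.csv", "rb") as f:
    for chunk in iter(lambda: f.read(2 ** 20), b""):
        h.update(chunk)

# hmac.compare_digest gives a constant-time comparison.
if not hmac.compare_digest(h.hexdigest(), expected):
    raise ValueError("data.csv does not match the recorded checksum")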

Erotemic commented 3 years ago

Right, dvc only needs some, not all, of those properties. Again, don't confuse collision resistance with cryptographic security.

  • it is deterministic, meaning that the same message always results in the same hash

xxhash has this

  • it is quick to compute the hash value for any given message

xxhash is better at this

  • it is infeasible to generate a message that yields a given hash value (i.e. to reverse the process that generated the given hash value)

most dvc use cases do not care about this property.

  • it is infeasible to find two different messages with the same hash value

most dvc use cases do not care about this property.

  • a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value (avalanche effect)

xxhash and many other non-cryptographic hashing functions have this property.

What is not listed here is that any dvc hash function should have the property that, for any two files A and B, the probability of a collision is low: p(h(A) = h(B)) < threshold. And xxhash, md5, and sha256 all have this property. Perhaps xxhash would give collisions on HUGE amounts of data, but the normal terabyte use case will be fine.
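To put rough numbers on that threshold for accidental (non-adversarial) collisions, a back-of-the-envelope birthday-bound sketch:

import math

def birthday_collision_prob(n_files, hash_bits):
    # P(any collision among n random digests) ~= 1 - exp(-n(n-1) / 2^(bits+1))
    return -math.expm1(-n_files * (n_files - 1) / 2.0 ** (hash_bits + 1))

print(birthday_collision_prob(10_000_000, 64))   # xxh64-sized: ~2.7e-06
print(birthday_collision_prob(10_000_000, 128))  # md5-sized:   ~1.5e-25

By this estimate even a 64-bit digest is comfortable for tens of millions of files; adversarially crafted collisions are a separate question.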

If you are worried about forged files, then yes, you should configure dvc to use a secure hashing function (like sha256, and certainly not md5). So as it stands dvc is vulnerable because md5 is vulnerable. But I don't care about forged files. Are you worried about adversaries? If you are, consider signing your data with gpg keys. That will be more secure.

adrinjalali commented 3 years ago

Adversaries are a thing, and they make DVC as it stands vulnerable. I would love to be able to recommend DVC to the teams and orgs I work with, but as it is now, I can't. Many organizations have a real need for their tools to be secure, and the hash issue in DVC is something security people notice quickly.

I think the moral of this issue is that there are quite a few users who are requesting this feature, so it would be nice if it were to be prioritized 🀷

pmrowla commented 3 years ago

This issue is on our roadmap and will likely be addressed at the same time as https://github.com/iterative/dvc/issues/829. When we have support for chunking in DVC, we will move away from MD5 for identifying chunks, and will use some other cryptographically secure hash. SHA256 is one option, but we are also considering faster options (like the Blake variants).

Erotemic commented 3 years ago

@adrinjalali What is the threat model that DVC is realistically insecure against?

If an organization is hosting a DVC cache on its internal servers, then an adversary would need to gain access to those servers before they could forge a potentially malicious file that has the same hash as an expected file. If the adversary has access to your internal servers, your organization has bigger problems. I suppose this is more of a real concern for very large corporations where insider threats are more of an issue, but even so, controlling write permissions on the cache should mitigate most concerns.

Is there an attack or threat model that I'm not considering?

@pmrowla I hope that whatever hashing algorithms are being discussed are just for the defaults and in general the hashing algorithm will be configurable. In use cases like mine where everything is behind protected servers, I really don't care about having a cryptographically secure hashing algorithm. I just want the fastest one with a collision probability less than an epsilon threshold.

adrinjalali commented 3 years ago

Yes, if the adversary has access to the servers, that's a big issue. But this is about having multiple levels of security in your system, instead of relying only on your network security. Each component gets audited separately for its attack surface, and MD5 (or any other weak hash) is just a red flag raised by auditors.

This is about an adversary being able to change something without it being easily noticed. And a file changed while keeping the same hash is really hard to detect, if the hash is weak.

ktarplee commented 3 years ago

I would like to add that most government systems require FIPS 140-2 crypto. All insecure hashes are excluded (including MD5), as well as some secure hashes that are not "blessed" by NIST. The SHA-2 family of hashes is included in FIPS 140-2. For DVC to have a chance of running on those systems, you need to at least support a hash that is on the FIPS 140-2 list of approved algorithms. It does not matter to the auditors how you use the hash (e.g., if you don't care about preimage resistance); you still will not pass the audit. Furthermore, MD5 (and others like xxhash) are simply not implemented in OpenSSL in FIPS 140-2 mode and thus might not have a verified and efficient implementation available to Python.

efiop commented 3 years ago

@ktarplee Great points!

Erotemic commented 3 years ago

Furthermore, MD5 (and others like xxhash) are simply not implemented in OpenSSL in FIPS 140-2 mode and thus might not have a verified and efficient implementation available to Python.

@ktarplee xxhash has a super efficient python implementation: https://github.com/ifduyue/python-xxhash-cffi. That's how I got the times I cited in the benchmark above. Perhaps you meant they are not available to Python as a NIST-compliant hash? In that case, yes, and I want to reiterate that nobody should be using xxhash or md5 if they actually need a cryptographic hashing algorithm.

It does not matter to the auditors how you use the hash ... you still will not pass the audit.

I think this is understood by the people in this conversation, but for readers I want to explicitly differentiate "passing an audit" from "causing issues in real-world applications". If you will fail an audit just because you use a non-cryptographic hash at any point in the program, that is a problem with the audit methodology and not with the actual security of the stack. Yes, there is something to be said for "multiple layers of security", but in reality the government requirements discussed here amount to nothing more than lazy auditing, where auditors can throw a red flag because they matched some criterion on a checklist. This sort of audit does not consider the context in which the non-cryptographic hash is used and will throw any number of false alarms in order to maximize the number of real security threats found.

Under these lazy audit criteria, you shouldn't even be allowed to use Python dictionaries, because they use a non-cryptographic hashing scheme:

https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Python/pyhash.c#L39

And yes, there are even security implications to this: https://static.usenix.org/event/sec03/tech/full_papers/crosby/crosby_html/ (some of the issues listed there have since been mitigated).

But most government projects don't disallow the use of Python. Instead they compartmentalize the security risk such that an adversary simply doesn't have access to it.

Note that the use case in DVC is a little different. Security of the hashes is a concern if you were to host a dvc cache publicly. But I balk at the idea that something would fail an audit just because it uses a non-crypto hash, without considering the context in which that non-crypto hash is used. My goal in writing this is (1) to make sure people understand there is a bit more nuance to which situations really require a cryptographically secure hash, and (2) to make sure the DVC devs don't force us to use a crypto hash (although a crypto hash should definitely be the default!) and will let us configure DVC to use something faster like xxhash if our use case and threat model allow for it.

jtrakk commented 3 years ago

(2) to make sure the DVC devs don't force us to use a crypto hash (although a crypto hash should definitely be the default!) and will let us configure DVC to use something faster like xxhash if our use case and threat model allow for it.

Often when a tool allows users to configure it for a lenient environment, the low-threat configuration gets propagated into higher-threat environments. Configurability can be a problem when it can get non-experts into trouble. For example, a user googles "why does hashing take a long time?" and copies the answer from Stack Overflow without consulting a security expert. Configurability increases the dimensionality of a situation, sometimes enough even to trip up experts. For example, high-security user Alice accepts files from low-security user Bob, who's been using insecure protocols, thereby compromising the integrity of Alice's data.

FWIW my own informal timings:

# ~1GB data file
$ for prog in sha256sum md5sum b2sum xxh64sum b3sum; do echo $prog ; (command time -f %e $prog data >/dev/null) ; echo; done
sha256sum
6.58

md5sum # insecure
2.64

b2sum
2.10

xxh64sum # insecure
0.22

b3sum
0.06

shcheklein commented 3 years ago

@Erotemic @ktarplee @jtrakk an excellent discussion guys, very interesting to read!

I find this link, and the way Git handles the migration to SHA-256, very useful: https://stackoverflow.com/questions/28159071/why-doesnt-git-use-more-modern-sha/47838703#47838703

It also feels like there is no one-size-fits-all solution? Even if we agree on a hash to use in DVC 3.0, we should probably abstract it out like Git does, and/or allow users to plug in their own implementation after all.

xkortex commented 3 years ago

If you are going through the process of parameterizing the hash algo, you still ought to enforce some set of vetted algorithms/implementations. Compared to the difficulty of preimage/collision attacks, it's far too easy to compromise the configuration that sets the hash function, or to tweak the metadata so that a weaker hash is accepted.

MD5 is the weakest of the algorithms available in hashlib, but it will likely be supported by DVC for a long time, being the current default. So imagine that once the default is upgraded to SHA256, a .dvc file is encountered with a valid-looking md5 checksum. Naively, the software will parse md5: 699dc9e031aaf17f19b4e7ccff345002, select md5 from hashlib, and verify. All looks A-OK. But this is a recent codebase using SHA256 by default, and in fact both the source file and the .dvc metadata are malicious. There should be a config which enumerates the allowed algorithms.

Re: xxHash - that's neat, but if you are looking for performance, blake3 is arguably "good enough for DVC purposes" (~10x faster than MD5 and SHA256), with the huge advantage of being an actual cryptographic hash based on the BLAKE family, which has a lot of history behind it (itself based on ChaCha). Blake3/Bao also supports a streaming verification protocol which, while currently in its infancy, would be extremely useful for incremental hashing of large files: a) you get a huge speedup when two files diverge, and b) you could conceivably have an append-only file format and de-duplicate earlier blocks.
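For reference, hashing a file with the blake3 PyPI bindings looks roughly like this (a sketch; model.bin is a placeholder file, and max_threads is the bindings' opt-in multithreading knob):

from blake3 import blake3  # pip install blake3

hasher = blake3(max_threads=blake3.AUTO)  # let the library use all cores
with open("model.bin", "rb") as f:
    for chunk in iter(lambda: f.read(2 ** 20), b""):
        hasher.update(chunk)
print(hasher.hexdigest())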

iesahin commented 3 years ago

My 2c:

xkortex commented 3 years ago

I agree that the fewer degrees of freedom, the less complexity there is. However, I still see two different points of parameterization: within the DVC codebase, and user-configurable. The former just implies it would be straightforward to swap out the hash algo used by the codebase, i.e. minimize reliance on hard-coded values. I think this ought to happen regardless, as MD5 should be sunsetted. Replacing MD5 with SHA256 still implies the ability of DVC to read a metadata checksum name and value and validate it.

It's a second discussion whether to expose that ability to user-configuration.

Personally, I don't have that much trouble with exposing this as a config. The least-effort route is to just accept anything in algorithms_guaranteed and pass it to hashlib.new; that set is {'blake2b', 'sha3_256', 'blake2s', 'shake_128', 'sha3_384', 'sha384', 'sha512', 'sha3_224', 'sha1', 'md5', 'shake_256', 'sha256', 'sha3_512', 'sha224'} on 3.6, the minimum Python version supported by DVC. Pretty much offload the maintenance and security burden to the Python org XD.

Catch the ValueError: unsupported hash type foobar and make it slightly more user-friendly and you're done. Basically just something like:


import hashlib

try:
    hasher = hashlib.new(name)
except ValueError:
    # DVCConfigError stands in for DVC's config error type.
    raise DVCConfigError('"{}" is not a valid hash algorithm name. Acceptable '
                         'choices are {}'.format(name, hashlib.algorithms_guaranteed))  # or some whitelist

Documentation: "DVC default_hash can be configured to be one of the following: ..."

Security: it's already using MD5. Hard to go worse from there (if ensuring usedforsecurity=True).

Allowing pluggable hash algos, e.g. to support Blake3, yeah I can see that being a can of worms.

iesahin commented 3 years ago

Let me give an example:

Suppose Alice and Bob share a remote. Alice configures to use Blake3, Bob wants to use SHA256.

Now, we don't have any metadata in remotes that shows which hash algorithm is used. They are simple directory trees with a deterministic structure. How will we solve this? We need to keep metadata in remotes. That means all remote- and cache-related code would have to be updated just to support this.

Alice has a local cache in Blake3, a remote shared with Bob in SHA512, and Bob has a local cache in SHA256. When one of them wants to send a file to the other, a conversion is needed; where does it happen, and why?

Their remote is on a Windows server, and suppose they put it in C:\Very\Very\...\Deep\Dir\ and, guess what, they discover that Windows has a restriction of 1024 characters for paths. πŸ˜„ Or their directory is exactly 510 characters deep and it works for normal files, but fails for .dir files and we can't understand why.

All of DVC's data sharing and versioning code needs an update, and all these algorithms will need separate tests. We'll have to ask "which hash algorithm do you use?" in new tickets. We'll replace each "MD5" occurrence with at least a few sentences in the docs. If some of these algorithms have export restrictions in some countries, DVC will have to maintain separate branches/controls/whatever for each, because once you start to support n algorithms, you just can't go back and say "err, we actually didn't mean this one."

The hashing part should be abstracted away, so that if a new algorithm appears that is simply better for DVC, we can use it without much change in the codebase. I also fully support forks modifying this layer for their own use, but allowing users to change the algorithm in .dvc/config is not something I'd dare.

Thank you.

da2ce7 commented 2 years ago

The choice of MD5 is curious. Git is built upon SHA-1 (also broken, but alas, far, far more secure). Why choose a hash that has been known to be broken from day 0? One could start to be suspicious.

The blasΓ© urgency with which this issue has been treated here is again verging on suspicious.

We should upgrade to BLAKE3 as a matter of security urgency.

MD5 is a security-critical failure and should be addressed as such!

--

CC: @josecelano, @yeraydavidrodriguez

da2ce7 commented 2 years ago

@iesahin

Let me give an example:

Suppose Alice and Bob share a remote. Alice configures to use Blake3, Bob wants to use SHA256.

We need to stop having a completely critical security vulnerability, as a matter of urgency. The goal of supporting multiple hashes is nice; however, that is a distant second to moving away from MD5.

We need to pick any non-broken hash and do a one-way move, as a matter of desperate urgency. Later we can think about whether we want to support multiple hashing algorithms. That is a different topic from addressing this security vulnerability.

All of DVC's data sharing and versioning code needs an update, and all these algorithms will need separate tests. We'll have to ask "which hash algorithm do you use?" in new tickets. We'll replace each "MD5" occurrence with at least a few sentences in the docs. If some of these algorithms have export restrictions in some countries, DVC will have to maintain separate branches/controls/whatever for each, because once you start to support n algorithms, you just can't go back and say "err, we actually didn't mean this one."

You are simply wrong here. There are no practical export restrictions on crypto in this world anymore. This fear is about 20 years out of date. Remember when the wonderful people published the PGP physical book, with the source code printed inside?

iesahin commented 2 years ago

We need to stop having a completely critical security vulnerability, as a matter of urgency. The goal of supporting multiple hashes is nice; however, that is a distant second to moving away from MD5.

I agree completely.

There are no practical export restrictions on crypto in this world anymore. This fear is about 20 years out of date.

Laws change. That was an example though. More algorithms mean more points to consider, more options require more updates and more maintenance in the future.

efiop commented 2 years ago

Guys, we've been using md5 for historical reasons. Initially, we only had pipeline management where md5 was used as a checksum and not a hash, so it was perfectly suitable for the task. Data management grew from it and kept using md5. We are well aware that md5 is not a great hash, and will be switching to something more suitable (likely some SHA) in 3.0, so stay tuned.

da2ce7 commented 2 years ago

@efiop

Guys, we've been using md5 for historical reasons. Initially, we only had pipeline management where md5 was used as a checksum and not a hash, so it was perfectly suitable for the task. Data management grew from it and kept using md5. We are well aware that md5 is not a great hash, and will be switching to something more suitable (likely some SHA) in 3.0, so stay tuned.

Why don't you release a security update that updates to https://github.com/BLAKE3-team/BLAKE3 (my recommendation) as a matter of priority?

Having this sort of security vulnerability addressed 'in the next major update' isn't very professional IMHO...

Erotemic commented 2 years ago

@da2ce7 While I agree DVC needs to move away from MD5 ASAP, I disagree that the choice to delay to the next major update is unprofessional. Changing the hashing algorithm is a major change that will require existing users to rename all of the files in their dvc caches. This is not trivial.

In fact IMHO it is highly professional to be mindful of semantic versioning when making large breaking changes like this, even in the case of security.

As I mentioned beforehand, the primary use-case of DVC does not require a cryptographically secure hashing algorithm. The attack vectors in the most common cases are niche at best (note: I recognize there are serious concerns in some cases, but my argument is that these are not common). Much of the security community will pounce on anything that would be insecure in some context without really ever considering the real-world threat model in which that thing is applied. I think that is a mistake. It certainly doesn't warrant getting personal.

Regardless, I would second an urgent push to release DVC 3.0 with BLAKE3 as the default (and possibly only) hash algorithm.

efiop commented 2 years ago

I agree with @Erotemic on his points.

One thing about blake, though, is that it is not FIPS/NIST certified, which is a significant problem for people who need to comply with those standards, so I doubt that we will choose blake by default; hence I think that some version of SHA is more realistic.

One more thing: people sometimes confuse the checksums we use for pipeline checks with the (non-cryptographic) hashes we use for data management. For the former, we will support other custom checksums that could be more relaxed than md5, which depends on the full binary content.

We've been working hard on 3.0 prerequisites for the past half a year or so, and are continually working on those as we speak. We not only want to switch to a different algorithm in 3.0, but also to provide better performance/ui/architecture/ecosystem for data management, all while not ceasing releases with new features (experiments, dvc machine, plots, etc) and bug fixes for 2.0. So we've been gradually rebuilding, and will likely be ready for 3.0 in the upcoming months.

We really truly appreciate your interest πŸ™

da2ce7 commented 2 years ago

@Erotemic

In fact IMHO it is highly professional to be mindful of semantic versioning when making large breaking changes like this, even in the case of security.

In the context of semantic versioning, a breaking change should always be signalled by a major version number change. However, I do not feel that was the core of the statement @efiop was making. He wasn't referring to the fact that this breaking change will be called 3.0, but that this change will be included in 3.0.

As I mentioned beforehand, the primary use-case of DVC does not require a cryptographically secure hashing algorithm. The attack vectors in the most common cases are niche at best (note: I recognize there are serious concerns in some cases, but my argument is that these are not common). Much of the security community will pounce on anything that would be insecure in some context without really ever considering the real-world threat model in which that thing is applied. I think that is a mistake. It certainly doesn't warrant getting personal.

The problem with MD5 is that it is desperately broken: somebody could commit to a hash of a good file, and then later silently swap that file out for an evil variant (if they control the server).

@efiop

One thing about blake, though, is that it is not FIPS/NIST certified, which is a significant problem for people who need to comply with those standards, so I doubt that we will choose blake by default; hence I think that some version of SHA is more realistic.

Now that is a very niche use case. FIPS/NIST is normally enforced in the core security aspects of a system, such as when hashing passwords, or for the digest of a cryptographic signature.

In DVC's case the use is different: here we use the cryptographic hash to assure that there is only a single matching blob of data that can be referenced by any single commit within the git repository.

DVC routinely deals with large files, so the performance of the hashing algorithm is of quite some importance:

All three SHA variants are not very good for our case.

In addition, SHA-1 and SHA-2 are both not parallelizable and are subject to length-extension attacks.

On the other hand, BLAKE3 is perfect for our case.

It is extremely fast and very secure, and in particular it is infinitely parallelizable, so we can get a speedup from multi-core hardware for large files.

We really truly appreciate your interest πŸ™

Thanks, our project is considering the use of DVC and the choice of the hashing algorithm naturally is an important consideration for us.

iesahin commented 2 years ago

I also believe this change requires more testing and a major version upgrade.

BLAKE3 has technical advantages, but FIPS/NIST certification is an important point for some organizations. Setting the default to SHA-256, with optional builds that support other algorithms, might be a good solution. A plain pip install dvc may install the SHA-256 version, but we could also have pip install 'dvc[blake3]' for a faster version.

da2ce7 commented 2 years ago

@iesahin

I also believe this change requires more testing and a major version upgrade.

BLAKE3 has technical advantages, but FIPS/NIST certification is an important point for some organizations.

Setting the default to SHA-256, with optional builds that support other algorithms, might be a good solution. A plain pip install dvc may install the SHA-256 version, but we could also have pip install 'dvc[blake3]' for a faster version.

The very few organisations that are constrained by FIPS/NIST certification could use a different version of dvc. You are suggesting that we should cripple the speed of the default because of a hand-waving concern about FIPS/NIST certification.

There is a dramatic difference in performance between SHA256 and BLAKE3. I think the default should be to have good performance.

If, and only if, you are forced to use FIPS/NIST-certified hashes for this part of your company infrastructure, then you could use SHA-512-256. (SHA-256 would be silly, as it is slower than SHA-512 for everything other than the smallest of strings.)

Primarily, can you point to any interested organisations that require FIPS/NIST-certified hashes for the DVC files they have in their git repo? Is this actually a realistic goal? Is it such an important case that it should be the default?

Otherwise, I suggest this is just FUD about FIPS/NIST certification requirements.

iesahin commented 2 years ago

can you point to any interested organisations that require FIPS/NIST-certified hashes for the DVC files they have in their git repo?

I can't because they probably can't use DVC now. :)

The very few organisations that are constrained by FIPS/NIST certification could use a different version of dvc.

This is a good idea too. I'll leave the decision to the project maintainers.

Erotemic commented 2 years ago

The very few organisations that are constrained by FIPS/NIST certification

@da2ce7 I think you underestimate how many organizations that is. I work at an open source company and even I'm impacted by that (as much as it irritates me). Often it's not about actual security in these cases. Someone sees "OMG, you're using a hash, it had better be one on our passlist!" and it devolves into pseudo-software hell from there.

The gov funds a lot of companies, and to be eligible you have to jump through their hoops, even if they don't make sense. (I wasn't able to use my ed25519 ssh key to log in to a server recently. There was no way I was going to use the NIST curve, so I had to make a new RSA key. It felt a little dirty.)

I would really love something configurable in this case (and IMO it would be best to design in configurability from the ground up). I'd love to see blake3 as the default, and then a sha256 option as a hedge against FIPS/NIST certification issues. My prediction is that the probability they will come up and cause issues is high. But I think you're right that we shouldn't cripple DVC's speed in the default case because of short-sighted policy decisions, which have a lot more impact than you might think if you aren't exposed to them.

Now that is a very niche use case. FIPS/NIST is normally enforced in the core security aspects of a system, such as when hashing passwords, or for the digest of a cryptographic signature.

This is the only thing that gives me doubt. The wording in these certification guidelines is often vague and open to interpretation (quite the opposite of where I like to be). It could be the case that we argue DVC isn't a core security aspect (and based on my earlier arguments about why a crypto hash isn't always necessary for DVC, I would agree with that).

@iesahin your solution is effectively configuration of hashes, which I don't know is on the roadmap. I think the arguments here show that even if there is one technically superior hash (blake3), there are meatspace tradeoffs that need to be considered. If configuration is not on the roadmap, I suggest re-evaluating that decision. My opinion is that blake3 should be the default, with sha256 offered for the poor souls who need it.

iesahin commented 2 years ago

your solution is effectively configuration of hashes, which I don't know is on the roadmap

I'm not in a position to make these decisions.

What I'm suggesting is not setting an option in .dvc/config to change the hash algorithm. It should be done like a fork, and the fork may get a lower level of attention in support. This will also let the main branch adopt newer algorithms easily.

shcheklein commented 2 years ago

Looks like Linux is using blake now:

https://lore.kernel.org/lkml/20211223141113.1240679-2-Jason@zx2c4.com/

It sounds reasonable to make it configurable and make blake default.

iesahin commented 2 years ago

If it's really required to make it configurable, I think fixing the digest size (256 bits = 32 bytes) allows abstracting the digest -> path conversion. Instead of MD5, DVC may require "any hash function that outputs 256 bits."

I still think making it configurable won't add much value, and using SHA2-256 is the most obvious option due to NSA blessings. Longer digests may also be problematic for file system paths: 512 bits means 64 bytes, encoded as 128 characters in hexadecimal. In this case .dvc/cache/12/....EF will have 138 characters, and Windows systems have a 255-character limit for paths. (Encoding hashes in base64 to shorten the paths might be possible, but NTFS also doesn't allow case-sensitive file names by default.)
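The path arithmetic, for concreteness (a sketch assuming the two-character prefix layout used by the current cache):

digest_512 = "ab" * 64  # a 512-bit digest is 128 hex characters
path = ".dvc/cache/{}/{}".format(digest_512[:2], digest_512[2:])
print(len(path))  # 140; the same layout with a 256-bit digest gives 76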

bobertlo commented 2 years ago

I would like to point out that with hash configuration any future change would then not be a breaking change requiring a major version release.

I've just been watching this thread because the text/binary distinction in the hash has been limiting for my use cases. I don't think the use of MD5 is that bad or in need of urgent change, but it would be nice to be able to use another hash. If there were no choice, I would personally be a lot more comfortable with something like SHA-256 (or 512-256) than blake3. I like to be able to access the data externally, and sha256 is generally more available than blake3 (e.g. it is in the Go standard library).

Erotemic commented 2 years ago

Because the discussion doesn't seem to reference this issue when I raised it, I'll note that I think a configurable hash could solve the main issue with immutable storage backends like IPFS as discussed here: https://github.com/iterative/dvc/discussions/6777#discussioncomment-1904292

If I run ipfs add -nq setup.py I get the "hash" (really an IPFS CID): QmdsJ5Tn6E78KrGwvkrNRSoA9QvmX2iieo7Zv3GmVUcNMf, which is entirely based on the file contents. If the files in .dvc/cache were stored under these IPFS CIDs, then the issue of needing to know the name of the file on the remote is solved. The way I envision it, the IPFS backend would require the IPFS hasher, which would allow implementing an IPFS variant of dvc.fs.base.FileSystem.

Erotemic commented 1 year ago

@efiop Has there been any movement towards any of the ideas discussed here? I have an idea that I'd like to test out. Having an API that specifies the hash is necessary for it. Is a PR that adds a configurable hash API that is hard-coded to MD5 to start with (adding more hash algorithms can be discussed later) something that would be considered?

robotrapta commented 1 year ago

I would hope that the migration to a newer hashing algorithm would be done in a backwards-compatible way. Saying it has to wait until 3.0 because it's a breaking change implies something pretty troubling: does that mean dvc3 can only use the new hash and won't work with existing stores? @shcheklein @efiop (I sure hope not.)

For organizations that have really invested in using DVC, migrating their existing stores from one hash func to another will be at least a big hassle and perhaps completely intractable. At the very least you need to redo the entire storage - which is certainly expensive but possible. But then you also need to track down every .dvc file which references it, and those could be scattered to the winds in places that can't all be found reliably.

The right way to do this doesn't require waiting for a major version. You just add an option today which is "new hash / old hash" when creating a new repo. In 3.0 you can/should make "new hash" the default. IMHO it would be very reasonable to offer this option for new hash without any migration tool whatsoever. It just means people who start using dvc today are better off. And those who have legacy MD5 data stores can either figure out the migration themselves or stick with what they have.

brohee commented 1 year ago

Seeing as I just stumbled upon this and just experimented, let's be very clear with everyone: if your dataset includes files with the same MD5 sum, only the first one will ever be pushed to the remote; the others will be falsely deduplicated. It's super easy to test with the bins from https://github.com/corkami/collisions. Given how easy the files are to produce and the potential impact, I don't think this issue is being treated with nearly the urgency needed. I'd say use Blake3 already, FIPS be damned. (possibly a bit more discussion at https://pouet.chapril.org/@brohee/110537633498422926)
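To see the failure mode concretely with any colliding pair (the file names below are placeholders for binaries built from the corkami repo):

import hashlib

a = open("collision1.bin", "rb").read()  # placeholder colliding pair
b = open("collision2.bin", "rb").read()

assert a != b                                                    # different bytes
assert hashlib.md5(a).digest() == hashlib.md5(b).digest()        # same md5
assert hashlib.sha256(a).digest() != hashlib.sha256(b).digest()  # sha256 differs

# An md5-keyed cache maps both files to one path, so the second one
# is silently "deduplicated" away:
md5 = hashlib.md5(a).hexdigest()
print(".dvc/cache/{}/{}".format(md5[:2], md5[2:]))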

efiop commented 1 year ago

Hey folks, thanks for your interest in this. We've been working hard towards making the hash configurable in our internals, and at this point we are pretty much only missing an actual config option exposed to the public and some high-level stuff (and obviously some polish around it). We don't have an immediate plan to work on that right now, but it is on our radar and we will be considering it in the near/mid future. Will keep everyone posted on the changes here.

Erotemic commented 1 year ago

The new <cache>/files/<hash> folder in dvc 3.0 has been great so far; @efiop, you and the rest of the iterative team have done great work with that and all of the surrounding revamps!

Having spent the weekend slogging through md5 hashes, I'm so excited for blake3 and/or xxhash.

lefos99 commented 8 months ago

Nice, this Blake3 cryptographic hash function looks really promising! :love_you_gesture:

derdre commented 8 months ago

I would really like the option to use faster hash algorithms, e.g. Blake3. Especially in a project with a lot of files, this could accelerate my ML pipeline significantly. Do you think this will be added anytime soon? Or is there already a way to override md5?