rstudio / pins-r

Pin, discover, and share resources
https://pins.rstudio.com
Other
312 stars 63 forks source link

Possible to hash multiple files in language independent way? #599

Closed machow closed 1 year ago

machow commented 2 years ago

Currently, pins seems to use two strategies for hashing:

One challenge for porting the second hash strategy to python is that it seems hash serializes objects first, and digest will serialize by default. Below is an example.

Example

echo abc > test.txt

R code

# file: same as python -- 44bc2cf5ad770999
digest::digest(file="test.txt", algo="xx64")

# different
digest::digest("abc", algo="xx64")

# same -- 44bc2cf5ad770999
digest::digest("abc", algo="xx64", serialize=FALSE)

Python code

# pip install xxhash
import xxhash

# 44bc2cf5ad770999
xxhash.xx64(open("test.txt", "rb")).hexdigest()

# 44bc2cf5ad770999
xxhash.xx64(b"abc").hexdigest()

Possible fix

I think the following steps should be language independent (but I could be very wrong here):

juliasilge commented 1 year ago

Should we stick with xxHash or use MD5, which is (mostly) what is available on cloud platforms like AWS and GCP? I understand that xxHash is much faster, but if we are going to revamp hashing (related to #589) would it be more useful to store a hash that could be compared on common systems?

Related to that, I am not seeing the same results as what is above for files (differences in file writing???) ⤴️:

## which ones are the same for MD5?
digest::digest(file = "test.txt", algo = "md5")
#> [1] "0bee89b07a248e27c83fc3d5951213c1"
digest::digest("abc", algo = "md5")
#> [1] "0aa9c59ea893e51a8cc55e8ea353e592"
digest::digest("abc", algo = "md5", serialize = FALSE)
#> [1] "900150983cd24fb0d6963f7d28e17f72"

## which ones are the same for xxHash? NONE
digest::digest(file = "test.txt", algo = "xxhash64")
#> [1] "e8a1523b824c6e2d"
digest::digest("abc", algo = "xxhash64")
#> [1] "f4acd87c52d4e62b"
digest::digest("abc", algo = "xxhash64", serialize = FALSE)
#> [1] "44bc2cf5ad770999"

Created on 2023-02-23 with reprex v2.0.2

In Python, I see:

import xxhash

## same as R, but not as each other:
xxhash.xxh64(open("test.txt", "rb").read()).hexdigest()
#> 'e8a1523b824c6e2d'
xxhash.xxh64(b"abc").hexdigest()
#> '44bc2cf5ad770999'
import hashlib

## same as R, but not as each other:
hashlib.md5(open("test.txt", "rb").read()).hexdigest()
#> '0bee89b07a248e27c83fc3d5951213c1'
hashlib.md5(b"abc").hexdigest()
#> '900150983cd24fb0d6963f7d28e17f72'

This is after doing echo abc > test.txt in the working directory. I guess this isn't too important because we are going to be either hashing objects or files in similar ways. Still weird that I see something different!

For concatenating hashes together (to replace hash(hashes)) something like this will work:

digest::digest("abcdef", algo = "md5", serialize = FALSE)
#> [1] "e80b5017098950fc58aad83c8c14978e"
digest::digest(glue::glue_collapse(c("abc", "def")), algo = "md5", serialize = FALSE)
#> [1] "e80b5017098950fc58aad83c8c14978e"

Created on 2023-02-23 with reprex v2.0.2

machow commented 1 year ago

I think that the echo command I gave adds a newline character 😬. It might need something like this?:

echo -n abc > test.txt

Should we stick with xxHash or use MD5 ...?

I feel pretty paranoid about bad things that could happen by switching the overall hashing algorithm, but don't have anything specific in mind (I guess it could make comparisons between old-hash and new-hash pin versions hard?).

It seems like one potentially big advantage of md5 is that we could use it for names in the pins cache. But since the pins cache isn't flat, but has the structure "<board_name>/<pin_name>/<pin_version>/<content...>", it seems like it'd be tricky to get benefit from retrieving the md5 hash from a bucket?

juliasilge commented 1 year ago

Oh of course, the newline! 🙈 I had assumed you were sharing the actual code you had run and something weird was happening on my end.

I feel pretty paranoid about bad things that could happen by switching the overall hashing algorithm

Yeah, for sure, me too. This is not a small or minor thing to consider. Dealing with #589 is a big enough project, though, that we should figure out the really most useful way to change the hashing strategy.

I do think that there are enough potential advantages to MD5 to consider it; I'll work up an implementation and see how that goes.

juliasilge commented 1 year ago

After some more time and seeing what folks' challenges are, we now think that this isn't likely to be a problem for typical pins R + Python use cases. It could come up as a problem when a team wants to write to the same pin from both R and Python (for example, picture someone sometimes wanting to update a parquet file using R and sometimes using Python); this is not a common pattern currently.

We can revisit this in the future if it comes up for users. If we do, the main thing to look at is when the current strategy in R serializes before taking the hash (or uses rlang::hash()). In R:

## different from Python because serializes by default:
digest::digest("abc", algo = "xxhash64")
#> [1] "f4acd87c52d4e62b"

## either of these are OK:
digest::digest(file = "test.txt", algo = "xxhash64")
#> [1] "44bc2cf5ad770999"
digest::digest("abc", algo = "xxhash64", serialize = FALSE)
#> [1] "44bc2cf5ad770999"

Created on 2023-04-27 with reprex v2.0.2

In Python:

import xxhash

## same:
xxhash.xxh64(open("test.txt", "rb").read()).hexdigest()
#> "44bc2cf5ad770999"
xxhash.xxh64(b"abc").hexdigest()
#> "44bc2cf5ad770999"
machow commented 1 year ago

I'm on board with closing this issue, but want to clarify the issue raised here. This issue only concerns differences in hashing when hashing multiple files.

Here are the important points of single file hashing (which is not the issue I meant to raise here):

For writing single files, the last piece will produce issues for teams that both want to write from R and python, but is not about hashing (even though they use identical single file hashing strategies, different hashes occur because it's different data written to file).

Edit:

My note about serialize=TRUE in the original issue, is a suggestion for how calculating the hash of multiple hashes could use xx64, and produce identical results as if we used xx64 in python for multiple files. Sorry, I realize this is a weirdly cumbersome to check situation in pins 😅.

github-actions[bot] commented 1 year ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.