icerpc / slicec

The Slice compiler library
Apache License 2.0
13 stars 5 forks source link

Sha256 hashing for SliceFile added #696

Closed ReeceHumphreys closed 3 months ago

ReeceHumphreys commented 3 months ago

This PR adds a compute_sha256_hash method to SliceFile. This function is used to generate a hash that will be stable across releases. Given my comment below, should we re-export sha2 from slicec since I want to use it in slicec-cs? This would guarantee the version slicec-cs uses will always be identical to the slicec version.

ReeceHumphreys commented 3 months ago

Note: The idea with this PR is that it will enable me to do something like this in slicec-cs

// Hash the vector of slice files using the sha256 algorithm.
let final_hash = compilation_state
    .files
    .iter()
    .map(|file| file.compute_sha256_hash())
    .fold(Sha256::new(), |mut hasher, file_hash| {
        hasher.update(file_hash);
        hasher
    })
    .finalize()
    .map(|byte| format!("{:02x}", byte))
    .into_iter()
    .collect::<String>();
InsertCreativityHere commented 3 months ago

The idea with this PR is that it will enable me to do something like this in

All good! Just wanted to say that the array returned by finalize can also be directly hex-ified though. Not sure what the exact format is, but try printing the result itself (with {:032x}) and see if it looks reasonable. If it does, then the last 3 functions in your chain can probably just removed.

    .finalize()
    .map(|byte| format!("{:02x}", byte))
    .into_iter()
    .collect::<String>();

But maybe it will output things in a way we don't like. I don't remember, it's been a while since I've done hashing in Rust. Worth a try though IMO!


Not a typo. Changing your 02 changed to 032 was intentional since a single byte takes up 2 digits, whereas an SHA256 hash should take up 32 digits.

externl commented 3 months ago

I think it makes sense to have this in slicec with a helper function to hash a list of files. In the future we may want to include this "compilation hash" as part of the serialized AST.

InsertCreativityHere commented 3 months ago

It's not obvious to me if we should only include the file contents and not include file name/ path.

Yes, an important question is: "what are we trying to measure here?" Without an answer to that, it's not clear what we should and shouldn't be taking here. I'd assumed I'd missed a discussion on this, but maybe not.

InsertCreativityHere commented 3 months ago

In the future we may want to include this "compilation hash" as part of the serialized AST.

Sure, it can live in slicec, but I would be surprised if it's doesn't end up being dead code. Even if we decide to include a hash alongside the serialized AST, it would probably include more information that the hash we're computing for our telemetry stuff.

And moving it would solve the concern of crate versions falling out of sync. Because I would really prefer not to re-export sha2 from slicec. Adding the entire crate to our public API, for this very niche function seems pretty extreme to me.

ReeceHumphreys commented 3 months ago

I moved what was originally going to be in slicec-cs into a helper function here instead. I implemented it just for slices of SliceFile. We could do it more generically but its not necessary imo.

externl commented 3 months ago

In the future we may want to include this "compilation hash" as part of the serialized AST.

Sure, it can live in slicec, but I would be surprised if it's doesn't end up being dead code. Even if we decide to include a hash alongside the serialized AST, it would probably include more information that the hash we're computing for our telemetry stuff.

I'm not really following what yo mean here. Once of the things we can include in the response would be a hash of the src + reference files. No need for each language to try and compute (read all the files and figure what what's needed) this itself. We'll likely want to try and exclude unused reference files.

And moving it would solve the concern of crate versions falling out of sync. Because I would really prefer not to re-export sha2 from slicec. Adding the entire crate to our public API, for this very niche function seems pretty extreme to me.

externl commented 3 months ago

I think it's fine to just include