API for incremental model re-hashing

laurentsimon commented 7 months ago

Provide an API that allows re-computing the hash of a subset of files in a model. This is useful in cases where only a (small) set of files have changed, eg if a user has updated a README in a huggingface repo with files of several hundred of GB.

The API may look something like the following:

sign(..., updated_files)

with

updated_files = list [
   (src_file, action, dst_file)
]

Example:

updated_files = [
  "path/to/file1", "RENAMED", "path/to/file2",
  "path/to/file3", "DELETED", "",
  "path/to/file4", "UPDATED", "path/to/file4"
]

Something along these lines. The API would use the existing manifest file and update it by re-computing only the hashes for files that need to be re-hashed.

laurentsimon commented 6 months ago

A simple first iteration could be to support just a list of files to recompute a hash for. The serializer would list files on disk, and for those that needs to be recomputed, it would re-serialize them. For the others, it would take it from the existing signature file.

mihaimaruseac commented 6 months ago

I think we always need 2 lists: one for files that got deleted and one for files that need to be rehashed.

For a deleted file, we just remove it from the manifest, error if it was not present.

For files that need to be rehashed, error if they're not present, recompute hash otherwise.

Moving/renaming files and just copying the hash will speed up the update process as we won't need to re-read the file's content, but we can risk a misuse of the API where a file that's both moved and modified will have the wrong hash.

laurentsimon commented 6 months ago

I think we always need 2 lists: one for files that got deleted and one for files that need to be rehashed.

For a deleted file, we just remove it from the manifest, error if it was not present.

For files that need to be rehashed, error if they're not present, recompute hash otherwise.

Depends what semantics we define for the list. If we allow files in the list to not be present on disk, it would allow the caller to pass a list of changed files (either deleted, renamed or added) without the need for them to know the details of what was changed. Our lib would figure what to recompute by listing existing files on disk and using the list to determine which to recompute (deleted files would simply be ignored since they won't be on disk).

I'm thinking that's as simple as it gets for callers. They would be able to use existing git command to list files changed, without parsing what sort of changes were made. To be honest I have not not played with git yet to see what command is needed to get the diff between a commit and the latest push :)

A slightly more advanced solution is for callers to pass the type of changes (either via 2 lists as you suggest) or via a more structured parameters (in the original description).

mihaimaruseac commented 6 months ago

Oh, you're thinking from the point of view of the CLI usage. I was thinking from the point of view of a ML training pipeline that needs to orchestrate between runs. Both are valid, we should support both scenarios.

git status --porcelain lists the files, but will need some additional bash scripting to remove the markers at the start.

laurentsimon commented 6 months ago

Oh, you're thinking from the point of view of the CLI usage. I was thinking from the point of view of a ML training pipeline that needs to orchestrate between runs. Both are valid, we should support both scenarios.

Can you provide more details about the requirements for this? What are the constraints? How are models stored between runs? Where does https://github.com/sigstore/model-transparency/issues/160#issuecomment-2110785159 fail for this use case?

laurentsimon commented 6 months ago

Oh, you're thinking from the point of view of the CLI usage. I was thinking from the point of view of a ML training pipeline that needs to orchestrate between runs. Both are valid, we should support both scenarios.

git status --porcelain lists the files, but will need some additional bash scripting to remove the markers at the start.

@McPatate how would you imagine an integration look like? Would HF API invoke the git command or use an existing Python git API? Or do you imagine a dedicated CLI for it, separate from HF Python framework API?

mihaimaruseac commented 6 months ago

Can you provide more details about the requirements for this? What are the constraints? How are models stored between runs? Where does #160 (comment) fail for this use case?

We want to have the ML training pipeline code call directly a library. If the pipeline knows that they are going to write files X, Y, Z, they should be able to call the signing library saying that only those files changed.

Using a CLI is a solution, but then it would require a separate process, that some MLOps engineers might not follow. Having this in the pipeline code itself would uplift it everywhere.

laurentsimon commented 6 months ago

Can you provide more details about the requirements for this? What are the constraints? How are models stored between runs? Where does #160 (comment) fail for this use case?

We want to have the ML training pipeline code call directly a library. If the pipeline knows that they are going to write files X, Y, Z, they should be able to call the signing library saying that only those files changed.

I was not thinking in term of CLI vs API (both will be available), but in terms of whether the API interface works for the use case or not. In https://github.com/sigstore/model-transparency/issues/160#issuecomment-2110785159, would the "simple" recompute_files parameter work for this use case?

mihaimaruseac commented 6 months ago

Oh, I misunderstood, given the git scenario to list the files that have a diff. But, I see upsides and downsides to either approach.

With an explicit list of files that are changed and deleted we don't need to scan the model tree and compare with a list, but we can miss some modifications if the library users don't pass the file in the proper list. And we'll have to define the error scenarios.

With a recompute_files we need to scan the directory tree and then we need to compare with the entire manifest (i.e., traversing every key in a dictionary) to compute the 2 lists above. We guarantee completeness but we would have a slowdown. But this would likely be simpler when doing a git show command to list the files and then piping that list into a CLI to update hashes.

laurentsimon commented 6 months ago

With a recompute_files we need to scan the directory tree and then we need to compare with the entire manifest (i.e., traversing every key in a dictionary) to compute the 2 lists above. We guarantee completeness but we would have a slowdown. But this would likely be simpler when doing a git show command to list the files and then piping that list into a CLI to update hashes.

yeah I think it's simple and the slowdown is likely negligible: even 100's files on disk is fairly fast to list them and compare them. It's also easier for the caller to not make mistakes.

McPatate commented 6 months ago

There exists repos with ungodly file structures, but let's not consider them, they are pretty rare 😄

Even https://huggingface.co/datasets/HuggingFaceFW/fineweb "only" has a couple thousand files.

@McPatate how would you imagine an integration look like? Would HF API invoke the git command or use an existing Python git API? Or do you imagine a dedicated CLI for it, separate from HF Python framework API?

I'm not exactly sure how huggingface_hub computes its diffs, but I believe it has the information of what changed in the repo (cf _commit_api.py). I think the API you chose for re-hashing makes sense. Imo I would go with creating a dedicated CLI as a first step before integrating to huggingface_hub. I'm happy to lend a hand here!

sigstore / model-transparency

API for incremental model re-hashing #160