huggingface / huggingface_hub

The official Python client for the Hugging Face Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Do not create empty commit #1411

Closed HansBug closed 3 months ago

HansBug commented 1 year ago

Is your feature request related to a problem? Please describe.

I am using Hugging Face to build a mirrored repository of resources. Because the resources are huge, I use upload_file and upload_folder to upload them one by one, to avoid filling up the disk by downloading everything at once. However, when some of the files are completely identical to the existing versions in the Hugging Face repository, the huggingface_hub library still creates an empty commit. This is very detrimental to tracking data changes and results in thousands of empty commits every time the data is updated. Therefore, I would like an option so that when upload_file or upload_folder is executed and there is no substantial content change, no empty commit is created.

Additional context

Here is my repository: https://huggingface.co/HansBug/browser_drivers_mirror/tree/main
Here is an empty commit created by huggingface_hub: https://huggingface.co/HansBug/browser_drivers_mirror/commit/22f931ed1acc5dc67a0b43b6d2479320b3ed210a

Alternative solution (maybe)

Alternatively, if a new API were provided to check whether a local file or path is exactly the same as a file or path in the repository, the problem could also be solved.

Wauplin commented 1 year ago

Hi @HansBug, thanks for raising the question. At the moment the huggingface_hub library is not aware that the created commit is empty. One way of detecting this would be to compare the sha256 of each local file (which is already computed) with the sha256 of the remote files. The closest you can get is the location, commit hash, etag and size of a file via get_hf_file_metadata. But the sha256 is not returned, unfortunately :confused: (FYI, if an LFS file is committed twice, the data is not re-uploaded - the problem is more with "regular" files).
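
For illustration, a minimal sketch of what get_hf_file_metadata exposes today (commit hash, etag, size and location, but no sha256; repo and filename are just examples):

from huggingface_hub import get_hf_file_metadata, hf_hub_url

# A HEAD-like call: fetches metadata without downloading the file.
meta = get_hf_file_metadata(hf_hub_url(repo_id="gpt2", filename="config.json"))
print(meta.commit_hash, meta.etag, meta.size, meta.location)  # no sha256 field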

@coyotte508 @Pierrci Do you think it would be possible in the /preupload endpoint (internal link) to also check if the exact same file already exists on the repo? (in addition to returning regular/lfs status). If not I don't think we can prevent empty commits that easily. Refusing empty commits on the /commit endpoint could be a solution but not sure we want to do that (+it would be a breaking change)

@HansBug something you can do for your use case: each time you do a sync, create a PR, push all the commits to this PR, and then merge it. If there is no diff, you'll get an error and you can just close the PR. If there is a diff, all the small commits will be squashed and you'll get a cleaner git history on the main branch. This is more or less what I recently implemented in https://github.com/huggingface/huggingface_hub/pull/1375 (feedback is welcome) but it is not merged/released yet. To interact with a PR, you can check this guide.
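
A rough sketch of that workflow (repo and file names are placeholders; it assumes the community-tab helpers create_pull_request, merge_pull_request and change_discussion_status):

from huggingface_hub import HfApi

api = HfApi()
pr = api.create_pull_request(repo_id="user/mirror", title="Sync resources")

# Push the small commits to the PR branch instead of main.
api.upload_file(
    path_or_fileobj="local/file.bin",
    path_in_repo="file.bin",
    repo_id="user/mirror",
    revision=pr.git_reference,  # e.g. "refs/pr/1"
)

try:
    api.merge_pull_request("user/mirror", pr.num)  # errors if there is no diff
except Exception:
    api.change_discussion_status("user/mirror", pr.num, new_status="closed")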

coyotte508 commented 1 year ago

You can use paths-info (not sure it's implemented in huggingface_hub) to get info on whether some files exist, and their size.

Refusing empty commits on the /commit endpoint could be a solution but not sure we want to do that (+it would be a breaking change)

I think it makes sense to prevent empty commits. Maybe it would need something on the gitaly side cc @XciD @co42, or a git hook?

Wauplin commented 1 year ago

You can use paths-info to get info on whether some files exist, and their size.

(not implemented in huggingface_hub yet.) But the issue would be the same: without the sha of the file, we can't know whether it changed or not. Checking the size is a good proxy, but not 100% reliable.

coyotte508 commented 1 year ago

But the issue would be the same: without the sha of the file, we can't know whether it changed or not. Checking the size is a good proxy, but not 100% reliable.

We do give the oid of the file, which is the output of git hash-object (not sure about its internal algo)

Wauplin commented 1 year ago

I've done a bit of investigation and it seems it is possible to recompute the git-hash of a file. It uses the sha1 algorithm plus some extra bytes (see https://stackoverflow.com/a/7225329).

Here is a script that computes the git-hash for 2 files, gets it from /paths-info, and compares them. To be honest I don't know yet whether we really want to go into this level of detail. It could be a good way to check if a file changed, but it's quite cumbersome. Also, computing the git-hash of an LFS pointer requires generating the pointer file, which might be error-prone (I got a working example here but don't know if there can be corner cases).

from hashlib import sha1

import requests
from huggingface_hub import hf_hub_download

pytorch_model_raw = b"""version https://git-lfs.github.com/spec/v1
oid sha256:7c5d3f4b8b76583b422fcb9189ad6c89d5d97a094541ce8932dce3ecabde1421
size 548118077
"""

def git_hash(content: bytes) -> str:
    # Inspired by https://stackoverflow.com/a/7225329
    sha = sha1()
    file_size = len(content)
    sha.update(f"blob {file_size}\0".encode("utf-8"))
    sha.update(content)
    return sha.hexdigest()

def get_remote_oid(repo_id: str, filename: str) -> str:
    response = requests.post(
        f"https://huggingface.co/api/models/{repo_id}/paths-info/main",
        json={"paths": [filename]},
    )
    return response.json()[0]["oid"]

print("gpt2", "config.json")
with open(hf_hub_download("gpt2", "config.json"), "rb") as f:
    content = f.read()
print("From Hub:", get_remote_oid("gpt2", "config.json"))
print("Computed:", git_hash(content))

print("gpt2", "pytorch_model.bin")
print("From Hub:", get_remote_oid("gpt2", "pytorch_model.bin"))
print("Computed:", git_hash(pytorch_model_raw))

output:

gpt2 config.json
From Hub: 10c66461e4c109db5a2196bff4bb59be30396ed8
Computed: 10c66461e4c109db5a2196bff4bb59be30396ed8
gpt2 pytorch_model.bin
From Hub: 11f855591b3269c74652daee461f96fc238875ed
Computed: 11f855591b3269c74652daee461f96fc238875ed
HansBug commented 1 year ago

@Wauplin Is there any documentation for the POST api/models/{repo_id}/paths-info/main API? I cannot find it.

Wauplin commented 1 year ago

@HansBug No, there isn't yet unfortunately. I'll add it officially to huggingface_hub in the coming days.

What I can tell you: it is a POST route at /api/{repo_type}s/{repo_id}/paths-info/{revision} that takes a JSON body with a "paths" list.

julien-c commented 1 year ago

I think it makes sense to prevent empty commits.

There are valid use cases for empty commits though (i use them from time to time, i even have a bash alias for it)

HansBug commented 1 year ago

I've done a bit of investigation and it seems it is possible to recompute the git-hash of a file. It uses the sha1 algorithm plus some extra bytes (see https://stackoverflow.com/a/7225329).

Here is a script that computes the git-hash for 2 files, gets it from /paths-info, and compares them.

I have tested this code. Perhaps a more effective approach to checking whether a large local file matches the LFS file in the repository is to obtain the LFS object ID (OID) and size, and then compare them with the local file. Below is example code:

import os
from hashlib import sha256
from typing import Tuple

import requests
from huggingface_hub import hf_hub_download

filename = hf_hub_download(
    'HansBug/browser_drivers_mirror',
    'google/111.0.5563.64/chromedriver_linux64.zip'
)

def get_sha256(file, chunk=1 << 20):
    sha = sha256()
    with open(file, 'rb') as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            sha.update(data)
    return sha.hexdigest()

def get_remote_oid_lfs(repo_id: str, filename: str) -> Tuple[str, int]:
    response = requests.post(
        f"https://huggingface.co/api/models/{repo_id}/paths-info/main",
        json={"paths": [filename]},
    )
    lfs_data = response.json()[0]['lfs']
    return lfs_data['oid'], lfs_data['size']

if __name__ == '__main__':
    print((get_sha256(filename), os.path.getsize(filename)))
    print(get_remote_oid_lfs(
        'HansBug/browser_drivers_mirror',
        'google/111.0.5563.64/chromedriver_linux64.zip'
    ))

The output is

('1a293bbc6d1a46f577cd1512e023c90fdbf2688bbf765c2d1c84f2a21b349b38', 7160988)
('1a293bbc6d1a46f577cd1512e023c90fdbf2688bbf765c2d1c84f2a21b349b38', 7160988)
HansBug commented 1 year ago

I think it makes sense to prevent empty commits.

There are valid use cases for empty commits though (i use them from time to time, i even have a bash alias for it)

Therefore, it may be beneficial for the default behaviour to be not to create a commit when the files remain the same. A user who wants to create a commit regardless of whether it is empty could then opt in via an argument to the function. 😃
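
For illustration, a hypothetical shape for such an option (the allow_empty_commit flag is invented here for the sake of the example; it is not an existing huggingface_hub parameter):

from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    repo_id="HansBug/browser_drivers_mirror",
    folder_path="./mirror",
    commit_message="Sync resources",
    allow_empty_commit=False,  # hypothetical flag: skip the commit if nothing changed
)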

HansBug commented 1 year ago

Here is my current version in use:

import os
from hashlib import sha256, sha1

import requests
from huggingface_hub import hf_hub_download

def hf_resource_check(local_filename, repo_id: str, file_in_repo: str, repo_type='model', revision='main',
                      chunk_for_hash: int = 1 << 20):
    response = requests.post(
        f"https://huggingface.co/api/{repo_type}s/{repo_id}/paths-info/{revision}",
        json={"paths": [file_in_repo]},
    )
    metadata = response.json()[0]
    if 'lfs' in metadata:
        is_lfs, oid, filesize = True, metadata['lfs']['oid'], metadata['lfs']['size']
    else:
        is_lfs, oid, filesize = False, metadata['oid'], metadata['size']

    if filesize != os.path.getsize(local_filename):
        return False

    if is_lfs:
        sha = sha256()
    else:
        sha = sha1()
        sha.update(f'blob {filesize}\0'.encode('utf-8'))
    with open(local_filename, 'rb') as f:
        # read the file in chunks so that big files do not cause OOM
        while True:
            data = f.read(chunk_for_hash)
            if not data:
                break
            sha.update(data)

    return sha.hexdigest() == oid

if __name__ == '__main__':
    local_lfs_file = hf_hub_download(
        'HansBug/browser_drivers_mirror',
        'google/111.0.5563.64/chromedriver_linux64.zip'
    )
    local_file = hf_hub_download('HansBug/browser_drivers_mirror', 'README.md')

    # chromedriver_linux64.zip vs chromedriver_linux64.zip
    print(hf_resource_check(
        local_lfs_file,
        'HansBug/browser_drivers_mirror',
        'google/111.0.5563.64/chromedriver_linux64.zip'
    ))
    # README.md vs README.md
    print(hf_resource_check(local_file, 'HansBug/browser_drivers_mirror', 'README.md'))
    # chromedriver_linux64.zip vs README.md
    print(hf_resource_check(local_lfs_file, 'HansBug/browser_drivers_mirror', 'README.md'))
    # README.md vs chromedriver_linux64.zip
    print(hf_resource_check(
        local_file,
        'HansBug/browser_drivers_mirror',
        'google/111.0.5563.64/chromedriver_linux64.zip'
    ))

The output is

True
True
False
False
Wauplin commented 1 year ago

Hey @HansBug, thanks for your snippet! I think we could indeed use it. I'll first implement the /paths-info endpoint and then think about integrating it into the existing create_commit method. I'll also check that it doesn't deteriorate the UX too much (because of the additional calls). I'll let you know when I work on it.

~also preventing empty commits server-side might be shipped before the huggingface_hub update :)~

narugo1992 commented 1 year ago

I identified an issue with the mentioned code related to the function os.path.getsize().

For instance, on both Linux and macOS platforms, the size of the example file example_text.txt that can be found here: https://github.com/narugo1992/hfmirror/blob/main/test/testfile/example_text.txt, is reported as 1503, as indicated in the GitHub actions log here: https://github.com/narugo1992/hfmirror/actions/runs/4685820992/jobs/8303277428#step:14:77.

However, on Windows, the file size returned by os.path.getsize() is 1538, according to the GitHub actions log here: https://github.com/narugo1992/hfmirror/actions/runs/4685820992/jobs/8303277589#step:14:65.

This discrepancy in the file size may cause the value of f"blob {file_size}\0".encode("utf-8") to differ between Windows and Linux platforms, subsequently resulting in a failed file check on Windows. Based on my personal testing so far, the file size on Linux is correct.

PS: This issue seems to occur only on non-LFS text files that contain non-ASCII content on Windows.

PS: I just tried:

  1. os.path.getsize(xxx)
  2. pathlib.Path(xxx).stat().st_size
  3. len(pathlib.Path(xxx).read_bytes())

All of the above return the same size on Windows (GitHub Actions log). Therefore, I believe this can be considered a confirmed cross-platform compatibility issue specific to Windows.

Wauplin commented 1 year ago

Wow, thanks @narugo1992 for spotting this compatibility issue and letting us know. I still haven't investigated further how we would like to integrate consistency checks in huggingface_hub to avoid empty commits (apart from working on the /paths-info endpoint in https://github.com/huggingface/huggingface_hub/pull/1435).

Overall it's a good reminder that we should be extra careful when implementing any logic strongly tied to git internals.

HansBug commented 1 year ago

@narugo1992 My guess is that, when the project is cloned on a Windows runner, the file is treated as a plain text file. As a result, line separators are \n on Linux and \r\n on Windows, leading to different file sizes.

However, when a text file is uploaded to Hugging Face, its line breaks are stored as \n on the Linux-based server. Therefore, on Windows, the function mentioned above reports a local file size that differs from the size of the same file on Hugging Face by exactly the number of line breaks.

This means that this issue only occurs when a non-LFS text file is uploaded and uses \r\n as the line break.
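
A quick way to observe this is to read the checked-out file in binary mode and normalize the line endings (a sketch using the example_text.txt file mentioned above, assuming git converted \n to \r\n at checkout):

with open("example_text.txt", "rb") as f:
    raw = f.read()

print(len(raw))                          # 1538 on a Windows checkout (\r\n)
print(len(raw.replace(b"\r\n", b"\n")))  # 1503, the size as stored on the Hub (\n)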

narugo1992 commented 1 year ago

@Wauplin Can you provide additional information about this new API? I received a temporary 405 error when trying to access https://huggingface.co/api/datasets/deepghs/few_shots/paths-info/main . As this error seems not to be reproducible, would it be possible to provide more information on when it may occur?

Wauplin commented 1 year ago

@narugo1992 The /paths-info endpoint is a POST route (hence the 405 when it is requested with GET). I have implemented a wrapper (not yet merged) in huggingface_hub (see https://github.com/huggingface/huggingface_hub/pull/1435). You can have a look at the implementation here. Not much to say about it: the URL has the form "{self.endpoint}/api/{repo_type}s/{repo_id}/paths-info/{revision}" (so yours was correct). It takes a "paths" list as body, plus optionally expand=True if you want more information about the files. Paths can only be file paths. Paths that do not exist are ignored without raising an exception.
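
Based on that description, a sketch of a raw call (the response fields follow the examples earlier in this thread; non-existent paths are simply omitted from the result):

import requests

response = requests.post(
    "https://huggingface.co/api/models/gpt2/paths-info/main",
    json={"paths": ["config.json", "does_not_exist.txt"], "expand": True},
)
for entry in response.json():
    # "lfs" is only present for LFS-tracked files
    print(entry["path"], entry["oid"], entry.get("lfs"))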

narugo1992 commented 1 year ago

The /paths-info endpoint is a POST route (hence the 405 when it is requested with GET). [...]

Thank you very much for this information. It is highly appreciated and valuable to us.

I recently implemented a library called narugo1992/hfmirror based on the interface mentioned in this issue. It can integrate and synchronize resources from the network (such as GitHub releases, danbooru, or any other custom resources) into a Hugging Face repository according to custom rules. It can be configured and mounted on platforms like GitHub Actions to achieve automatic synchronization similar to a mirror site, allowing datasets to be updated over the long term. It includes a batch upload and delete method (link) for repository files that does not create an empty commit when there are no substantive changes. However, there is still a problem on the Windows platform, as described in this issue: https://github.com/huggingface/huggingface_hub/issues/1411#issuecomment-1506599837 .

Currently, hfmirror has been released as version 0.0.3 and has been deployed on GitHub Actions for several different synchronization tasks. It is running stably and greatly alleviates the problem of excessive commits mentioned in https://github.com/huggingface/huggingface_hub/issues/1422. The reliability of this implementation has been proven in practice, and I hope it can be helpful to you. :smile:

Wauplin commented 1 year ago

Wow, that's an impressive library. Thanks for letting me know! Would you mind if we integrated some parts into the official huggingface_hub at some point? I need to dive deeper into the code, but the logic to check whether a file already exists and has changed (with paths-info + sha1) is really great :smiley:

~Implementation-wise, I saw that you have huggingface_hub required as a dependency but I haven't seen any use of it. Is there a reason for that?~ -> My bad, it's used in huggingface.py of course. Github search wasn't working apparently :grimacing:

narugo1992 commented 1 year ago

@Wauplin No problem, I'm happy to contribute my code to the huggingface_hub repository. Regarding the paths-info part, it has been verified to be reliable through actual usage, but the problem of newline characters on Windows still needs to be resolved. I'm not sure about Git's rules for deciding what is a text file (in theory there may be non-LFS binary files), so currently I don't know how to solve this problem; directly replacing \r\n may bring unknown risks. The optimal solution may need to come from the Hugging Face API: at a minimum, the interface would need to report whether a file is text or binary.

Regarding the use of huggingface_hub, my understanding is:

  1. For downloading models in AI application scenarios, hf_hub_download is close to perfect. But this function is not suitable for other cases, especially when files need to be downloaded to a specified location and not maintained in ~/.cache/huggingface. (When a large number of scattered files is downloaded, system disk space is consumed quickly and pointlessly, which is especially painful on Windows because C drive space is always tight.) In such cases, I tend to use hf_hub_url to get the download link and download the files with self-written functions.
  2. In hfmirror, a convenient method is needed to verify whether a file exists in a Hugging Face repository, but currently no official method is provided for this. Therefore, I send a HEAD request to the resource link; a 2xx or 304 response indicates that the file exists.
  3. Since hfmirror often runs long and unattended, and the resources crawled from the original website may not always be reachable (Hugging Face is no exception; it can unexpectedly error once in tens of thousands of accesses), we designed the srequest method to retry when resource access fails, ensuring that runs on GitHub Actions are not unexpectedly interrupted.

I think:

  1. If huggingface_hub could provide a more convenient create_commit method that avoids creating empty commits, many of the direct API accesses in hfmirror would no longer be necessary.
  2. A method like HfApi.is_file_exist would be important. :laughing:

Wauplin commented 1 year ago

Hey @narugo1992, I'm getting back to you and will try to be exhaustive regarding your message above. Some problems can be mitigated with existing tools in huggingface_hub. The biggest missing feature is a way to correctly compute the sha1 of a file and prevent empty commits.

1.

it has been verified to be reliable through actual usage, but the problem of newline characters on Windows still needs to be resolved. I'm not sure about Git's rules for deciding what is a text file (in theory there may be non-LFS binary files), so currently I don't know how to solve this problem; directly replacing \r\n may bring unknown risks

Agree with you that we should be extra careful here. Replacing \r\n seems too risky. I can't promise yet how we will tackle this, unfortunately.

2.

But this function is not suitable for other cases, especially when files need to be downloaded to a specified location and not maintained in ~/.cache/huggingface

Good news, this is now possible! In the newest version of huggingface_hub you can download files to a specific folder (see guide). It is not perfect though as you have a trade-off between using symlinks (=> keeping actual files in cache) and moving files outside of cache (=> file version is not kept, meaning reusing hf_hub_download will redownload the file even if it's up to date).

3.

system disk space is consumed quickly and pointlessly, which is especially painful on Windows because C drive space is always tight

In this case, what I would advise is to configure where the cache lives on your machine. You can set the HF_HOME environment variable to globally move everything "hf-related" to a specific folder (defaults to "~/.cache/huggingface"), or set HUGGINGFACE_HUB_CACHE to define where the cache should be (defaults to "$HF_HOME/hub").
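
A minimal sketch (example paths; the variables must be set before huggingface_hub is imported, since the defaults are resolved at import time):

import os

os.environ["HF_HOME"] = "D:/hf_home"                 # everything hf-related
os.environ["HUGGINGFACE_HUB_CACHE"] = "D:/hf_cache"  # hub cache only

from huggingface_hub import hf_hub_download  # import after setting the variables

path = hf_hub_download("gpt2", "config.json")
print(path)  # resolves under D:/hf_cache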

4.

In hfmirror, a convenient method is needed to verify whether a file exists in a Hugging Face repository, but currently no official method is provided for this. Therefore, I send a HEAD request to the resource link; a 2xx or 304 response indicates that the file exists.

A method like HfApi.is_file_exist would be important.

No official direct method indeed but there is a custom error EntryNotFoundError you can catch. What you can do is:

import huggingface_hub
from huggingface_hub import get_hf_file_metadata, hf_hub_url

try:
    # basically a HEAD call, handling redirection and raising custom errors
    # see https://huggingface.co/docs/huggingface_hub/v0.14.1/en/package_reference/file_download#huggingface_hub.get_hf_file_metadata
    _ = get_hf_file_metadata(hf_hub_url(...))
    # file exists
except huggingface_hub.utils.EntryNotFoundError:
    # file does not exist
    ...

If you use hf_hub_download, that translates to:

try:
    hf_hub_download(repo_id, filename, local_dir="path/to/local")
    # file exists
except huggingface_hub.utils.EntryNotFoundError:
    # file does not exist
    ...

5.

a 2xx or 304 response indicates that the file exists.

FYI, you can also use huggingface_hub.utils.hf_raise_for_status(response) instead of response.raise_for_status() when you do custom requests. It basically does the same but raises custom errors tailored for the HF use case. Check this guide for more details.
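
For instance (a sketch of a custom HEAD request checked with the helper; EntryNotFoundError is one of the tailored errors it can raise for a missing file):

import requests
from huggingface_hub.utils import EntryNotFoundError, hf_raise_for_status

response = requests.head("https://huggingface.co/gpt2/resolve/main/config.json")
try:
    hf_raise_for_status(response)  # HF-tailored errors instead of a bare HTTPError
    print("file exists")
except EntryNotFoundError:
    print("file does not exist")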

6.

we have designed the srequest method to implement necessary reconnection when the resource access fails, ensuring that the operation on Github Action will not be unexpectedly interrupted.

Yes I saw that and it makes perfect sense IMO! Just to let you know, in the latest version released this week, I have added a way to provide huggingface_hub with a custom Session object using configure_http_backend. This is quite a niche use case, so not very encouraged in the docs, but still possible!

import requests
from huggingface_hub import configure_http_backend

# Create a factory function that returns a Session with configured proxies
def backend_factory() -> requests.Session:
    session = requests.Session()
    (...) # code from https://github.com/narugo1992/hfmirror/blob/main/hfmirror/utils/session.py#L26
    return session

# Set it as the default session factory 
configure_http_backend(backend_factory=backend_factory)

7.

If huggingface_hub could provide a more convenient create_commit method that avoids creating empty commits

Yes definitely but will not be done short term I think (not in the next 2-3 weeks) as my plate is a bit full at the moment :confused:

narugo1992 commented 10 months ago

If you use hf_hub_download, that translates to:

try:
    hf_hub_download(repo_id, filename, local_dir="path/to/local")
    # file exists
except huggingface_hub.utils.EntryNotFoundError:
    # file does not exist
    ...

@Wauplin Does it have a method like hf_hub_download(repo_id, filename, local_file='xxxx')? I read the documentation and source code of the hf_hub_download function: when local_dir is given, subdirectories (mirroring the structure in the repo) are always created under local_dir.

This is very inconvenient for some customized download services. So maybe a simpler local_file argument could be supported?

Wauplin commented 9 months ago

Does it have a method like hf_hub_download(repo_id, filename, local_file='xxxx')?

Not at the moment, no. As you saw, you can only provide a directory, and the remote repo structure is kept as well. What you can do is download to a tmp directory and then move the file wherever you want on your machine. Keeping the repo structure is intentional and I don't think we want to move away from it. A new local_file argument would add complexity to an API that is already complex enough to document/apprehend, with the cache_dir/local_dir/local_dir_use_symlinks parameters.
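
A sketch of that workaround (the target filename is an example):

import shutil
import tempfile

from huggingface_hub import hf_hub_download

# Download into a throwaway cache, then copy the file to an arbitrary
# flat path, discarding the repo's directory structure.
with tempfile.TemporaryDirectory() as tmp:
    cached = hf_hub_download("gpt2", "config.json", cache_dir=tmp)
    shutil.copy(cached, "my_config.json")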

Wauplin commented 3 months ago

Here is a PR to (finally) address this: https://github.com/huggingface/huggingface_hub/pull/2389. We rely on the remote git-hash-object for regular files and the remote sha256 for LFS files to determine whether the files have changed. If a commit would end up empty, we warn the user and skip it.