huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Thread safety when uploading files from multiple threads or processes #1422

Open narugo1992 opened 1 year ago

narugo1992 commented 1 year ago

Is your feature request related to a problem? Please describe. I got HTTP 412 errors when I tried to upload images to a dataset from multiple parallel runners: https://github.com/narugo1992/gchar/actions/runs/4599743391/jobs/8125512962#step:14:272

This error is most likely caused by conflicts between multiple commits made simultaneously during multi-threaded uploading.

Describe the solution you'd like The solution I can think of is: when a 412 error caused by multi-threaded uploading is detected, retry automatically (of course, since retrying is not appropriate in every case, this feature could be made optional).

Describe alternatives you've considered If a specific exception type were raised when such errors occur, together with a function for manually refreshing and retrying, the user could also control whether to retry or not. This could serve as an alternative to the solution above.

Additional context This scenario is common when dealing with a large number of files with a large total size, where the data needs to be updated automatically via the scheduled execution features of online platforms such as GitHub Actions. Since online runners cannot provide much disk space, resources have to be generated and uploaded at the same time to keep disk usage low. In this situation, stable concurrent uploading (or other processing, such as deletion) of data in the dataset would greatly simplify deploying automatic update functionality for the dataset.

Wauplin commented 1 year ago

Hi @narugo1992, thanks for raising the question. I'm not sure yet how/if we want to solve this issue at the moment, given the particular use case you have. In general, concurrent writes in git are quite difficult to tackle since they might change the same resource and we don't want to lose information in the process.

In your particular case it seems quite straightforward to wrap the upload_file call in a try/except and implement the backoff strategy on your side. This gives you flexibility (how many attempts? how long to wait? on which errors? ...). A nice property is that if you are committing LFS files, you will not have to re-upload them on each attempt.

# /!\ untested and quite simple implementation
import time

from huggingface_hub import upload_file
from huggingface_hub.utils import HfHubHTTPError

MAX_ATTEMPTS = 5

def upload_with_retries(*args, **kwargs):
    attempt = 0
    while True:
        attempt += 1
        try:
            return upload_file(*args, **kwargs)
        except HfHubHTTPError as error:
            # only retry on HTTP 412 (precondition failed, i.e. a concurrent commit conflict)
            if error.response.status_code != 412 or attempt >= MAX_ATTEMPTS:
                raise
            time.sleep(1)

Also, just so you know, it's possible to upload/delete several files at the same time using the low-level create_commit method. That might solve your concurrency issue (you can upload tens of files in the same commit).
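
For illustration, here is a minimal, untested sketch of grouping several files into a single commit with create_commit (the repo id, file paths and commit message are made-up placeholders):

# minimal sketch: upload several files in one commit via create_commit
# (repo_id, paths and message are hypothetical placeholders)
from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()
operations = [
    CommitOperationAdd(path_in_repo="images/0001.png", path_or_fileobj="local/0001.png"),
    CommitOperationAdd(path_in_repo="images/0002.png", path_or_fileobj="local/0002.png"),
]
api.create_commit(
    repo_id="user/my-dataset",
    repo_type="dataset",
    operations=operations,
    commit_message="Add a batch of images",
)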


Note: I think the more general question you are raising is the ability to update multiple files concurrently or on the fly, the same way you would with a local filesystem or an S3 bucket. This is currently not so easy because of the underlying git-based backend, and will most probably still be the case in the short/mid term. cc @Pierrci @julien-c about recent discussions to further "hide" the git constraints

narugo1992 commented 1 year ago

@Wauplin One piece of bad news: even after I switched to a retry helper similar to the one above (checking for status code 412), the problem was still not resolved. There is a high probability of hitting a 500 error when running in multi-threaded mode. Please refer to the log for more details:

Wauplin commented 1 year ago

@narugo1992 I'm sorry you're experiencing new errors. Unfortunately, the Hub is not designed to be used as a database with lots of (concurrent) updates. What you could do here is retry the commit on both HTTP 412 and HTTP 500 errors (it's just a matter of adding one condition to the snippet above). However, even if you do this you will face other issues quite soon.
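
For illustration, the only change needed in the upload_with_retries sketch above would be the retry condition, e.g.:

# in the except clause of the sketch above, retry on both HTTP 412 and HTTP 500
if error.response.status_code not in (412, 500) or attempt >= MAX_ATTEMPTS:
    raise
time.sleep(1)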

For example, I saw that this dataset has had 12k commits in the last 48h. In theory we don't have a limit on the number of commits you can push to a repo. In practice it might become unusable once it has hundreds of thousands of commits (at this speed that's 1M commits in 5 months). Idk exactly which limitations you are facing when uploading the files, but I think it would be good to try to group the uploads into fewer commits. Is that possible?

narugo1992 commented 1 year ago

> For example, I saw that this dataset has had 12k commits in the last 48h. In theory we don't have a limit on the number of commits you can push to a repo. In practice it might become unusable once it has hundreds of thousands of commits (at this speed that's 1M commits in 5 months). Idk exactly which limitations you are facing when uploading the files, but I think it would be good to try to group the uploads into fewer commits. Is that possible?

Thank you for the response. In fact, this dataset is currently in the creation phase, and the existing data is being uploaded continuously. The total number of commits is expected to be about 1.5k after completion, with perhaps only a few hundred additional commits each month. The reason for doing it this way is that issue #1411 is still unresolved: if we upload everything in a single commit, the GitHub Actions runner does not have enough disk space, and if we split the data and upload it in several commits, some of them end up as empty commits. Therefore, currently we can only upload files one by one.

Wauplin commented 1 year ago

Ah I see, so it shouldn't be that bad after all. Adding backoff on HTTP 500 should most probably resolve your problem.

About uploading 1 by 1 vs in a group, I'm not sure I understand the link. It seems you have a helper hf_need_upload. What you can do is, for each file you crawl, check if it should be uploaded and add it to a queue if so. Once the queue reaches 50 items, you create the commit. That should most probably avoid empty commits, shouldn't it? (or are the crawlers really working in parallel and uploading the same files regularly?). If you do a queue of 50 CommitOperationAdd (i.e. 50 files), GitHub Actions disk space should not complain too much.
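
As a rough, untested sketch of this idea (crawl_files is a made-up placeholder for the crawling loop, and hf_need_upload stands for the helper mentioned above):

# rough sketch: batch uploads into one commit per 50 files
# (crawl_files and hf_need_upload are hypothetical placeholders)
from huggingface_hub import HfApi, CommitOperationAdd

BATCH_SIZE = 50
api = HfApi()
queue = []

def flush(queue):
    # create a commit only if the queue is non-empty -> no empty commits
    if queue:
        api.create_commit(
            repo_id="user/my-dataset",
            repo_type="dataset",
            operations=list(queue),
            commit_message=f"Add {len(queue)} files",
        )
        queue.clear()

for local_path, path_in_repo in crawl_files():        # hypothetical crawler
    if hf_need_upload(local_path, path_in_repo):      # user's own check
        queue.append(CommitOperationAdd(path_in_repo=path_in_repo, path_or_fileobj=local_path))
    if len(queue) >= BATCH_SIZE:
        flush(queue)

flush(queue)  # commit the remainder, if any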

narugo1992 commented 1 year ago

> About uploading 1 by 1 vs in a group, I'm not sure I understand the link. It seems you have a helper hf_need_upload. What you can do is, for each file you crawl, check if it should be uploaded and add it to a queue if so. Once the queue reaches 50 items, you create the commit. That should most probably avoid empty commits, shouldn't it? (or are the crawlers really working in parallel and uploading the same files regularly?). If you do a queue of 50 CommitOperationAdd (i.e. 50 files), GitHub Actions disk space should not complain too much.

This is indeed a feasible solution, but it requires writing additional code. Therefore, I would like to ask when issue #1411 will be resolved. If it will take a long time, I will redesign that part; otherwise, I can simply use the upload_folder function to solve this problem once the issue is resolved. 😄

Wauplin commented 1 year ago

@narugo1992 We'll for sure work on ignoring empty commits, but not necessarily in the short term. At least I can't promise when we'll deliver it. Sorry about that, I hope you'll be able to implement the workaround :)

narugo1992 commented 1 year ago

> @narugo1992 We'll for sure work on ignoring empty commits, but not necessarily in the short term. At least I can't promise when we'll deliver it. Sorry about that, I hope you'll be able to implement the workaround :)

Ok, fine, I will manually wrap the operations on the repo. :)