huggingface / hf-hub

Rust client for the huggingface hub aiming for minimal subset of features over `huggingface-hub` python package
151 stars 64 forks source link

Feat/sync retry download #23

Open benedikt-schaber opened 1 year ago

benedikt-schaber commented 1 year ago

What this PR does

This PR is related to #18. It does not support resuming downloads. Rather it adds retry capabilities to the sync API. It adds the max_retries field to ApiBuilder and Api that is then used in download. We stream the download to a tempfile, and, if we encounter an error retry at most max_retries times, using range requests to only request the data we do not have already. We use a new struct ResumableReader to keep track of the number of bytes we downloaded successfully, it is basically just a reduced version of Progressbar. We assume all bytes that we did receive to be correct.

Left ToDo

Testing

Testing this functionality properly is a bit harder, since we need to ensure the failure of the request. Further, it would be nice to test that we indeed do not continue from the start of the file.

There are different ways we could go about achieving this, here are some I evaluated:

Using a mock server would be simplest, however many existing frameworks, like wiremock or httpmock do not support us returning an ok, in the range case 206, request but then streaming and failing the body if I understood that correctly. So I think we would be forced to implement a proper serve on our own.

Second would be a bit annoying, since we not only use the client, but then also the response (although we just turn that into a Reader), so it would be doable. We of course would not test if the Reader we get from the request truly behaves like we expect it (and like our mock does), but I think this is acceptable.

Third would be much like second although reducing what we need to mock, but also changing some things about the architecture which may or may not be acceptable.

So, this is the main area I am looking to for some feedback, since I think whatever choice is taken, it will have an effect on the rest of the project.

Additionally if we also want to test that it does indeed not just always resume from the start we would also potentially get some issues with std::io::copy since we do not know its buffer size. But that should be rather easy to resolve, so I will look into it once a basic testing strategy is outlined.