stas00 opened this issue 3 years ago
Yes, I think that can help, we have similar issues in the huggingface_hub repository.
I'm wondering if these issues come from the fact that these tests are very quick to run, therefore bursting the server, which has trouble handling all the requests. It also happens with tokenizers, which also run fast, but not with models.
If that's the case then a time.sleep(3) would work, but spreading the tests out so that they're not run sequentially could also work.
cc @julien-c
From what I'm observing this issue doesn't happen anymore - should we close the issue and reopen if the network failures reappear at some point?
Sounds good, @LysandreJik
OK, it's happening again:
2021-09-28T00:56:00.8216138Z 502 Server Error: Bad Gateway for url: https://huggingface.co/patrickvonplaten/t5-tiny-random/resolve/main/config.json
2021-09-28T00:56:00.8217204Z ___________ TestDeepSpeedWithLauncher.test_do_eval_no_train_1_zero3 ____________
Our deepspeed integration tests are now integrated into the Deepspeed core CI and they report these failures.
You can see other HF projects reporting this issue as well: e.g. see this thread: https://huggingface.slack.com/archives/C01BWJU0YKW/p1632819750394400
I wonder if we should have a way not only to retry the download but also to gracefully recover - most likely a special setting in our test suite so that when a network failure occurs despite the retries, the test skips rather than fails. We wouldn't use that on our CI, but for external users it'd be important not to interfere with their testing needs.
Reopening this since this is a problem.
E.g. the deepspeed section of our tests, which runs on the Deepspeed CI, intermittently fails to fetch files from the hub:
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url:
https://huggingface.co/sshleifer/tiny-gpt2/resolve/main/config.json
which impacts their CI.
I think perhaps we need a retry mechanism in the core of the network fetches and not put the burden on the tests.
@LysandreJik
Yes! How would you like to tackle this? With a retry on each test, with a small pause? I wonder how best to handle it, given that chaining the retries with no pause would probably result in the same issue happening over and over again, while adding a pause might significantly slow the test suite down.
Do you have any ideas regarding how to solve this best?
I believe the retry mechanism should be part of the download API, since that's the unreliable link in the chain.
I propose to have new arguments in the download API with sensible defaults:
- try_times=3 - how many times to try before giving up
- try_sleep_secs=1 - how long to sleep between tries
With these defaults the longest delay is 2 seconds, which is probably not an issue for the test suite. Especially if we cache downloads.
If it can't download after 3 tries then if the client is OK then the server is at fault and it needs a higher capacity/scalability to handle a high request rate.
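A minimal sketch of what that could look like (retry_request is a hypothetical helper, not an existing transformers function; the defaults mirror the proposal above):

import time

import requests


def retry_request(url, try_times=3, try_sleep_secs=1, **kwargs):
    # Fetch `url`, retrying on 5xx server errors and connection errors,
    # sleeping `try_sleep_secs` between attempts, for at most `try_times` attempts.
    last_exc = None
    for attempt in range(try_times):
        try:
            response = requests.get(url, **kwargs)
            response.raise_for_status()
            return response
        except requests.exceptions.HTTPError as exc:
            # A 4xx is a client error: retrying won't help, so re-raise immediately.
            if exc.response is not None and exc.response.status_code < 500:
                raise
            last_exc = exc
        except requests.exceptions.ConnectionError as exc:
            last_exc = exc
        if attempt < try_times - 1:
            time.sleep(try_sleep_secs)
    raise last_exc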
That sounds good, even if I'm a bit afraid that retrying in succession won't solve much. When a test fails for server error, then usually other tests fail. I'm still open to trying it out to see if it improves these errors!
Would you like to give it a try? I'm guessing only this method needs to be modified: https://github.com/huggingface/transformers/blob/efea0f868bd381244e3cef51b388293e41a36d1e/src/transformers/file_utils.py#L1594
cc @julien-c as this is a safeguard against the server's instabilities.
BTW @LysandreJik I think we should soon switch from file_utils to huggingface_hub, no? None of this is really transformers-specific?
Indeed, some of the logic could be upstreamed in huggingface_hub
(was pushing this back as I'm a fervent believer of "if it ain't broke, don't fix it", especially for such a core component of the library which doesn't need to evolve much)
Yes, same feeling. However, I think we should try to prevent the two codebases from diverging too much, since initially the code was extracted from transformers anyway
(failure retry is an example of a quite big change, for instance)
Maybe if we do this, an option would be to upstream the same change to huggingface_hub then?
Yes, that sounds like a good plan. We've started moving some methods (like HfApi) to huggingface_hub anyway, so for iso-behavior methods I'm fine with moving them to huggingface_hub sooner rather than later.
Let's go with the retry option first in transformers, and I'll take the opportunity to upstream it in huggingface_hub once we have settled on the logic and it is merged in transformers.
As @sgugger mentions offline, this issue also appears in the push to hub methods (403 errors, 500 errors), so maybe adding a retry option there for testing would make sense as well
Should these incidents (repetitive failures) also be logged, or does someone review the server logs to ensure that these failures aren't indicative of an issue with the server?
We need to have a clear distinction between a failure due to network transport issues vs server's inability to cope with the traffic. If the server is overloaded, then of course re-trying won't help. But then we need to fix the server not to be overloaded.
FWIW, this issue continues on our CI:
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.16.1/metrics/sacrebleu/sacrebleu.py
Do you have a link for ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.16.1/metrics/sacrebleu/sacrebleu.py?
cc @lhoestq
Oh, I was just giving an example of an intermittent failure on our CI. It was fine when the CI restarted.
So with a retry it could have been avoided, since all other files were fetched or looked up just fine.
Hi! If it can help, note that in datasets we've already added a retry mechanism in file_utils.py
@lhoestq, I didn't follow all the paths, but it appears that max_retries is either 0 or 1 almost everywhere in datasets unless the user overrides it. Unless you believe a single retry is sufficient.
But, yes, this is what we want in transformers! Thank you!
Additionally, I'm thinking this: most of the time, on setups like CI or a developer's box, most of the datasets and transformers files have already been cached. Would it make sense, when the fetch fails, to simply switch to using the local file and tell the user that this was done?
I believe this is an even more realistic use case and would reduce the amount of failures due to network issues 10-100x.
If you agree, I'd use the following algorithm, with each step being needed only if the previous one fails.
Here is another example of CI intermittent failure which could have been re-tried and not fail the whole CI:
E requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://huggingface.co/api/models/facebook/mbart-large-50-one-to-many-mmt
I'm all for a retry mechanism, especially given the recent errors we've been seeing in the CI.
Regarding the fetched files, I'd rather we keep it the same as it is right now: we have a local_files_only keyword that enables fetching from the local folder. With this argument, we have this option as an opt-in, rather than as a behind-the-scenes method, which I think is important.
Otherwise, the user might use from_pretrained to fetch the latest version of a repository, and the version fetched could actually be the latest they have on their setup, which is a small (but potentially impactful) breaking change.
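For reference, the opt-in looks roughly like this (the model name is only an example):

from transformers import AutoConfig

# Opt-in: resolve files from the local cache only, never hitting the Hub.
# Raises an error if the files have not been cached yet.
config = AutoConfig.from_pretrained("sshleifer/tiny-gpt2", local_files_only=True)

# Default: check the Hub for the latest revision (network required).
config = AutoConfig.from_pretrained("sshleifer/tiny-gpt2")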
~Would you have some bandwidth to implement the retry mechanism?~ I should have a bit of time to tackle this by the time I'm back from holidays.
We can't use local_files_only on CI since then we would miss updated remote data.
I agree with your discussion of the user-mode.
Here are a few more extensions to my proposal:
- We can differentiate between CI-mode and user-mode: in CI-mode (env var activated) we can use the algo suggested in https://github.com/huggingface/transformers/issues/12061#issuecomment-987448258
- In parallel, I think there is a difference between getting a 50x and a 40x response. Regardless of CI-mode, a 40x is a client error and should not fall back to the local cache; a 50x is a server error and thus the local cache should be used.
- One caveat: for non-public repos the codes are obscured so as not to expose the private repo layout, and an unauthenticated user always gets a 40x regardless of the true path, but I think this falls neatly into the 40x group anyway - a client error.
So here is an updated algo:

# in this algo a successful "fetch the file from online or cache" exits the algo
if env["TRANSFORMERS_CI_CACHE_RETRY_ON_500"]:
    1. try to fetch the file
    2. if 50x: look up the local cache
       else: fail
    3. if not in cache: sleep and retry to fetch the file
    4. if 50x: sleep and retry to fetch the file
       else: fail
    5. assert with: bailing after having retried 3 times and no local version found cached
else:  # normal mode
    1. try to fetch the file
    2. do nothing
    3. if 50x: sleep and retry to fetch the file
       else: fail
    4. if 50x: sleep and retry to fetch the file
       else: fail
    5. assert with: bailing after having retried 3 times

The 2 halves are almost the same, the only addition being the cache lookup in CI-mode for step 2 - hence the "do nothing" 2nd step in the 2nd half.
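A rough Python sketch of that flow, just to make the idea concrete (fetch_with_ci_fallback, CACHE_DIR, and _cache_path are hypothetical names, not the real transformers cache machinery):

import os
import time

import requests

CACHE_DIR = os.path.expanduser("~/.cache/demo_fetch")  # illustrative cache location


def _cache_path(url):
    # Map a URL to a flat filename in the local cache (illustration only).
    return os.path.join(CACHE_DIR, url.replace("/", "_").replace(":", "_"))


def fetch_with_ci_fallback(url, retries=3, sleep_secs=1):
    # Sketch of the algo above: retry on 5xx; in CI-mode, fall back to a
    # previously cached copy before retrying further.
    ci_mode = bool(os.environ.get("TRANSFORMERS_CI_CACHE_RETRY_ON_500"))
    os.makedirs(CACHE_DIR, exist_ok=True)
    for attempt in range(retries):
        try:
            response = requests.get(url)
            response.raise_for_status()
            with open(_cache_path(url), "wb") as f:  # keep a copy for next time
                f.write(response.content)
            return response.content  # a successful fetch exits the algo
        except (requests.exceptions.HTTPError, requests.exceptions.ConnectionError) as exc:
            status = getattr(getattr(exc, "response", None), "status_code", None)
            if status is not None and status < 500:
                raise  # 40x: client error, don't mask it with the cache
            if ci_mode and os.path.exists(_cache_path(url)):
                with open(_cache_path(url), "rb") as f:
                    return f.read()  # 50x in CI-mode: serve the cached copy
            time.sleep(sleep_secs)  # no cached copy: pause and retry
    raise ConnectionError(f"bailing after retrying {retries} times: {url}")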
What do you think?
And of course the same should apply to datasets and transformers.
Thank you for handling the github bot - would love to make time for this, this or next week.
We have a group of tests that require a reliable network, which is never 100% reliable, so they have been failing intermittently for many months.
I propose that those tests be rewritten with an unstable network in mind and include a retry with a pause, e.g. time.sleep(3), between attempts.
E.g. one of the candidates is:
tests/test_hf_api.py::HfApiEndpointsTest::test_list_repos_objs
but also the recent tests that push to the hub.
Perhaps a simple retry context manager can be added to testing_utils.py, which would trap exceptions and retry after a pause, and then we could simply wrap the content of existing tests in that context manager (see the sketch below). It could accept the number of retries and the sleep time between retries as optional arguments.
Of course, it's probably even better to also make it a decorator, e.g.
@unreliable_network_retry
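A minimal sketch of the decorator flavour (the helper name comes from the comment above; the retried exception types and the defaults are assumptions):

import functools
import time

import requests


def unreliable_network_retry(retries=3, sleep_secs=3):
    # Retry a flaky network-dependent test a few times before letting it fail.
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return test_fn(*args, **kwargs)
                except (requests.exceptions.HTTPError, requests.exceptions.ConnectionError):
                    if attempt == retries - 1:
                        raise  # out of attempts: surface the real failure
                    time.sleep(sleep_secs)  # give the server a moment to recover
        return wrapper
    return decorator


# Usage on a test method, e.g.:
# @unreliable_network_retry(retries=3, sleep_secs=3)
# def test_list_repos_objs(self):
#     ...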
@LysandreJik