huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[testing] making network tests more reliable #12061

Open stas00 opened 3 years ago

stas00 commented 3 years ago

We have a group of tests that require a reliable network, which is never 100% reliable, so they have been failing intermittently for many months.

I propose that those tests be rewritten with an unreliable network in mind, using:

  1. time.sleep(3)
  2. retry 3-5 times

e.g. one of the candidates is:

tests/test_hf_api.py::HfApiEndpointsTest::test_list_repos_objs

but also recent tests that push to hub.

Perhaps a simple retry context manager could be added to testing_utils.py that traps exceptions and retries after a pause. Then the content of existing tests could simply be wrapped in that context manager, e.g.:

with RetryAfterSleepTest():
    # normal test code

It could accept the number of retries and the sleep time between retries as optional arguments.

Of course, it would probably be even better to also provide it as a decorator, e.g. @unreliable_network_retry.
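
A minimal sketch of what such a helper could look like, assuming it lives in testing_utils.py; the decorator name follows the suggestion above, and the exception list and defaults are placeholders rather than an existing transformers API (note that a with block's body can't be re-executed, so the decorator form is the practical one):

import time
from functools import wraps

import requests


def unreliable_network_retry(retries=3, sleep=3.0,
                             exceptions=(requests.exceptions.RequestException, OSError)):
    """Retry a flaky network test a few times, pausing between attempts."""

    def decorator(test_func):
        @wraps(test_func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return test_func(*args, **kwargs)
                except exceptions:
                    if attempt == retries:
                        raise  # out of retries: surface the real failure
                    time.sleep(sleep)
        return wrapper

    return decorator


# usage in a test class:
# @unreliable_network_retry(retries=5, sleep=3)
# def test_list_repos_objs(self):
#     ...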

@LysandreJik

LysandreJik commented 3 years ago

Yes, I think that can help, we have similar issues in the huggingface_hub repository.

I'm wondering whether these issues come from the fact that these tests are very quick to run, thereby bursting the server with requests that it has trouble handling. It also happens with tokenizers, which also run fast, but not with models.

If that's the case then a time.sleep(3) would work, but spreading the tests so that they're not run sequentially could also work.

cc @julien-c

LysandreJik commented 3 years ago

From what I'm observing this issue doesn't happen anymore - should we close the issue and reopen if the network failures reappear at some point?

stas00 commented 3 years ago

Sounds good, @LysandreJik

stas00 commented 3 years ago

OK, it's happening again:

2021-09-28T00:56:00.8216138Z 502 Server Error: Bad Gateway for url: https://huggingface.co/patrickvonplaten/t5-tiny-random/resolve/main/config.json
2021-09-28T00:56:00.8217204Z ___________ TestDeepSpeedWithLauncher.test_do_eval_no_train_1_zero3 ____________

Our deepspeed integration tests are now integrated into the Deepspeed core CI and they report these failures.

You can see other HF projects reporting this issue as well: e.g. see this thread: https://huggingface.slack.com/archives/C01BWJU0YKW/p1632819750394400

I wonder if we should not only have a way to retry the download but also a way to recover gracefully - most likely a special setting in our test suite so that when a network failure occurs despite the retries, the test skips rather than fails. We wouldn't use that on our CI, but for external users it'd be important not to interfere with their testing needs.
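
A rough sketch of that skip-rather-than-fail idea; the env var name and decorator are made up for illustration and are not an existing setting:

import os
from functools import wraps

import pytest
import requests


def skip_on_network_failure(test_func):
    """If the opt-in env var is set, turn remaining network failures into skips instead of failures."""

    @wraps(test_func)
    def wrapper(*args, **kwargs):
        try:
            return test_func(*args, **kwargs)
        except (requests.exceptions.RequestException, ConnectionError) as exc:
            if os.environ.get("TRANSFORMERS_SKIP_NETWORK_FAILURES"):
                pytest.skip(f"network failure despite retries: {exc}")
            raise
    return wrapper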

stas00 commented 3 years ago

Reopening this since this is a problem.

e.g. the deepspeed section of our tests, which runs on the Deepspeed CI, intermittently fails to fetch files from the hub:

requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: 
https://huggingface.co/sshleifer/tiny-gpt2/resolve/main/config.json

which impacts their CI.

I think perhaps we need a retry mechanism in the core network-fetching code rather than putting the burden on the tests.

@LysandreJik

LysandreJik commented 3 years ago

Yes! How would you like to tackle this? With a retry on each test, with a small pause? I wonder how best to handle it, given that retrying with no pause would probably result in the same issue happening over and over again, while adding a pause might significantly slow the test suite down.

Do you have any ideas regarding how to solve this best?

stas00 commented 3 years ago

I believe the retry mechanism should be part of the download API, since that's the unreliable link in the chain.

I propose to have new arguments in the download API with sensible defaults: a retry count and a sleep time between retries.

With these defaults the longest delay is 2 seconds, which is probably not an issue for the test suite, especially if we cache downloads.

If it can't download after 3 tries and the client is fine, then the server is at fault and needs higher capacity/scalability to handle the request rate.
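
Illustrative only: the argument names and defaults below are one reading of the proposal (3 tries, at most about 2 seconds of waiting in total), not the actual file_utils signature:

import time

import requests


def http_get_with_retries(url, max_retries=3, retry_sleep=1.0, **request_kwargs):
    """Fetch a URL, retrying transient failures with a short pause in between."""
    last_exc = None
    for attempt in range(max_retries):
        try:
            response = requests.get(url, **request_kwargs)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as exc:
            last_exc = exc
            if attempt < max_retries - 1:
                time.sleep(retry_sleep)  # with the defaults: 2 sleeps of 1s = 2s total
    raise last_exc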

LysandreJik commented 3 years ago

That sounds good, even if I'm a bit afraid that retrying in succession won't solve much: when a test fails with a server error, usually other tests fail too. I'm still open to trying it out to see if it improves these errors!

Would you like to give it a try? I'm guessing only this method needs to be modified: https://github.com/huggingface/transformers/blob/efea0f868bd381244e3cef51b388293e41a36d1e/src/transformers/file_utils.py#L1594

cc @julien-c as this is a safeguard against the server's instabilities.

julien-c commented 3 years ago

BTW @LysandreJik, I think we should soon switch from file_utils to huggingface_hub, no?

none of this is really transformers-specific?

LysandreJik commented 3 years ago

Indeed, some of the logic could be upstreamed into huggingface_hub (I was pushing this back as I'm a fervent believer in "if it ain't broke, don't fix it", especially for such a core component of the library which doesn't need to evolve much).

julien-c commented 3 years ago

Yes, same feeling. However, I think we should try to prevent the two codebases from diverging too much, since the code was initially extracted from transformers anyway.

(failure retry, for instance, is quite a big change)

If we do this, maybe an option would then be to upstream the same change to huggingface_hub?

LysandreJik commented 3 years ago

Yes, that sounds like a good plan. We've started moving some methods (like HfApi) to huggingface_hub anyway, so for iso-behavior methods I'm fine with moving them to huggingface_hub sooner rather than later.

Let's go with the retry option first in transformers, and I'll take the opportunity to upstream it in huggingface_hub once we have settled on the logic and it is merged in transformers.

LysandreJik commented 3 years ago

As @sgugger mentioned offline, this issue also appears in the push-to-hub methods (403 errors, 500 errors), so maybe adding a retry option there for testing would make sense as well.

stas00 commented 3 years ago

> That sounds good, even if I'm a bit afraid that retrying in succession won't solve much: when a test fails with a server error, usually other tests fail too. I'm still open to trying it out to see if it improves these errors!

Should these incidents (repetitive failures) also be logged, or does someone review server logs to ensure that these failures aren't indicative of an issue with the server?

We need a clear distinction between a failure due to network transport issues and the server's inability to cope with the traffic. If the server is overloaded, then of course retrying won't help; but then we need to fix the server so that it doesn't get overloaded.

stas00 commented 3 years ago

FWIW, this issue continues on our CI:

ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.16.1/metrics/sacrebleu/sacrebleu.py

LysandreJik commented 3 years ago

Do you have a link for ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.16.1/metrics/sacrebleu/sacrebleu.py ?

cc @lhoestq

stas00 commented 3 years ago

Oh, I was just giving an example of an intermittent failure on our CI. It was fine when the CI was restarted.

So with a retry it could have been avoided, since all the other files were fetched or looked up just fine.

lhoestq commented 3 years ago

Hi! If it can help, note that in datasets we've already added a retry mechanism in file_utils.py.

stas00 commented 2 years ago

@lhoestq, I didn't follow all the paths, but it appears that max_retries is either 0 or 1 almost everywhere in datasets unless the user overrides it. Unless you believe a single retry is sufficient.

But, yes, this is what we want in transformers! Thank you!

stas00 commented 2 years ago

Additionally, I'm thinking about the following: most of the time, on setups like CI or a developer's box, most of the datasets and transformers files have already been cached.

Would it make sense, if

  1. there is a failure to fetch a model, a dataset, or a support file
  2. and there is already a cached version of the same file

to simply switch to using the local file and tell the user that this was done?

I believe this is an even more realistic use case, and it would reduce the number of failures due to network issues by 10-100x.

If you agree I'd use the following algorithm:

  1. try to fetch the file
  2. look up local cache
  3. retry to fetch the file
  4. retry to fetch the file
  5. fail with: bailing after retrying 3 times and no local version found cached

with each step being needed only if the previous one fails.
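
A short sketch of that algorithm; fetch and lookup_cache are hypothetical stand-ins for the real file_utils internals:

import time


def fetch_or_use_cache(url, fetch, lookup_cache, retries=3, sleep=1.0):
    """Steps 1-5 above: fetch, fall back to the cache once, keep retrying, then bail."""
    for attempt in range(retries):
        try:
            return fetch(url)  # steps 1, 3 and 4
        except OSError:
            if attempt == 0:
                cached = lookup_cache(url)  # step 2: is a local copy available?
                if cached is not None:
                    print(f"couldn't reach {url}, falling back to the cached copy at {cached}")
                    return cached
            if attempt < retries - 1:
                time.sleep(sleep)
    raise OSError(f"bailing after retrying {retries} times and no local version found cached")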

stas00 commented 2 years ago

Here is another example of an intermittent CI failure which could have been retried instead of failing the whole CI:

E           requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://huggingface.co/api/models/facebook/mbart-large-50-one-to-many-mmt

Source: https://app.circleci.com/pipelines/github/huggingface/transformers/31002/workflows/327de938-0361-420e-abb5-c35d45bca5bb/jobs/318450

LysandreJik commented 2 years ago

I'm all for a retry mechanism, especially given the recent errors we've been seeing in the CI.

Regarding the fetched files, I'd rather we keep it the same as it is right now: we have a local_files_only keyword that enables fetching from the local folder. With this argument, we have this option as an explicit opt-in rather than a behind-the-scenes mechanism, which I think is important.

Otherwise, the user might use from_pretrained to fetch the latest version of a repository, and the version fetched could actually be the latest they have on their setup, which is a small (but potentially impactful) breaking change.
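
For reference, the existing opt-in looks like this from the user side (the model name is picked just for illustration):

from transformers import AutoModel

# Explicit opt-in: resolve everything from the local cache, never hit the network.
model = AutoModel.from_pretrained("bert-base-uncased", local_files_only=True)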

~~Would you have some bandwidth to implement the retry mechanism?~~ I should have a bit of time to tackle this by the time I'm back from holidays.

stas00 commented 2 years ago

We can't use local_files_only on CI, since then we would miss updated remote data.

I agree with your reasoning about the user-facing mode.

Here are a few more extensions to my proposal:

  1. we can differentiate between a CI mode and a user mode. In CI mode (activated via an env var) we can use the algo suggested in https://github.com/huggingface/transformers/issues/12061#issuecomment-987448258

  2. In parallel, I think there is a difference between getting a 50x and a 40x response. Regardless of CI mode, a 40x is a client error and should not fall back to a local cache; a 50x is a server error, so a local cache should be used.

There is a caveat for non-public repos, where the error codes are deliberately obscured so as not to expose the private repo layout and an unauthenticated user always gets a 40x regardless of the true path, but I think this falls neatly into the 40x group anyway - a client error.

So here is an updated algo:


# in this algo a successful "fetch the file from online or cache" exits the algo.

If env["TRANSFORMERS_CI_CACHE_RETRY_ON_500"]:

    1. try to fetch the file 
    2. if 50x: look up local cache
       else: fail
    3. if not in cache: sleep and retry to fetch the file
    4. if 50x: sleep and retry to fetch the file
       else: fail
    5. fail with: bailing after retrying 3 times and no local version found cached

else: # normal mode

    1. try to fetch the file
    2. do nothing
    3. if 50x: sleep and retry to fetch the file
       else: fail
    4. if 50x: sleep and retry to fetch the file
       else: fail
    5. fail with: bailing after retrying 3 times

The two halves are almost the same, the only addition being the cache lookup in step 2 of the CI mode - hence the "do nothing" step 2 in the second half.
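
A sketch of how the two modes could share one code path; the env var name comes from the pseudo-code above, while fetch_from_hub and lookup_cache are placeholders for the real file_utils internals:

import os
import time

import requests


def fetch_file(url, fetch_from_hub, lookup_cache, retries=3, sleep=1.0):
    ci_cache_fallback = bool(os.environ.get("TRANSFORMERS_CI_CACHE_RETRY_ON_500"))
    for attempt in range(retries):
        try:
            return fetch_from_hub(url)
        except requests.exceptions.HTTPError as exc:
            status = exc.response.status_code if exc.response is not None else None
            if status is not None and 400 <= status < 500:
                raise  # 40x: client error, fail right away and don't touch the cache
            if ci_cache_fallback and attempt == 0:
                cached = lookup_cache(url)  # CI mode, 50x: prefer a cached copy
                if cached is not None:
                    return cached
            if attempt < retries - 1:
                time.sleep(sleep)  # 50x: pause and retry
    raise OSError(f"bailing after retrying {retries} times")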

What do you think?

And of course the same should apply to both datasets and transformers.

LysandreJik commented 2 years ago

Thank you for handling the GitHub bot - I would love to make time for this, this week or next.