gsnedders opened this issue 5 years ago
Yes, this is most likely the case. If the build is not present it will fail. But that is not a problem with mozdownload.
It seems like a problem with mozdownload: we ask for "give me the latest linux nightly", it finds a directory where a linux nightly could be, and because the build hasn't been uploaded yet it fails rather than falling back to the previous nightly. If we are just using the API wrong that's fine (and suggestions/patches welcome), but otherwise this seems like a real issue.
Specifically, we're doing:
```python
from mozdownload import FactoryScraper

scraper = FactoryScraper("daily",
                         branch="mozilla-central",
                         version="latest",
                         destination="browsers/nightly")
filename = scraper.download()
```
At least to me, that looks like it shouldn't ever fail because the build isn't present: either there's some bug on the server side which causes mozdownload to think a build exists, or there's a bug in mozdownload not handling some bit of server behaviour.
Oh, I see. You don't use the buildid to specify a particular fixed version. In that case this would need some further investigation. I assume you don't have a way to run mozdownload with -vv to get more verbose logging output?
So @jgraham's reply would make sense. It looks like we don't fall back into older folders when the expected build isn't found. I wish we already had Taskcluster support; maybe we should raise its severity to make things like that easier.
How often do you hit that?
On a timescale of days, i.e. a few times a week. It's often enough that the behaviour is problematic.
So the code clearly finds the build status file of the latest build. But I wonder if new files first get populated in latest-mozilla-central before an appropriate folder by date is created under the month folder. That is the only thing I could imagine happening here.
Does it help if you add the --retry-attempts argument, and maybe also --retry-sleeptime? As far as I can see you don't make use of that feature yet, and it might help here.
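If that helps, the same thing can presumably be done from the Python API as well; a minimal sketch, assuming FactoryScraper forwards retry_attempts and retry_delay keyword arguments to the underlying Scraper (worth double-checking against the signature in scraper.py):

```python
from mozdownload import FactoryScraper

# Retry a failed download a few times with a pause in between instead of
# giving up on the first missing file (keyword names assumed, see scraper.py).
scraper = FactoryScraper("daily",
                         branch="mozilla-central",
                         version="latest",
                         destination="browsers/nightly",
                         retry_attempts=5,
                         retry_delay=60)
filename = scraper.download()
```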
I'm not sure if this is something we can fix while using archive.mozilla.org if the upload of builds is done in the wrong order. Ideally we should use TaskCluster to download the builds, but that work hasn't been started yet. See issue #365.
I assumed that the problem is the latest-mozilla-central link being updated when the first artifact is available for the new set of builds, not the last artifact. So if, e.g., there's a linux32-debug build ready but not linux64-opt, then latest-mozilla-central will temporarily point at a folder with no suitable build. In that case there's no fallback to just using the latest build that is ready.
Looking at http://archive.mozilla.org/pub/firefox/nightly/latest-mozilla-central/ I see a time gap of at least an hour between the first artifact and the last one (ignoring the non-67 builds), so this at least seems plausible. Given that, retries would help but the total timeout would have to be prohibitively long to avoid hitting this problem.
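A stop-gap on the caller side, rather than a fix in mozdownload itself, would be to fall back to an explicit earlier date when the latest build can't be downloaded. A rough sketch, assuming the daily scraper accepts a date argument and that a missing build surfaces as an exception:

```python
from datetime import datetime, timedelta

from mozdownload import FactoryScraper


def download_nightly(destination="browsers/nightly", max_days_back=2):
    """Try the latest nightly first, then step back one day at a time."""
    attempts = [{"version": "latest"}]
    for days in range(1, max_days_back + 1):
        date = (datetime.utcnow() - timedelta(days=days)).strftime("%Y-%m-%d")
        attempts.append({"date": date})

    last_error = None
    for extra in attempts:
        try:
            scraper = FactoryScraper("daily",
                                     branch="mozilla-central",
                                     destination=destination,
                                     **extra)
            return scraper.download()
        except Exception as error:  # mozdownload raises its own error types; kept broad here
            last_error = error
    raise last_error
```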
I don't think that we completely replace this folder. If we did, the old Firefox 66 nightly builds that are currently present wouldn't still be there.
What mozdownload actually does is check a status file like http://archive.mozilla.org/pub/firefox/nightly/latest-mozilla-central/firefox-67.0a1.en-US.linux-x86_64.txt, which currently lists 20190130215539 as the build id. Based on that information we try to download the files from https://archive.mozilla.org/pub/firefox/nightly/2019/01/2019-01-30-21-55-39-mozilla-central/. But if the files aren't present there, it will fail.
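For illustration, that lookup can be reproduced outside of mozdownload; a sketch based on the URLs above (the assumption that the build id sits on the first line of the .txt file is worth verifying):

```python
import requests

BASE = "https://archive.mozilla.org/pub/firefox/nightly"
status_url = BASE + "/latest-mozilla-central/firefox-67.0a1.en-US.linux-x86_64.txt"

# The status file is expected to carry the build id (e.g. 20190130215539)
# on its first line.
build_id = requests.get(status_url).text.splitlines()[0].strip()

# Turn 20190130215539 into 2019-01-30-21-55-39 and build the dated folder URL.
parts = [build_id[i:j] for i, j in
         [(0, 4), (4, 6), (6, 8), (8, 10), (10, 12), (12, 14)]]
dated_dir = "{0}/{1}/{2}/{3}-mozilla-central/".format(
    BASE, parts[0], parts[1], "-".join(parts))

# If the artifacts haven't been copied into the dated folder yet, this is
# where the download fails.
print(dated_dir, requests.head(dated_dir).status_code)
```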
So I really wonder if we maybe first update the links in the latest folder before actually adding/updating the build id specific folder.
@nthomas-mozilla, who from RelEng could explain to us how the latest-mozilla-central folder and the specific build id folder get populated?
You're right that the latest-mozilla-central directory is appended to rather than recreated (paired with an expiration policy to clean out older builds). Files are moved into that dir by a beetmover task; here's an example log for a recent linux64 nightly.
The artifacts are handled asynchronously, copying them first into the dated directory and then to latest-mozilla-central, and the .txt file is started before the actual tar.bz2 for Firefox. In that particular log there's only a 10 second gap, but longer is possible depending on network conditions. Another complication is that there is up to 14400 seconds (4h) of caching on the latest directory on archive.m.o, but I'm not sure how that would lead to files not being found. More likely would be getting the previous nightly via a stale copy of the .txt file.
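For reference, that caching can be inspected from the standard HTTP response headers; a small sketch (nothing mozdownload-specific, and the headers may differ depending on the CDN configuration):

```python
import requests

url = ("https://archive.mozilla.org/pub/firefox/nightly/"
       "latest-mozilla-central/firefox-67.0a1.en-US.linux-x86_64.txt")
response = requests.get(url)

# Cache-Control shows the configured max-age, Age how old the cached copy
# already is; together they hint whether a stale .txt file is being served.
print(response.headers.get("Cache-Control"))
print(response.headers.get("Age"))
```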
In terms of alternatives, […] the latest-mozilla-central dir and is subject to the caching mentioned above.

Would it be possible for this issue to be assigned to someone? It's coming up in https://github.com/web-platform-tests/wpt/issues/13274 with some regularity, requiring manual intervention each time, as we can't tell the difference between mozdownload failures and other types of failures.
@foolip I replied on the wpt issue with the 2nd part of @nthomas-mozilla's reply. If that doesn't work, maybe increase the retry attempts and delays for mozdownload for now.
@nthomas-mozilla is there a way to prevent using the cache and always get a fresh copy? We currently try to do it via https://github.com/mozilla/mozdownload/blob/master/mozdownload/scraper.py#L425, but that might be wrong?
I don't think the caching should be a problem because as noted there's no suggested mechanism for the cached file to not exist, whereas it looks like we are getting a pointer to a file that doesn't yet exist.
@whimboo AFAIK there's no cache busting that can be done. I had a look through the mozdownload source, and after retrieving the build date it seems get_build_info_for_date() will then try to parse the directory listing at firefox/nightly/YYYY/MM/ to make sure YYYY-MM-DD-hh-mm-ss is present. That listing is also cached, for 15 minutes judging by the headers. For this particular use case I suggest just scraping firefox/nightly/YYYY/MM/YYYY-MM-DD-hh-mm-ss/ directly.
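A rough sketch of that suggestion, polling the dated directory listing directly until the wanted artifact shows up (the artifact pattern and polling parameters here are illustrative, not mozdownload code):

```python
import re
import time

import requests


def wait_for_artifact(dated_dir_url,
                      pattern=r"firefox-.*\.en-US\.linux-x86_64\.tar\.bz2",
                      attempts=10, delay=60):
    """Poll a dated nightly directory until an artifact matching pattern appears."""
    for _ in range(attempts):
        listing = requests.get(dated_dir_url)
        if listing.ok and re.search(pattern, listing.text):
            return True
        time.sleep(delay)
    return False


url = ("https://archive.mozilla.org/pub/firefox/nightly/2019/01/"
       "2019-01-30-21-55-39-mozilla-central/")
print(wait_for_artifact(url))
```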
From https://github.com/web-platform-tests/wpt/issues/13274:
These are stacks like:
From @jgraham in https://github.com/web-platform-tests/wpt/issues/13274#issuecomment-427981381: