mozilla / probe-scraper

Scrape and publish Telemetry probe data from Firefox
https://mozilla.github.io/probe-scraper/
Mozilla Public License 2.0

probe scraper unable to download file for tree: integration/mozilla-inbound #488

Open relud opened 2 years ago

relud commented 2 years ago

one revision in tree integration/mozilla-inbound isn't available outside of probe-scraper's cache:

Retreiving Buildhub results for channel nightly
  4645 revisions found
...
  Downloading files for revision number 494/4645 - revision: 46fe2115d46a5bb40523b8466341d8f9a26e1bdf, tree: integration/mozilla-inbound, version: 49.0a1
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/probe_scraper/runner.py", line 833, in <module>
    main(
  File "/app/probe_scraper/runner.py", line 647, in main
    upload_paths += load_moz_central_probes(
  File "/app/probe_scraper/runner.py", line 323, in load_moz_central_probes
    revision_data = moz_central_scraper.scrape_channel_revisions(
  File "/app/probe_scraper/scrapers/moz_central_scraper.py", line 207, in scrape_channel_revisions
    files = download_files(
  File "/app/probe_scraper/scrapers/moz_central_scraper.py", line 123, in download_files
    raise Exception(
Exception: Request returned status 404 for https://hg.mozilla.org/releases/integration/mozilla-inbound/raw-file/46fe2115d46a5bb40523b8466341d8f9a26e1bdf/toolkit/components/telemetry/Histograms.json

This is locally reproducible for me by running:

python3 -m probe_scraper.runner --out-dir=temp/probe_data --cache-dir temp/probe_cache --moz-central --firefox-version=49 --firefox-channel=nightly

and is fixed by manually downloading s3://telemetry-airflow-cache/cache/probe-scraper/hg/46fe2115d46a5bb40523b8466341d8f9a26e1bdf/toolkit/components/telemetry/Histograms.json into my local cache.

relud commented 2 years ago

I modified probe_scraper/scrapers/moz_central_scraper.py to try and find all missing revisions, and this appears to be the only one.

my changes:

```diff
diff --git a/probe_scraper/scrapers/moz_central_scraper.py b/probe_scraper/scrapers/moz_central_scraper.py
index 61dea29..4c5ed1f 100644
--- a/probe_scraper/scrapers/moz_central_scraper.py
+++ b/probe_scraper/scrapers/moz_central_scraper.py
@@ -194,25 +194,34 @@ def scrape_channel_revisions(
 
     print("  " + str(num_revisions) + " revisions found")
 
+    trees = set()
     for i, rd in enumerate(revision_dates):
-        revision = rd["revision"]
+        if rd["tree"] not in trees:
+            if rd["tree"] != "integration/mozilla-inbound":
+                trees.add(rd["tree"])
 
-        print(
-            (
-                f"  Downloading files for revision number {str(i+1)}/{str(num_revisions)}"
-                f" - revision: {revision}, tree: {rd['tree']}, version: {str(rd['version'])}"
+            revision = rd["revision"]
+
+            print(
+                (
+                    f"  Downloading files for revision number {str(i+1)}/{str(num_revisions)}"
+                    f" - revision: {revision}, tree: {rd['tree']}, version: {str(rd['version'])}"
+                )
             )
-        )
-        version = extract_major_version(rd["version"])
-        files = download_files(
-            channel, revision, folder, error_cache, version, tree=rd["tree"]
-        )
-
-        results[channel][revision] = {
-            "date": rd["date"],
-            "version": version,
-            "registries": files,
-        }
-        save_error_cache(folder, error_cache)
+            version = extract_major_version(rd["version"])
+            try:
+                files = download_files(
+                    channel, revision, folder, error_cache, version, tree=rd["tree"]
+                )
+
+                results[channel][revision] = {
+                    "date": rd["date"],
+                    "version": version,
+                    "registries": files,
+                }
+            except Exception:
+                import traceback
+                traceback.print_exc()
+            save_error_cache(folder, error_cache)
 
     return results
```

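
The core idea of the change above (catch each download error, record it, and keep going so every missing revision is reported in one run) can be sketched on its own. This is only an illustration: `collect_failures` and `fake_fetch` are hypothetical names, not probe-scraper functions, and the stand-in fetcher simply raises for one revision instead of making a real request.

```python
def collect_failures(items, fetch):
    """Call fetch(item) for every item and keep going on errors.

    Returns (results, failures): results maps each item that succeeded to
    its fetched value; failures lists (item, error message) pairs.
    """
    results, failures = {}, []
    for item in items:
        try:
            results[item] = fetch(item)
        except Exception as exc:  # broad on purpose: this is a diagnostic pass
            failures.append((item, str(exc)))
    return results, failures


def fake_fetch(revision):
    # Hypothetical stand-in for a real download; fails for one revision only.
    if revision == "46fe2115d46a5bb40523b8466341d8f9a26e1bdf":
        raise Exception("Request returned status 404")
    return {"Histograms.json": "ok"}


revisions = ["aaaa", "46fe2115d46a5bb40523b8466341d8f9a26e1bdf", "bbbb"]  # "aaaa"/"bbbb" are placeholders
results, failures = collect_failures(revisions, fake_fetch)
for revision, message in failures:
    print(f"missing: {revision} ({message})")
```

Unlike the diff, this sketch returns the failures as data rather than printing tracebacks, which makes the list of missing revisions easier to act on afterwards.
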
relud commented 2 years ago

for now I've asked Data SRE to copy the missing cache file to the new cache location, https://mozilla-hub.atlassian.net/browse/DSRE-1001?focusedCommentId=590672, but idk if there's a long-term solution needed here.

cc @chutten

chutten commented 2 years ago

...why are we pulling mozilla-inbound? Surely we only care about mozilla-central? Branches on /integration/ don't ship binaries we'd expect to receive data from, so we shouldn't need to care much about what is or isn't present on them.

relud commented 2 years ago

we're pulling from that tree because it's listed by buildhub. we don't (currently) filter what buildhub returns for firefox versions when scraping legacy telemetry in prod. specifically for firefox nightly 49.0a1, buildhub returns a list that includes revision: 46fe2115d46a5bb40523b8466341d8f9a26e1bdf, tree: integration/mozilla-inbound
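
If filtering turns out to be the long-term fix, one option would be to drop entries from unwanted trees before any files are downloaded. A minimal sketch, assuming each buildhub entry is a dict with the `revision`, `tree`, and `version` keys seen in the log output above; `ALLOWED_TREES` and `filter_revisions_by_tree` are hypothetical names, and which trees belong on the allow-list is an assumption that would need checking against what buildhub actually returns.

```python
# Hypothetical allow-list: the right set of trees is an assumption.
ALLOWED_TREES = frozenset({"mozilla-central"})


def filter_revisions_by_tree(revision_dates, allowed_trees=ALLOWED_TREES):
    """Split buildhub entries into (kept, skipped) by their "tree" value."""
    kept, skipped = [], []
    for rd in revision_dates:
        if rd.get("tree") in allowed_trees:
            kept.append(rd)
        else:
            skipped.append(rd)
    return kept, skipped


sample = [
    {
        "revision": "46fe2115d46a5bb40523b8466341d8f9a26e1bdf",
        "tree": "integration/mozilla-inbound",
        "version": "49.0a1",
    },
    {
        # Made-up placeholder entry, for illustration only.
        "revision": "0000000000000000000000000000000000000000",
        "tree": "mozilla-central",
        "version": "49.0a1",
    },
]

kept, skipped = filter_revisions_by_tree(sample)
for rd in skipped:
    print(f"skipping revision {rd['revision']} from tree {rd['tree']}")
```

Returning the skipped entries as well as the kept ones lets the scraper log what it ignored, so a silently dropped revision doesn't become a new mystery later.
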