travis79 closed this 8 months ago
cc @badboy what are your thoughts on this? Is this normal due to some uplifts or something? Is there something we can do to catch and automate this sort of thing? I don't know if it helps to do a full dry-run in CI or not, but right now CI only runs `python -m probe_scraper.check_repositories`, which only validates that the URL and yaml file paths are good and doesn't get into validating commits like a dry-run does.
A full dry-run is LOOOOONG, which is why we don't do it in CI. It's also not even necessarily exactly what runs in airflow, so it might still not be perfect anyway.
As said in channel: adding so many skip commits feels wrong, so maybe we should look into how to handle that better. If it clears things up for the time being I won't stand in the way, of course, but we should involve other DEs for visibility.
The ACTUAL dry-run you want to be running is this:
python3 -m probe_scraper.runner --dry-run --cache-dir tmp/cache --out-dir tmp/out --glean --glean-url=https://github.com/mozilla/gecko-dev --output-bucket=gs://probe-scraper-prod-artifacts/ --glean-limit-date=2024-01-09 --update
because that closely resembles what is run nightly in airflow. Notice the `--glean-limit-date` (and `--update`). That teaches probe-scraper to go back at most to that date: we incrementally add new metrics and don't have to re-parse old commits. Yes, that means a dry-run over everything in gecko-dev won't work as is. I don't think fixing that dry-run-over-everything case is worth it here. It will happen again in the future and we won't notice.
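The incremental idea behind `--glean-limit-date` can be sketched like this (hypothetical names and data; the real logic lives in probe-scraper's runner and git handling):

```python
from datetime import date

# Hypothetical commit records: (sha, commit date) pairs standing in for
# what a real git walk of gecko-dev would produce.
commits = [
    ("aaa111", date(2023, 12, 20)),
    ("bbb222", date(2024, 1, 10)),
    ("ccc333", date(2024, 1, 15)),
]

def commits_to_parse(commits, limit_date):
    """Mimic --glean-limit-date: only parse commits on or after the limit.

    Older commits are assumed to already be reflected in the cached,
    previously published output (the --update behavior), so they are
    never re-parsed.
    """
    return [sha for sha, d in commits if d >= limit_date]

print(commits_to_parse(commits, date(2024, 1, 9)))
# ['bbb222', 'ccc333']
```

This is why a dry-run without a limit date walks far more history than the nightly job ever does.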
So really the next steps should be:
* Document that one should dry-run with a limit-date
* Require everyone to include the exact command they ran when a dry-run failed
Oh, I ran the more specific command plenty of times... but I've had unusual things (like commits that need to be skipped) show up after a PR that breaks probe-scraper has landed, so I've gotten into the habit of doing full dry-runs locally to give myself some peace of mind.
Of course dry-runs pass if you don't process the broken things 😆, but wouldn't this eventually happen in production just like in a local dry-run? Can we set the `--glean-limit-date` param there?
`glean-limit-date` is set for the nightly run: https://github.com/mozilla/telemetry-airflow/blob/03274fdad2ade74de6b51a36e4e4fb7f15985240/dags/probe_scraper.py#L169-L173
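For reference, what a scheduled run passes for that flag boils down to something like the following (a hedged sketch, not the actual DAG code; the function name and two-day lookback are assumptions for illustration, see the link above for the real thing):

```python
from datetime import date, timedelta

def nightly_limit_date(today, lookback_days=2):
    """Hypothetical: compute the --glean-limit-date a scheduled run would
    pass, going back a couple of days so late-landing commits are still
    picked up by the incremental parse."""
    return (today - timedelta(days=lookback_days)).isoformat()

print(nightly_limit_date(date(2024, 1, 11)))
# 2024-01-09
```

Because production always runs with a recent limit date, it never re-walks the old broken commits that a full local dry-run trips over.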
Based on your assurances that this isn't something that could break in production, thanks to `--glean-limit-date`, I'm going to close this PR as unnecessary.
Adding a little comment here at the end:
This problem is a little unique to how probe-scraper deals with `gecko` and other mozilla-central-resident (mirrored to mozilla/gecko-dev) Glean libraries and applications. It is not uncommon for code to be backed out (reverted) on `autoland` before it lands in mozilla-central proper. `gecko-dev/master` contains these backouts, which leads to holes in the history where a file that had been added was later removed and then later re-added.
These holes aren't common, but they aren't rare either, and there will only ever be a finite number of them. So a manual, weird process like `SKIP_COMMITS` is fine. But, really, we shouldn't be tasking probe-scraper with looking at all those `autoland` bounces in the first place.
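The `SKIP_COMMITS` approach is, roughly, a manually maintained denylist consulted while walking history. A minimal sketch, assuming a repo-to-SHA mapping (the names and shape here are hypothetical; probe-scraper's actual structure may differ):

```python
# Hypothetical map of repo name -> set of commit SHAs known to be broken,
# e.g. autoland backout "holes" where a metrics.yaml briefly vanished.
SKIP_COMMITS = {
    "gecko-dev": {"deadbeef", "cafef00d"},
}

def shas_to_scrape(repo, shas):
    """Drop commits manually marked as skippable for this repo."""
    skips = SKIP_COMMITS.get(repo, set())
    return [sha for sha in shas if sha not in skips]

print(shas_to_scrape("gecko-dev", ["aaa111", "deadbeef", "bbb222"]))
# ['aaa111', 'bbb222']
```

It works because the set of holes is finite, but every new backout that trips the scraper means another manual entry, which is the maintenance smell discussed above.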
I think the non-resolution of this is just fine, and we can continue not resolving this into the future. Maybe when mozilla-central moves to git (without needing to mirror to mozilla/gecko-dev) we'll have more options for clever solutions. Like push-mode.
Some gecko metrics.yaml files seemed to be missing from certain commits in my local dry-runs. I'm not sure exactly why, but these are the commits that need to be skipped.