mozilla / bedrock

Making mozilla.org awesome, one pebble at a time
https://www.mozilla.org
Mozilla Public License 2.0

Pushing to run-integration-tests sometimes doesn't load locale data? #14396

Open alexgibson opened 7 months ago

alexgibson commented 7 months ago

Description

We've been seeing these kinds of failures intermittently for a while now, where locale-specific tests fail. It looks as though the locale data is not there when the tests run. Pushing again sometimes fixes things, but the problem doesn't seem to be going away. We should try to figure out what's going on.

https://github.com/mozilla/bedrock/actions/runs/8520133305

stevejalim commented 7 months ago

My suspicion is that it's a race condition between the containers spinning up at the end of a deployment (which then hits a webhook on mozilla/bedrock that starts the integration/headless tests) and the container's own process pulling down l10n files on startup. If the container starts getting hammered (including by reruns), that might delay the l10n update for longer.

alexgibson commented 7 months ago

@stevejalim how come we only see this for the tests branch and not in our regular CI for dev / stage / prod? Shouldn’t the l10n data be fully pulled down before the site is considered to be deployed?

stevejalim commented 7 months ago

I'd need to look to check if the test build is something different from the prod build. Am pretty sure we don't ship an image to prod containing our dev/test deps. Will get back to you.

stevejalim commented 7 months ago

It might also be that resources are allocated differently for test than dev

stevejalim commented 7 months ago

So Bedrock Test has vastly more resources than Dev, so it's not that.

https://github.com/mozilla-it/webservices-infra/blob/1739ef3f81e88ca0bd05c42470af2bd2fb7670cc/bedrock/k8s/bedrock/values-test.yaml#L70-L76 vs https://github.com/mozilla-it/webservices-infra/blob/1739ef3f81e88ca0bd05c42470af2bd2fb7670cc/bedrock/k8s/bedrock/values-dev.yaml#L55-L61

stevejalim commented 7 months ago

And looking about a bit, I now don't think the test image is any different to the regular image we ship to dev/stage/prod. Odd.

stevejalim commented 7 months ago

"Shouldn’t the l10n data be fully pulled down before the site is considered to be deployed?"

Yep, and we can see that happening here (which is called by this, which is called in the Dockerfile)

All of which makes me wonder if the data/www-l10n-team directory isn't necessarily available reliably - maybe it's eventually consistent or something similar, which shows up more when the deployment is fresh. (But I'm just thinking aloud right now and need to dig more)

stevejalim commented 7 months ago

So, one thing that's different on bedrock-test compared to bedrock-dev and -stage and -prod is that in test mode we run bedrock with supervisord enabled:

https://github.com/mozilla-it/webservices-infra/blob/main/bedrock/k8s/bedrock/values-test.yaml#L114

When RUN_SUPERVISOR is set to True, bedrock is booted up by running this script, which appears to fake a locale sync having happened so that bedrock will start.

It also runs a clock process that is always called with at least the 'file' arg, which means we also update files every 5 minutes (by default), which includes updating the l10n files.

So, it's maybe possible that a) bedrock can start without l10n files available and the l10n update process takes a while to complete, so we're missing locales, or b) sometimes we catch the test server updating its l10n files and so we're missing locales.

But I'd welcome a second opinion on that from @pmac as I may be misinterpreting or there may be more nuance if I dig deeper
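To make hypothesis (a) concrete, here is a minimal sketch of a readiness gate that a test run could poll before starting. It is not taken from bedrock: the function names, the `<l10n_dir>/<locale>/*.ftl` layout, and the default timeout and interval are all assumptions for illustration only.

```python
import time
from pathlib import Path
from typing import Callable


def locale_data_present(l10n_dir: Path, required_locales: list[str]) -> bool:
    """Return True if every required locale has at least one .ftl file.

    Assumption: locale data lives at <l10n_dir>/<locale>/**/*.ftl. The real
    layout of the l10n checkout may differ.
    """
    for locale in required_locales:
        locale_dir = l10n_dir / locale
        if not locale_dir.is_dir() or not any(locale_dir.rglob("*.ftl")):
            return False
    return True


def wait_until(
    check: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 5.0,
) -> bool:
    """Poll check() until it returns True or `timeout` seconds have elapsed.

    Returns True as soon as check() passes, False if the deadline is reached.
    check() is always called at least once, even with a zero timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller might run something like `wait_until(lambda: locale_data_present(Path("data/www-l10n-team"), ["de", "fr"]))` before kicking off the tests. If this gate regularly times out on a fresh deployment, that would support (a). It would not detect case (b), where files disappear briefly during a later update.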

alexgibson commented 7 months ago

Nice investigation @stevejalim!

I've noticed that tests always seem to fail on first deployment (it no longer seems to be intermittent from what I can tell?) Not sure if that's useful, or if it points to something that has recently changed maybe? Just thought I'd add here.

janbrasna commented 7 months ago

This is probably unrelated, but about a week ago this started appearing in all the logs:

#39 5.149 + ./manage.py l10n_update
#39 5.810 Using SITE_MODE of 'Mozorg'
#39 6.377 System check identified some issues:
#39 6.377 WARNINGS:
#39 6.377 ?: (staticfiles.W004) The directory '/app/assets' in the STATICFILES_DIRS setting does not exist.