alexgibson opened this issue 7 months ago
My suspicion is that it's a race condition between the containers spinning up at the end of a deployment (which then hits a webhook on mozilla/bedrock that starts the integration/headless tests) and the container's own process pulling down l10n files on startup. If the container starts getting hammered (including by reruns), that might slow/delay the l10n update for longer.
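If that race is real, one way to sanity-check it would be for the webhook-triggered test run to wait until a localized page actually serves before starting. A minimal sketch, assuming a placeholder host, locale, and timeout (none of these are bedrock's real test config):

```python
# Purely illustrative sketch: poll a locale-specific page until it responds
# before starting the headless/integration tests. Host, locale, and timeout
# are placeholders, not values from bedrock's actual test setup.
import time
import urllib.error
import urllib.request


def wait_for_l10n(base_url: str, locale: str = "de", timeout: int = 300) -> bool:
    """Return True once a localized page serves a 200, or False on timeout."""
    deadline = time.monotonic() + timeout
    url = f"{base_url}/{locale}/"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:
                    return True
        except urllib.error.URLError:
            pass  # the container may still be pulling l10n files
        time.sleep(5)
    return False


if __name__ == "__main__":
    if not wait_for_l10n("https://bedrock-test.example.org"):
        raise SystemExit("l10n content never became available; aborting tests")
```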
@stevejalim how come we only see this for the tests branch and not in our regular CI for dev / stage / prod? Shouldn’t the l10n data be fully pulled down before the site is considered to be deployed?
I'd need to look to check whether the test build is something different from the prod build. I'm pretty sure we don't ship an image to prod containing our dev/test deps. Will get back to you.
It might also be that resources are allocated differently for test than dev
So Bedrock Test has vastly more resources than Dev, so it's not that.
https://github.com/mozilla-it/webservices-infra/blob/1739ef3f81e88ca0bd05c42470af2bd2fb7670cc/bedrock/k8s/bedrock/values-test.yaml#L70-L76 vs https://github.com/mozilla-it/webservices-infra/blob/1739ef3f81e88ca0bd05c42470af2bd2fb7670cc/bedrock/k8s/bedrock/values-dev.yaml#L55-L61
And looking about a bit, I now don't think the test image is any different to the regular image we ship to dev/stage/prod. Odd.
Shouldn’t the l10n data be fully pulled down before the site is considered to be deployed?
Yep, and we can see that happening here (which is called by this, which is called in the Dockerfile)
All of which makes me wonder if the data/www-l10n-team directory isn't necessarily available reliably - maybe it's eventually consistent or something similar, which shows up more when the deployment is fresh. (But I'm just thinking aloud right now and need to dig more.)
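A throwaway diagnostic along these lines (not bedrock code; the path is just the one mentioned above) could confirm whether the directory is actually populated when the container starts:

```python
# Throwaway diagnostic (not bedrock code): log how populated the l10n
# checkout is at container start, to see whether it arrives late.
import logging
from pathlib import Path

log = logging.getLogger(__name__)


def report_l10n_state(l10n_root: str = "data/www-l10n-team") -> None:
    root = Path(l10n_root)
    if not root.exists():
        log.warning("l10n dir %s is missing entirely", root)
        return
    locales = sorted(p.name for p in root.iterdir() if p.is_dir())
    log.info("l10n dir %s has %d locale dirs: %s", root, len(locales), locales[:10])
```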
So, one thing that's different on bedrock-test compared to bedrock-dev, -stage, and -prod is that in test mode we run bedrock with supervisord enabled:
https://github.com/mozilla-it/webservices-infra/blob/main/bedrock/k8s/bedrock/values-test.yaml#L114
When RUN_SUPERVISOR is set to True, bedrock is booted via this script, which appears to fake a locale sync having happened so that bedrock will start.
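I haven't dug into exactly what that script does, but "faking a locale sync" presumably amounts to something like writing whatever marker the startup check looks for. A purely hypothetical sketch (the marker path and function name are invented, not the real script):

```python
# Guess at what "faking a locale sync" could amount to (names and paths are
# invented): write a sync marker so bedrock boots without waiting for a
# full l10n pull to finish.
from datetime import datetime, timezone
from pathlib import Path

SYNC_MARKER = Path("data/last-l10n-sync")  # hypothetical marker file


def fake_locale_sync() -> None:
    SYNC_MARKER.parent.mkdir(parents=True, exist_ok=True)
    SYNC_MARKER.write_text(datetime.now(timezone.utc).isoformat())
```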
It also runs a clock process that is always called with at least the 'file' arg, which means we also update files every 5 minutes (by default), and that includes updating the l10n files.
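For context, the clock behaviour described above boils down to roughly this (a simplified sketch, not the real clock process; the l10n_update management command is the one visible in the build log further down):

```python
# Simplified sketch of the periodic update behaviour described above (not
# the real clock process): re-run the l10n update on a fixed interval.
import subprocess
import time

UPDATE_INTERVAL_MINUTES = 5  # the default mentioned above


def run_clock() -> None:
    while True:
        # Hypothetical invocation; the real clock process wires this up differently.
        subprocess.run(["python", "manage.py", "l10n_update"], check=False)
        time.sleep(UPDATE_INTERVAL_MINUTES * 60)
```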
So it's possible that either a) bedrock can start without l10n files available and the l10n update process takes a while to complete, so we're missing locales, or b) we sometimes catch the test server updating its l10n files, and so we're missing locales.
But I'd welcome a second opinion on that from @pmac as I may be misinterpreting or there may be more nuance if I dig deeper
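One way to tell those two cases apart when a locale test fails would be a quick check like this (the directory and the .ftl extension are assumptions based on the discussion above, not verified):

```python
# Rough diagnostic (assumed paths, not production code) to distinguish:
#   a) the locale never arrived      -> its directory is missing or empty
#   b) we caught an update mid-flight -> files exist but changed seconds ago
import time
from pathlib import Path


def diagnose_locale(locale: str, l10n_root: str = "data/www-l10n-team") -> str:
    locale_dir = Path(l10n_root) / locale
    files = list(locale_dir.rglob("*.ftl")) if locale_dir.exists() else []
    if not files:
        return f"{locale}: no l10n files at all (case a?)"
    newest = max(f.stat().st_mtime for f in files)
    age = time.time() - newest
    if age < 60:
        return f"{locale}: files changed {age:.0f}s ago (update in flight, case b?)"
    return f"{locale}: {len(files)} files, newest {age:.0f}s old (looks settled)"
```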
Nice investigation @stevejalim!
I've noticed that tests always seem to fail on first deployment (it no longer seems to be intermittent, from what I can tell). Not sure if that's useful, or if it points to something that has changed recently. Just thought I'd add it here.
This is probably unrelated, but about a week ago this started appearing in all the logs:
#39 5.149 + ./manage.py l10n_update
#39 5.810 Using SITE_MODE of 'Mozorg'
#39 6.377 System check identified some issues:
#39 6.377 WARNINGS:
#39 6.377 ?: (staticfiles.W004) The directory '/app/assets' in the STATICFILES_DIRS setting does not exist.
Description
We've been seeing these kinds of failures intermittently for a while now, where locale-specific tests fail. It looks almost as if the locale data isn't there when the tests run. Pushing again sometimes fixes things, but the problem doesn't seem to be going away. We should try to figure out what's going on.
https://github.com/mozilla/bedrock/actions/runs/8520133305