mozilla / addons

β˜‚ Umbrella repository for Mozilla Addons ✨
Other
125 stars 41 forks source link

theme_update_counts_from_file management command got OOM killed #6795

Closed bqbn closed 5 years ago

bqbn commented 5 years ago

Describe the problem and steps to reproduce it:

There is a cron job that runs ./manage.py theme_update_counts_from_file daily (i.e. https://github.com/mozilla/addons-server/blob/master/scripts/crontab/crontab.tpl#L32). This job wasn't able to finish for 2019-07-04 date and it got OOM killed in the middle of running.

The kernel message showed the process used more than 6 GB memory before it's killed.

[Jul 6 04:48] gollum invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[  +0.001888] gollum cpuset=/ mems_allowed=0
[  +0.001039] CPU: 1 PID: 2132 Comm: gollum Not tainted 3.10.0-862.14.4.el7.x86_64 mozilla/addons-server#1

... ... 

[  +0.001939] Out of memory: Kill process 625 (python) score 779 or sacrifice child
[  +0.001822] Killed process 625 (python) total-vm:6469572kB, anon-rss:6276464kB, file-rss:0kB, shmem-rss:0kB

What happened?

We were alerted for missing stats for 2019-07-04, so we tried to manually re-run all the stats related cron jobs for the date, and during the process, we found that the theme_update_counts_from_file job wasn't able to finish and got OOM killed.

What did you expect to happen?

The job to finish without being OOM killed.

Anything else we should know?

N/A

bqbn commented 5 years ago

We did some more investigation, and tried to run the theme_update_counts_from_file.py in the management shell.

It turns out it is https://github.com/mozilla/addons-server/blob/master/src/olympia/stats/management/commands/theme_update_counts_from_file.py#L109 that caused the OOM, apparently it tried to load too many stuff from the database into the memory.

And it also appears this job has been failing for a while (possibly since the LWT migration), yet no one seems to have noticed or complained about it.

I'm not exactly sure what the purpose of this command serves, and whether or not it is still needed today. If not, maybe we can consider stop running it, or if it does have a purpose, the issue should probably be fixed.

Thanks.

EnTeQuAk commented 5 years ago

And it also appears this job has been failing for a while (possibly since the LWT migration), yet no one seems to have noticed or complained about it.

Since when has it been failing? Since the mentioned 2019-07-04 or even before that?

EnTeQuAk commented 5 years ago

This command specifically is being used for our statistics dashboards, specifically for theme updates. So, as long as we do have that dashboard we'll need that data.

The mentioned line got added in https://github.com/mozilla/addons-server/commit/46e38faf3c970f50f4d8d55bc2d1f8a0f720240e#diff-fbf940e09a29be25c651f1a38db092deR103 and it feels to me it does load too much data from the database indeed.

What this is doing is, selecting all UpdateCount instances for all add-on ids in migrated_personas which I suppose is a lot. The problem here is though, UpdateCount.addon_id is not unique on the model and there will be lots of UpdateCount instances per add-on. Unfortunately the structure is stored in a dictionary and all but the last UpdateCount instance will be overwritten / removed.

@eviljeff is this migration specific piece still needed? If so, we're going to have to optimize it a bit.

eviljeff commented 5 years ago

@EnTeQuAk we probably don't need it forever but for a period of time we should have it to record the usage from user of old versions of Firefox who didn't upgrade. If Product agreed suppose we could regard now as being a good enough period of time.

From what's said above it might be true that it's never worked properly which is disappointing because it was a challenge to develop and apparently I missed the day on the filter πŸ˜’ so challenge failed.

If Product still want to count old LWT users of the themes then it should be fixed (by adding day at least - not sure if that's enough optimization); Otherwise, it can be removed.

bqbn commented 5 years ago

Since when has it been failing? Since the mentioned 2019-07-04 or even before that?

My guess is it's been failing possibly since the LWT migration.

eviljeff commented 5 years ago

@jvillalobos to πŸ™„ and decide if it's worth fixing.

jvillalobos commented 5 years ago

The impact is that LWT users aren't being counted in theme usage stats? If so, I think it's fine to drop.

eviljeff commented 5 years ago

The impact is that LWT users aren't being counted in theme usage stats?

yes.

muffinresearch commented 5 years ago

Ok so if I understand correctly the action here is to remove this cron and the script.

eviljeff commented 5 years ago

Ok so if I understand correctly the action here is to remove this cron and the script.

yep. We're going to remove it rather than try to fix it.

AlexandraMoga commented 5 years ago

@eviljeff does this require QA?