mozilla / missioncontrol

Real-time monitoring of Firefox release health
Mozilla Public License 2.0

Investigate increases in disk usage with experiments enabled #193

Closed by wlach 6 years ago

wlach commented 6 years ago

We overloaded the 8GB Postgres instance @whd set up yesterday, after turning on experiments. He doubled it to 16GB, which temporarily fixed the problems, but it seems we're using up the additional headroom quickly:

image

I'll investigate what's causing this. I don't want to waste too much time micro-optimizing for space, but it looks like something is genuinely wrong here.

wlach commented 6 years ago

Using this recipe, here are the top 5 results:

                          table_name                          | table_size | indexes_size | total_size 
--------------------------------------------------------------+------------+--------------+------------
 "public"."django_celery_results_taskresult"                  | 6852 MB    | 1186 MB      | 8038 MB
 "public"."datum"                                             | 456 MB     | 355 MB       | 811 MB
 "pg_catalog"."pg_depend"                                     | 488 kB     | 720 kB       | 1208 kB
 "pg_catalog"."pg_proc"                                       | 608 kB     | 320 kB       | 928 kB
 "pg_catalog"."pg_attribute"                                  | 496 kB     | 216 kB       | 712 kB

So basically we're consuming almost all of our space with these silly task result entries. :/ Should be easily fixable.
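The recipe link itself is not preserved here, so the following is only a sketch of how a per-table size listing like the one above could be produced. It assumes a DB-API cursor (for example from psycopg2) and uses PostgreSQL's built-in `pg_table_size`, `pg_indexes_size` and `pg_total_relation_size` functions; the name `largest_tables` is hypothetical.

```python
# Sketch only: not the original recipe, just one query that yields the same
# four columns (table_name, table_size, indexes_size, total_size).
TABLE_SIZES_SQL = """
SELECT
    n.nspname || '.' || c.relname AS table_name,
    pg_size_pretty(pg_table_size(c.oid)) AS table_size,
    pg_size_pretty(pg_indexes_size(c.oid)) AS indexes_size,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT %s
"""


def largest_tables(cursor, limit=5):
    """Return the `limit` largest ordinary tables, biggest first.

    `cursor` is assumed to be a DB-API cursor that uses the %s parameter
    style (as psycopg2 does).
    """
    cursor.execute(TABLE_SIZES_SQL, (limit,))
    return cursor.fetchall()
```

Sorting on the raw byte count rather than the pretty-printed string keeps the ordering correct when sizes mix kB and MB.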

wlach commented 6 years ago

Spent quite a bit of time with @jezdez trying to figure out why these weren't getting expired. I believe the root cause is that so many of them accumulated over time (back when we had a large backlog of tasks). Under normal conditions, the backend_cleanup task should take care of these, but we suspect it was failing due to timeouts because the table had gotten so huge (with tons of tasks going all the way back to August). I'm manually purging this table in the hope that this will bring things back to normal.

wlach commented 6 years ago

After manually purging the table, the backend_cleanup task seems to be running again. I also vacuumed the table, which freed up a ton of space.
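For anyone hitting the same problem, here is a rough sketch of one way a manual purge could be done without running into the same timeouts: delete in small batches, committing after each one. This is an assumption about a reasonable approach, not a record of the exact commands used here. `purge_old_results` and its parameters are hypothetical; the table name comes from the listing above, and `date_done` is the completion timestamp column on django-celery-results' task result model.

```python
def purge_old_results(conn, older_than, batch_size=10000, placeholder="%s"):
    """Delete task results finished before `older_than`, in batches.

    `conn` is assumed to be a DB-API connection. `placeholder` is the
    driver's parameter marker ("%s" for psycopg2, "?" for sqlite3).
    Returns the total number of rows deleted.
    """
    sql = (
        "DELETE FROM django_celery_results_taskresult "
        "WHERE id IN ("
        "SELECT id FROM django_celery_results_taskresult "
        f"WHERE date_done < {placeholder} LIMIT {placeholder})"
    )
    total = 0
    while True:
        cur = conn.cursor()
        cur.execute(sql, (older_than, batch_size))
        deleted = cur.rowcount
        cur.close()
        # Commit per batch so no single transaction runs long enough to time out.
        conn.commit()
        if deleted <= 0:
            break
        total += deleted
    return total
```

Deleting rows does not by itself shrink the table on disk, which is why the vacuum afterwards matters. In PostgreSQL a plain `VACUUM` makes the dead space reusable, while `VACUUM FULL` rewrites the table and returns space to the operating system.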

I filed a PR with some Celery fixes in #196, including a settings change to ensure that task records are expired after just one day. There really isn't much point in keeping them around any longer, since they aren't particularly useful for debugging.
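The expiry setting would look roughly like the sketch below. It is not copied from #196; it assumes the Celery app reads its configuration from Django settings with the `CELERY` namespace, in which case this name maps to Celery's `result_expires` option.

```python
from datetime import timedelta

# Assumption: Celery is configured from Django settings with namespace="CELERY",
# so this sets `result_expires`. Stored task results older than this become
# eligible for deletion by the built-in celery.backend_cleanup periodic task.
CELERY_RESULT_EXPIRES = timedelta(days=1)
```

The cleanup task only runs when celery beat is running, so the setting has no effect on its own if the scheduler is down.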