pangeo-forge / pangeo-forge-orchestrator

Database API and GitHub App backend for Pangeo Forge Cloud.
https://api.pangeo-forge.org/docs
Apache License 2.0

Reduce concurrency to 2 Gunicorn workers #178

Closed andersy005 closed 1 year ago

andersy005 commented 1 year ago

This is an attempt at addressing recent memory issues

[Two screenshots, taken 2022-11-02 at 11:07 AM and 11:06 AM, illustrating the recent memory issues]
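For reference, the change caps Gunicorn at 2 worker processes, since each worker holds its own copy of the app in memory and all of them share the dyno's 512 MB quota visible in the logs below. A minimal sketch of the equivalent launch command, assuming the Gunicorn + Uvicorn-worker setup implied by those logs (the app.main:app module path is a placeholder, not the repo's actual entry point):

gunicorn --workers 2 --worker-class uvicorn.workers.UvicornWorker app.main:app

With 2 workers instead of Gunicorn's suggested (2 x CPU) + 1, each worker has roughly half of the dyno's 512 MB to work with before R14 errors start.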
yuvipanda commented 1 year ago

This could also be because of memory use when running the recipe files?

In addition, I'd suggest increasing the dyno size too on heroku!

andersy005 commented 1 year ago

This could also be because of memory use when running the recipe files?

I hadn't thought about this. I presume by "recipe runs" you mean when we execute the recipe modules via pangeo-forge-runner expand-meta ... for registration, right?
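For anyone following along, that registration step shells out to something along these lines (flags inferred from the bake command logged later in this thread; expand-meta's exact options may differ):

pangeo-forge-runner expand-meta --repo=<feedstock-repo-url> --ref=<commit-sha> --json --feedstock-subdir=recipes/<recipe-id>

Since the orchestrator runs this as a subprocess on the same web dyno, whatever memory the expansion needs counts against the same 512 MB quota as the API workers.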

andersy005 commented 1 year ago

In addition, I'd suggest increasing the dyno size too on heroku!

Do you have any recommendations? We are currently using a hobby dyno, and it seems the next dyno type, standard-1x, isn't that different memory-wise (both offer 512 MB of RAM).

[Screenshot, taken 2022-11-02 at 12:08 PM, comparing Heroku dyno types]
andersy005 commented 1 year ago

In addition, I'd suggest increasing the dyno size too on heroku!

As a first pass, I'm going to enable log-runtime-metrics to track load and memory usage for our current dyno: https://devcenter.heroku.com/articles/log-runtime-metrics.
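For the record, that Labs feature is enabled per app with the Heroku CLI and needs a restart to take effect (the app name below is a placeholder):

heroku labs:enable log-runtime-metrics -a <app-name>
heroku restart -a <app-name>

Once enabled, per-dyno load_avg and memory_* samples are emitted into the app's log stream, which is where the figures quoted below come from.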

yuvipanda commented 1 year ago

@andersy005 measuring seems like the right next step!

andersy005 commented 1 year ago

@yuvipanda, something is going on during the pangeo-forge-runner expand-meta ... call.

Here's the memory profile after a reboot:

2022-11-02T18:30:10.845831+00:00 heroku[web.1]: source=web.1 dyno=heroku.247104119.54df4cd5-f10c-4baa-b412-32d8fa56c24d sample#memory_total=149.32MB sample#memory_rss=148.88MB sample#memory_cache=0.45MB sample#memory_swap=0.00MB sample#memory_pgpgin=69641pages sample#memory_pgpgout=31414pages sample#memory_quota=512.00MB

I then launched a test run for this recipe: https://github.com/pangeo-forge/staged-recipes/pull/215

After calling pangeo-forge-runner expand-meta ..., I started noticing memory spikes:


2022-11-02T18:32:02.363030+00:00 app[web.1]: 2022-11-02 18:32:02,362 DEBUG - orchestrator - Running command: ['pangeo-forge-runner', 'bake', '--repo=https://github.com/norlandrhagen/staged-recipes', '--ref=8308f82cbdede7d8039a72e4137e5d16c800eb89', '--json', '--prune', '--Bake.recipe_id=NWM', '-f=/tmp/tmp985ps8od.json', '--feedstock-subdir=recipes/NWM']
2022-11-02T18:32:14.054996+00:00 heroku[web.1]: source=web.1 dyno=heroku.247104119.54df4cd5-f10c-4baa-b412-32d8fa56c24d sample#load_avg_1m=0.63
2022-11-02T18:32:14.188714+00:00 heroku[web.1]: source=web.1 dyno=heroku.247104119.54df4cd5-f10c-4baa-b412-32d8fa56c24d sample#memory_total=329.25MB sample#memory_rss=326.84MB sample#memory_cache=2.41MB sample#memory_swap=0.00MB sample#memory_pgpgin=122482pages sample#memory_pgpgout=38195pages sample#memory_quota=512.00MB

Notice how the memory increased from 149 MB to 326 MB. The memory eventually blew up, and Heroku restarted the workers:

2022-11-02T18:34:53.563144+00:00 heroku[web.1]: source=web.1 dyno=heroku.247104119.54df4cd5-f10c-4baa-b412-32d8fa56c24d sample#memory_total=826.02MB sample#memory_rss=511.88MB sample#memory_cache=0.00MB sample#memory_swap=314.14MB sample#memory_pgpgin=255319pages sample#memory_pgpgout=124278pages sample#memory_quota=512.00MB
2022-11-02T18:34:53.720844+00:00 heroku[web.1]: Process running mem=826M(161.3%)
2022-11-02T18:34:53.926451+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2022-11-02T18:34:54.931260+00:00 app[web.1]: [2022-11-02 18:34:54 +0000] [57] [CRITICAL] WORKER TIMEOUT (pid:58)
2022-11-02T18:34:54.964405+00:00 app[web.1]: [2022-11-02 18:34:54 +0000] [57] [WARNING] Worker with pid 58 was terminated due to signal 6
2022-11-02T18:34:55.311602+00:00 app[web.1]: [2022-11-02 18:34:55 +0000] [122] [INFO] Booting worker with pid: 122
2022-11-02T18:34:57.219544+00:00 app[web.1]: [2022-11-02 18:34:57 +0000] [122] [INFO] Started server process [122]
2022-11-02T18:34:57.219620+00:00 app[web.1]: [2022-11-02 18:34:57 +0000] [122] [INFO] Waiting for application startup.
2022-11-02T18:34:57.220136+00:00 app[web.1]: [2022-11-02 18:34:57 +0000] [122] [INFO] Application startup complete.

My suspicion is that pangeo-forge-runner's expansion of the meta information is the cause of this spike. I'm not sure whether the S3 crawling in https://github.com/pangeo-forge/staged-recipes/pull/215 could also be contributing to why this recipe in particular is running into these memory issues.
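One way to test that suspicion outside of Heroku would be to run the same expansion locally and record its peak memory with GNU time on Linux (a sketch: repo, ref, and subdir are copied from the bake command above, and expand-meta is assumed to accept the same flags):

/usr/bin/time -v pangeo-forge-runner expand-meta --repo=https://github.com/norlandrhagen/staged-recipes --ref=8308f82cbdede7d8039a72e4137e5d16c800eb89 --json --feedstock-subdir=recipes/NWM

The "Maximum resident set size" line in the output would show whether the expansion alone accounts for the ~180 MB jump seen above, independent of anything else running on the dyno.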

andersy005 commented 1 year ago