internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0
5.22k stars 1.37k forks source link

Last deploy causing 503s on prod #2891

Closed cdrini closed 4 years ago

cdrini commented 4 years ago

Summary

Steps to close

  1. [x] Assignment: Is someone assinged to this issue? (joint effort between the two)
  2. [x] Labels: Is there an Affects: label applied?
  3. [x] Diagnosis: Add a description and scope of the issue
  4. [x] Updates: As events unfold, is noteable provenance documented in issue comments? (i.e. useful debug commands / steps / learnings / reference links)
  5. [x] "What caused it?" - please answer in summary
  6. [x] "What fixed it?" - please answer in summary
  7. [x] "Followup actions:" actions added to summary
cdrini commented 4 years ago

We rolled back the deploy yesterday to resolve

cdrini commented 4 years ago

We were noticing high memory usage on ol-web[34] as well as high swap rates on ol-web4. This suggests a memory leak.

Committed memory on ol-web4 (web3 was similary) image

Swap memory on ol-web3 (0 on web4) http://ol-web3.us.archive.org:8088/mrtg/committed.html image

cdrini commented 4 years ago

Here's what went out on prod on deploy: https://github.com/internetarchive/openlibrary/compare/deploy-2020-01-09...deploy-2020-01-16?w=1

This files seem like the largest changes:

tfmorris commented 4 years ago

I'm not sure what 2 files you're talking about since those three links all show the same contents, which looks like a list of commits.

cdrini commented 4 years ago

Oh, that seems like a Github bug. Click on "files changed" tab, ~then press enter in the URL bar~. Does that work? The hash should go to the right file in the diff

tfmorris commented 4 years ago

That doesn't work for me either. Do these files have names?

mekarpeles commented 4 years ago

We've reverted compress.py on 970b31b on ol-web3 and are waiting to see what happens with memory:

http://ol-web3.us.archive.org:8088/mrtg/committed.html

mekarpeles commented 4 years ago

compress.py revert did not do the trick, following plan to test get_metadata

Wondering if we may be memoizing more content?

We froze up on ol-web3 when testing our compress.py rollback and restarted here: https://gnt-webmgr.us.archive.org/cluster/cluster1/ol-web3.us.archive.org#overview

cdrini commented 4 years ago

This is the reverse diff for the get_metadata PR:

git diff 836044c3 836044c3~
cdrini commented 4 years ago

We're going to try the above command now :+1:

cdrini commented 4 years ago

Yep; with the revert on web3 for ~1hr, we didn't see the spike, so it looks like the memory issue was introduced in #2838

http://ol-web3.us.archive.org:8088/mrtg/committed.html image

tfmorris commented 4 years ago

In addition to all the code reorganization, there's also a switch from urllib to Requests, which I consider fairly significant. I wouldn't expect it to cause an issue, but it's worth investigating.

tfmorris commented 4 years ago

Possible Requests based memory leak on Python 2.7 https://github.com/psf/requests/issues/4553

tfmorris commented 4 years ago

Is the "Steps to Close" a thing? If so, why is this closed?

With my teams, I'd also address:

cdrini commented 4 years ago

This is waiting for the "What caused it?" section, which will be determined once #2899 completed.

hornc commented 4 years ago

What caused it, https://github.com/internetarchive/openlibrary/pull/2909#issuecomment-578580052

Existing code was double caching. My first refactor must have enable one of the paths, that while it looked like it should have been used, clearly wasn't. Removing the duplication fixed the memory usage.

Fixed in #2909