Closed cdrini closed 4 years ago
We rolled back the deploy yesterday to resolve
We were noticing high memory usage on ol-web[34] as well as high swap rates on ol-web4. This suggests a memory leak.
Committed memory on ol-web4 (web3 was similar)
Swap memory on ol-web3 (0 on web4) http://ol-web3.us.archive.org:8088/mrtg/committed.html
Here's what went out on prod on deploy: https://github.com/internetarchive/openlibrary/compare/deploy-2020-01-09...deploy-2020-01-16?w=1
These files seem like the largest changes:
I'm not sure what 2 files you're talking about since those three links all show the same contents, which looks like a list of commits.
Oh, that seems like a GitHub bug. Click on the "files changed" tab, ~then press enter in the URL bar~. Does that work? The hash should go to the right file in the diff.
That doesn't work for me either. Do these files have names?
We've reverted compress.py on 970b31b on ol-web3 and are waiting to see what happens with memory:
The compress.py revert did not do the trick, so we're following the plan to test get_metadata.
Wondering if we may be memoizing more content?
We froze up on ol-web3 when testing our compress.py rollback and restarted here: https://gnt-webmgr.us.archive.org/cluster/cluster1/ol-web3.us.archive.org#overview
This is the reverse diff for the get_metadata PR:
git diff 836044c3 836044c3~
We're going to try the above command now :+1:
Yep; with the revert on web3 for ~1hr, we didn't see the spike, so it looks like the memory issue was introduced in #2838
In addition to all the code reorganization, there's also a switch from urllib to Requests, which I consider fairly significant. I wouldn't expect it to cause an issue, but it's worth investigating.
Possible Requests based memory leak on Python 2.7 https://github.com/psf/requests/issues/4553
Is the "Steps to Close" a thing? If so, why is this closed?
With my teams, I'd also address:
This is waiting for the "What caused it?" section, which will be determined once #2899 is completed.
What caused it: https://github.com/internetarchive/openlibrary/pull/2909#issuecomment-578580052
The existing code was double caching. My first refactor must have enabled one of the paths that, while it looked like it should have been used, clearly wasn't. Removing the duplication fixed the memory usage.
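To illustrate the double-caching pattern described above, here is a minimal sketch. It is not the actual openlibrary code: the `CachedFetcher` class, its two cache layers, and the JSON rendering step are all hypothetical, chosen only to show how a second cache wrapped around an already-cached call stores every item twice.

```python
import json


class CachedFetcher:
    """Hypothetical fetcher with an inner cache and an optional outer cache."""

    def __init__(self, double_cache):
        self.double_cache = double_cache
        self.inner_cache = {}  # caches the raw record
        self.outer_cache = {}  # caches the rendered record (the duplicate layer)
        self.fetch_count = 0

    def _fetch(self, key):
        # Stand-in for an expensive remote call.
        self.fetch_count += 1
        return {"key": key, "payload": "x" * 1000}

    def _inner_get(self, key):
        if key not in self.inner_cache:
            self.inner_cache[key] = self._fetch(key)
        return self.inner_cache[key]

    def get(self, key):
        if not self.double_cache:
            # Single layer: render on demand, keep only the raw record.
            return json.dumps(self._inner_get(key))
        # Double layer: the rendered copy is stored as well.
        if key not in self.outer_cache:
            self.outer_cache[key] = json.dumps(self._inner_get(key))
        return self.outer_cache[key]

    def cached_entries(self):
        return len(self.inner_cache) + len(self.outer_cache)


if __name__ == "__main__":
    single = CachedFetcher(double_cache=False)
    double = CachedFetcher(double_cache=True)
    for i in range(50):
        single.get("k%d" % i)
        double.get("k%d" % i)
    print(single.cached_entries(), double.cached_entries())
```

Both variants make the same number of fetches, so the second layer here adds no hit-rate benefit while holding a second copy of every record. That is the kind of duplication that could show up as steadily growing memory.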
Fixed in #2909
Summary
Steps to close
Affects:
label applied?