mitocw / edx-platform

the edX learning management system (LMS) and course authoring tool, Studio
http://code.edx.org/
GNU Affero General Public License v3.0
12 stars 3 forks source link

Discovery: 100-200 500 errors every night from pickle #199

Closed pdpinch closed 8 years ago

pdpinch commented 8 years ago

After the semester stared and traffic increased, we started seeing 100-200 errors ever night from pickle. The specific errors are diverse, although in any given evening they tend to cluster around 2 or 3 variants. I can only assume users aren't getting these errors, because someone would have complained by now.

I'll post some example stack traces here, once I organize them.

pdpinch commented 8 years ago

We will need to check whether this still occurs in Dogwood. It's possible the issue has been addressed since the Cypress release.

pdpinch commented 8 years ago

Here are two samples of the most frequent error. On 2/16 one user generated 10 of these, 2 on a sequential URL and then 8 on the courseware URL. He e-mailed mitx-support to say he was getting 500s and then shortly e-mailed again to say that the issue had been resolved.

https://gist.github.com/pdpinch/115b587a292541b330df

amir-qayyum-khan commented 8 years ago

@pdpinch unable to reproduce this issue, found no change in execution path in dogwood and cypress. It seem on opening courseware/problem for first time using direct link or clicking on courseware tab some how coursekey gets corrupt. From exception it seems issue is here tree = self.metadata_inheritance_cache_subsystem.get(unicode(course_id), {})

TypeError: unhashable type: 'dict'

Although course key is converted into unicode, I am note sure abt this issue.

Course paths that i looked and compare are below for both dogwood and cypress

Cypress
Dogwood
pdpinch commented 8 years ago

I'm going to post some more errors. They all seem to be related to the use of pickle() for caching. Have you been able to find anything about that specifically? I think it's further down the stack than the code link you provided above.

pdpinch commented 8 years ago

Here's a gist with two more sample errors, https://gist.github.com/pdpinch/d227e9ac35d76539c365 We've got a few hundred of these.

 File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/django/core/cache/backends/filebased.py", line 41, in get
   return pickle.load(f)

TypeError: __import__() argument 1 must be string without null bytes, not str
amir-qayyum-khan commented 8 years ago

@pdpinch as for as i was looking into pickle related code, my findings are

Also see this
pdpinch commented 8 years ago

So do we need to add some exception handling around the pickle code, possibly with explicit casing?

On Feb 18, 2016, at 10:38 AM, Amir Qayyum Khan notifications@github.com wrote:

@pdpinch https://github.com/pdpinch as for as i was looking into (pickle related code)[https://github.com/mitocw/edx-platform/blob/mitx-cypress.1/common/lib/xmodule/xmodule/modulestore/split_mongo/mongo_connection.py#L244 https://github.com/mitocw/edx-platform/blob/mitx-cypress.1/common/lib/xmodule/xmodule/modulestore/split_mongo/mongo_connection.py#L244], my findings are

User set's cache in mongo. When he tries to get cache some how key gets corrupt and app was not able to find cache against giving key. From (exception data you share)[https://gist.github.com/pdpinch/115b587a292541b330df https://gist.github.com/pdpinch/115b587a292541b330df] it seems here tree = self.metadata_inheritance_cache_subsystem.get(unicode(course_id), {}) is issue in coursekey. Also see this

kacarstensen-shift/nose_gevent_multiprocess@16568c5 https://github.com/kacarstensen-shift/nose_gevent_multiprocess/commit/16568c51f6f2c6a014b272b39ed6257f394fbf60 http://stackoverflow.com/questions/13264511/typeerror-unhashable-type-dict http://stackoverflow.com/questions/13264511/typeerror-unhashable-type-dict — Reply to this email directly or view it on GitHub https://github.com/mitocw/edx-platform/issues/199#issuecomment-185780569.

amir-qayyum-khan commented 8 years ago

yes, so should i create PR? which branch i should target for PR? On 18-Feb-2016 9:02 pm, "Peter Pinch" notifications@github.com wrote:

So do we need to add some exception handling around the pickle code, possibly with explicit casing?

On Feb 18, 2016, at 10:38 AM, Amir Qayyum Khan notifications@github.com wrote:

@pdpinch https://github.com/pdpinch as for as i was looking into (pickle related code)[ https://github.com/mitocw/edx-platform/blob/mitx-cypress.1/common/lib/xmodule/xmodule/modulestore/split_mongo/mongo_connection.py#L244 < https://github.com/mitocw/edx-platform/blob/mitx-cypress.1/common/lib/xmodule/xmodule/modulestore/split_mongo/mongo_connection.py#L244>], my findings are

User set's cache in mongo. When he tries to get cache some how key gets corrupt and app was not able to find cache against giving key. From (exception data you share)[ https://gist.github.com/pdpinch/115b587a292541b330df < https://gist.github.com/pdpinch/115b587a292541b330df>] it seems here tree = self.metadata_inheritance_cache_subsystem.get(unicode(course_id), {}) is issue in coursekey. Also see this

kacarstensen-shift/nose_gevent_multiprocess@16568c5 < https://github.com/kacarstensen-shift/nose_gevent_multiprocess/commit/16568c51f6f2c6a014b272b39ed6257f394fbf60

http://stackoverflow.com/questions/13264511/typeerror-unhashable-type-dict < http://stackoverflow.com/questions/13264511/typeerror-unhashable-type-dict

— Reply to this email directly or view it on GitHub < https://github.com/mitocw/edx-platform/issues/199#issuecomment-185780569>.

— Reply to this email directly or view it on GitHub https://github.com/mitocw/edx-platform/issues/199#issuecomment-185791273 .

pdpinch commented 8 years ago

After speaking with Dave Ormsbee about this, it seems to be an issue specifically in the metadata_inheritance_cache (maybe that was obvious to you already)

The default settings for this cache in devstack (lms and cms) is to use memcached, which presumably never touches the pickle code. I'm still trying to determine how this is set in production for MITx. Once I determine that you can do some testing using that configuration.

pdpinch commented 8 years ago

Nothing terribly surprising, but I've confirmed that our production servers use the FileBasedCache for mongo_metadata_inheritance. In fact, mongo_metadata_inheritance is the only cache we use FileBasedCache for.

in /edx/app/edxapp/lms.env.json:

        "mongo_metadata_inheritance": {
            "BACKEND": "django.core.cache.backends.filebased.FileBasedCache", 
            "KEY_FUNCTION": "util.memcache.safe_key", 
            "KEY_PREFIX": "integration_mongo_metadata_inheritance", 
            "LOCATION": "/var/tmp/mongo_metadata_inheritance", 
            "TIMEOUT": 300
        }, 

I'll ask devops what they think about changing the cache, but in the meantime can you look into how this could be failing?

amir-qayyum-khan commented 8 years ago

@pdpinch Now I am able to execute the code mentioned in trance.

pdpinch commented 8 years ago

You're probably right. The pattern is very spikey. I think they're correlated to traffic, but I need to check that still.

Is there a way that we can catch the exception(s) and invalidate the cache? One of the annoying things is that we see the same exception dozens of times, so it tries to use the same bad cache over and over again.

pdpinch commented 8 years ago

@mitocw/odl-engineering-devops what are your thoughts on changing the cache backend for mongo-metadata-inheritance? Are there options besides memcache?

IIRC, @carsongee claimed that we don't hit this cache frequently because we have many courses and (relative to edX) few users. The cache doesn't live long enough to get filled.

Also, is there any reasons to suspect that the location of the cache on disk might cause problems? /var/tmp/mongo_metadata_inheritance wouldn't be under gluster right? Is it EBS?

amir-qayyum-khan commented 8 years ago

@pdpinch i can add try catch statement? Should I add and create PR? Against which branch?

blarghmatey commented 8 years ago

As far as alternative options, the most widely used option besides memcached is Redis. As to the underlysing filesystem, I can do some exploration if needed to figure that out.

pdpinch commented 8 years ago

The residential team wants to put off the dogwood upgrade until the end of the semester, so we can't wait and hope that django 1.8 will fix this (or some of the other changes in dogwood). We need to do something soon, and I'm leaning towards changing the cache to memcached (or possibly redis) instead of filebased.

I think the discovery part of this can be considered complete however. I opened a follow-up ticket at https://github.com/mitodl/configuration/issues/52

pdpinch commented 8 years ago

The residential team is pushing for delaying the dogwood upgrade until the end of the semester in May. That means we can't rely on a django 1.8 upgrade to get us out of this problem.

I think we should investigate changing the cache from filebased. I've opened a ticket to that effect, https://github.com/mitodl/configuration/issues/52

We can close this discovery ticket.