numbas / numbas-lti-provider

An LTI tool provider to run Numbas exams
https://docs.numbas.org.uk/lti/en/latest/

LTI not reporting scores to the VLE; caused by diff_suspend_data bottleneck? #163

Closed: hailsaturn closed this issue 3 months ago

hailsaturn commented 3 years ago

I’m having some issues with the LTI tool:

What I've found:

I know enough to figure out some things, but not enough to say I'm completely correct. Outlining my thought process below:

BEGIN "detective work"

I saw this issue: https://github.com/numbas/numbas-lti-provider/issues/133

Decided to dig around a bit and realised that some of the diff_suspend_data tasks are taking hours to complete. The vast majority of the attempts are being handled in a matter of seconds, but there are a few like this that seem to be taking a long time:

[Screenshots: task list showing a few diff_suspend_data tasks running for hours]

To further investigate, I added some extra logging details to the diff_suspend_data task in tasks.py (lines 5 and 10 below):

 #import logging at beginning of file
 @periodic_task(crontab(minute='*'))
 def diff_suspend_data():
     attempts = Attempt.objects.filter(diffed=False)
     logging.getLogger('huey').info("Number of attempts remaining: {}".format(len(attempts)))
     MAX_TIME = 10
     start = datetime.now()
     if attempts.exists():
         for a in attempts:
             logging.getLogger('huey').info("Attempt detail: {}".format(a))
             diff_scormelements(a)
             t2 = datetime.now()
             if (t2-start).total_seconds()>MAX_TIME:
                 break

This revealed the following (with student information redacted):

[Screenshot: huey worker logs showing attempt details]

In this case, it seems like Workers 1, 2 and 4 are trying to handle the same attempt at once.
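One generic way to stop several workers from picking up the same attempt (a hypothetical sketch, not what the provider actually does) is to have each worker atomically "claim" an attempt before diffing it, so no attempt is ever processed twice:

```python
import threading

# Hypothetical in-memory stand-in for the Attempt table.
pending = {1, 2, 3, 4, 5}       # attempt ids with diffed=False
claimed = set()                  # ids a worker has already taken
claim_lock = threading.Lock()    # plays the role of a DB row lock
processed = []                   # (worker_id, attempt_id) pairs

def claim_next():
    """Atomically take one unclaimed attempt, or None if none remain."""
    with claim_lock:
        remaining = pending - claimed
        if not remaining:
            return None
        attempt_id = min(remaining)
        claimed.add(attempt_id)
        return attempt_id

def worker(worker_id):
    while True:
        attempt_id = claim_next()
        if attempt_id is None:
            break
        processed.append((worker_id, attempt_id))  # "diff" the attempt

threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every attempt is diffed exactly once, no matter how many workers run.
print(sorted(a for _, a in processed))
```

In Django this claiming step would normally be done with a row lock or an atomic update rather than an in-process lock, since the workers are separate processes.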

END "detective work"

I ended up doing this just to try it out:

But it is still taking a long time to finish, and still not reporting to the VLE.

I am wondering the following:

christianp commented 2 years ago

Thanks for looking into this in so much detail! It's hard to debug the task-runner stuff at a scale where these problems show up.

Could you send me the JSON attempt data for an attempt that takes hours to diff? I've seen it take a few minutes, but hours is very bad indeed.

I've added priorities to the tasks that the server runs, so reporting scores should always take precedence over diffing suspend data. To turn that on, update to the latest version on the master branch (commit 7f177cc8be48d442b85daa7a32845af35e2bbb13), and add the following to numbasltiprovider/settings.py:

HUEY = {
    'huey_class': 'huey.PriorityRedisHuey',
}

If you try that, please let me know if it helps. I can't try it on our server because we're in the middle of our exam resit period and the priorities need a newer version of Redis than our version of Ubuntu provides.
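The intended effect of the prioritised queue can be illustrated with a plain-Python sketch (just the ordering semantics, not huey's implementation): a higher-priority task such as a score report is dequeued before lower-priority diff tasks, even when it is enqueued later.

```python
import heapq
import itertools

counter = itertools.count()  # tie-breaker: FIFO within one priority level
queue = []

def enqueue(task_name, priority=0):
    # heapq is a min-heap, so negate priority: larger priority pops first.
    heapq.heappush(queue, (-priority, next(counter), task_name))

def dequeue():
    return heapq.heappop(queue)[2]

# Diff tasks are queued first, but a later score report jumps the queue.
enqueue("diff_suspend_data(attempt=1)", priority=0)
enqueue("diff_suspend_data(attempt=2)", priority=0)
enqueue("report_outcome(attempt=7)", priority=10)

order = [dequeue() for _ in range(3)]
print(order)
# The report runs first, then the diffs in the order they were queued.
```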

hailsaturn commented 2 years ago

Conveniently, after I added the randomised selection of attempts to diff, it didn't take too long to finish everything, and the queue emptied just recently.

...only to find it was still not reporting scores to the VLE.

But I have figured out why! The lis_result_sourcedid field of each attempt was returning None, so the report_outcome function in report_outcome.py was doing nothing.

In the file /opt/numbas_lti_python/lib/python3.6/site-packages/django_auth_lti/middleware_patched.py, the function __call__ in the class MultiLTILaunchAuthMiddleware defines this variable:

lti_launch = {
    'launch_time': datetime.now().isoformat(),
    'context_id': request.POST.get('context_id', None),
    # etc...
    'user_image': request.POST.get('user_image', None),
}

Missing from this list on my end was one line:

'lis_result_sourcedid': request.POST.get('lis_result_sourcedid', None),

Once I added that line to middleware_patched.py, it began reporting to the LMS again. I'm not sure if this was an error in the upgrade process, or an error on my part, or something else - perhaps you would know more?
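The silent failure mode described above can be sketched in a few lines (hypothetical names and data shapes, assuming report_outcome simply skips attempts that have no sourcedid): when the middleware never stores lis_result_sourcedid, every attempt reports nothing, with no error raised.

```python
def report_outcome(attempt):
    """Hypothetical sketch: skip reporting when no sourcedid was stored."""
    sourcedid = attempt.get('lis_result_sourcedid')
    if sourcedid is None:
        return None  # nothing is sent to the VLE: a silent failure
    return {'sourcedid': sourcedid, 'score': attempt['score']}

# Middleware missing the field: the attempt stores None, so nothing reports.
broken = {'lis_result_sourcedid': None, 'score': 0.8}
fixed = {'lis_result_sourcedid': 'abc123', 'score': 0.8}

print(report_outcome(broken))  # None: the score never reaches the VLE
print(report_outcome(fixed))   # a report is produced
```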

So it does turn out that the failure to report is separate to the long diff times. Nonetheless, I'll still send you one of the JSON files - I want to make sure I remove any identifying information first, but once I've done that, I'll send it along.

christianp commented 3 months ago

I think this is fixed now that we're using the prioritised queue.