numbas / numbas-lti-provider

An LTI tool provider to run Numbas exams
https://docs.numbas.org.uk/lti/en/latest/
Other
11 stars 12 forks source link

Diffed SCORM elements can get out of order #194

Open christianp opened 2 years ago

christianp commented 2 years ago

We have one case of an attempt's cmi.suspend_data SCORM elements getting incorrectly diffed.

The intention is that older elements are stored as diffs against newer elements, so only the newest element is stored in full. The affected attempt had one element stored as a diff against the oldest element.

The relevant method is numbas_lti.models.diff_scormelements. It relies on the default ordering of the ScormElement model, which puts newer elements first.

The change is done as an atomic transaction, so if the process runs in two workers simultaneously there shouldn't be a race condition.

I can't see how this happened. The oldest element would have to look like it had no diff.

mjmq100 commented 2 years ago

We have a few cases that look like this (three attempts out of about 900). They showed up because trying to review the attempt fails but creates a long-lived process on the server. These processes accumulate and eventually the server stops responding.

I think there might be an issue with overlapping calls to diff_scormelements, despite the transaction. Suppose one process sees an undiffed attempt and therefore selects all its scormelements for processing. (None of them have a diff.) Just after that database query, and before sorting the elements, another process commits a transaction that diffs the same attempt. Now all the scormelements except the oldest will have a diffs. The continuing process now sorts the oldest element to the front of the list as the only one where hasattr(x,'diffs') is False. It then creates an unresolvable loop of diffs. I don't think being in a transaction will stop this under Django's default isolation level.

I can reproduce this under a debugger, although I don't know whether the debugger might be altering the behaviour. I could easily be misunderstanding the code (and how transactions work!), but I think the issue is something along these lines. If the filtering and sorting were all done with one database query, that would resolve the particular case that I found, but I'm not confident that it would prevent similar issues in the rest of the function.

christianp commented 2 years ago

@mjmq100 thanks for that. It sounds like a good explanation of what goes wrong. I'll try to work out how to avoid it.