Open christianp opened 2 years ago
We have a few cases that look like this (three attempts out of about 900). They showed up because trying to review the attempt fails but creates a long-lived process on the server. These processes accumulate and eventually the server stops responding.
I think there might be an issue with overlapping calls to diff_scormelements
, despite the transaction. Suppose one process sees an undiffed attempt and therefore selects all its scormelements for processing. (None of them have a diff
.) Just after that database query, and before sorting the elements, another process commits a transaction that diffs the same attempt. Now all the scormelements except the oldest will have a diffs
. The continuing process now sorts the oldest element to the front of the list as the only one where hasattr(x,'diffs')
is False. It then creates an unresolvable loop of diffs. I don't think being in a transaction will stop this under Django's default isolation level.
I can reproduce this under a debugger, although I don't know whether the debugger might be altering the behaviour. I could easily be misunderstanding the code (and how transactions work!), but I think the issue is something along these lines. If the filtering and sorting were all done with one database query, that would resolve the particular case that I found, but I'm not confident that it would prevent similar issues in the rest of the function.
@mjmq100 thanks for that. It sounds like a good explanation of what goes wrong. I'll try to work out how to avoid it.
We have one case of an attempt's
cmi.suspend_data
SCORM elements getting incorrectly diffed.The intention is that older elements are stored as diffs against newer elements, so only the newest element is stored in full. The affected attempt had one element stored as a diff against the oldest element.
The relevant method is
numbas_lti.models.diff_scormelements
. It relies on the default ordering of theScormElement
model, which puts newer elements first.The change is done as an atomic transaction, so if the process runs in two workers simultaneously there shouldn't be a race condition.
I can't see how this happened. The oldest element would have to look like it had no diff.