Ways of handling Maxima timeouts

aacchhee commented 2 years ago

We have a STACK question where the students have to input a list [m,n]. This list is then parsed in PRT as ans7[1] and ans7[2] with subsequent AlgEquiv on both. Correct answer is [25,10].

Some students have been creative and provided answers [25>10] (probably keyboard layout mismatch) and [R(50,20), R(75,30)] (probably misunderstood something). After providing these answers they logged out and left for lunch. When they came back, the cannot open the same attempt as it times out with an error 504 gateway timeout. I get the same error when I am trying to review these attempts.

Our support was able to go deeper and found that it was these answers timing out Maxima and thus timing out everything. Should not there be a more correct way of handling such timeouts?

aharjula commented 2 years ago

First of all, could you describe the STACK installation in a bit more detail:

How recent is the version of STACK?
Is MaximaPool or goemaxima in use, or is the Maxima running locally on the web server itself?
Is this Moodle or ILIAS?
Does the server run on Linux or some other *NIX?

This sounds familiar, but the familiar thing is from very far in the past. If it were that familiar thing, one should check that Moodles cron job executes correctly.

This is the theory I would start with: In general, you tend to set a timeout for Maxima in the STACK settings and that should be fine, but you need to make sure that that timeout is not too large in relation to the PHP and other timeouts of the actual server or any proxys in front of it. It could be that your question just runs N commands that all take the full Maxima timeout, and N times the Maxima timeout is actually larger than the timeout of other parts of the system. Those other timeouts can be a problem if all the processing happens before anything gets outputted, and that is what happens during that evaluation of those N things...

One can use fairly aggressive means to ensure that local Maxima processes do not take too long. But even then, one needs to consider the overall timeouts of the system.

Anyway, a random comment about the timeout setting. To me, setting it as anything larger than 5 seconds in production is asking for trouble. If one needs it to be larger, one is probably doing something excessive in the question logic, or the machine in use is too weak. Ideally, one would set it lower, but then random load spikes could be a problem. With MaximaPool and others, one can better control the timeout at the other end and dealing with load is simpler.

hallvabo commented 2 years ago

1) STACK 4.4.0 2) Locally, as Maxima cpu usage has never been much, even in peak hours 3) Moodle 3.11.9 4) Linux (optimized), with compiled maxima image maxima timeout is the default 10 sec, with optimized maxima command "timeout --kill-after=12s 12s"

The STACK question has 8 input fields, each with its own multi-nodal PRT. The student's answer which causes the problems was given as [25>10] in input field 8, and is only referenced in PRT8. After the page finally loads, feedback for PRT1-7 gives the following error: "PRT generated the following runtime error: TIMEDOUT"

After increasing nginx timeout, the review page for the attempt finally loads after 3min 40sec. But how can a single student answer that will take maximum 10 seconds to evaluate in Maxima cause this? Is the student answer evaluated many times (with no caching), and in each of the PRTs (even if not referenced there)?

aharjula commented 2 years ago

Unless one turns off caching, all things being sent to Maxima will be cached as long as they return something sensible. Timeout is not sensible, so it will try to execute again if the same things are asked for.

It could be that one of the performance improvements of 4.4 is now leading to a more complex breakage. Basically, pre 4.4 each PRT was evaluated in a separate Maxima session, but now we evaluate them all (all that have valid inputs available) in a single session. The problem is that if that times out, then the loop that was requesting for a single PRTs result will probably keep on looping and asks again for the next PRT, and all the PRTs are again tried to be executed in bulk. So having something timeout anywhere will break everything.

It really should not do that, i.e. I need to check that the thing breaks on the first timeout. On the other hand, AlgEquiv should not time out with that type of comparison, so it would be interesting to see what exactly happens there. @sangwinc Chris, can you imagine some reason why AlgEquiv, in particular, should react that way?

That would explain about 80s (8 PRTs, all active), but you are saying that it took 220s. Moodle does evaluate and render in different passes, so we could get up to 160s with that and that is not that far but still, something else must also be broken.

aacchhee commented 2 years ago

STACK-trouble-with-lists.zip

Here comes the problematic question if someone needs it.

aacchhee commented 2 years ago

And here is the attempt review. The only difference is in ans8.

hallvabo commented 2 years ago

By reducing the CAS timeout, the page load time scales down accordingly. With 5s CAS timeout it takes 111 seconds, and with 3s CAS timeout it goes down to about 67 seconds.

aharjula commented 2 years ago

So basically, the timeout logic does work, and things do "time out", but if the timeout limit is high and if there are many things that get retried too many times, the repeats will take time.

What I think I'll do is that this function should not retry timeouts within the same requests handling as it does the evaluation of the whole "forest of PRTs" and is being called by quite a lot of places leading to the failure being multiplied. Other than that, the root issue of the evaluation of that particular input still needs to be tracked down.

aharjula commented 2 years ago

@sangwinc in that particular case the compiled PRT should probably exit after detecting (does not check for that currently) the error happening in the feedback variables. However, it does not do so and continues with the nodes, some of which now don't have sensible arguments, i.e. we end up working with unbound variables and comparing them to things.

In this case, the error happens when trying to extract the second element from a list with only one element. The question itself should probably have a node at the start of that PRT checking if the answer has the correct number of elements and that extraction of the elements might be better done inside a check for that very same thing.

aacchhee commented 2 years ago

When it comes to the question itself and situations when we do know the number of elements in the list, I would expect that the number of elements is checked at the validation stage -- in the same way we check formulas with allowed/forbidden symbols.

aharjula commented 2 years ago

The problem is that then we would need to separate cases where we want a list of a specific size from cases where the list's length varies, and that would mean that the input-field would need to have options for describing that difference. In general, the algebraic input type is probably not the best target for this as if it would contain all the possible styles of input and options for defining all the possible fixed or varying dimensions, it would quickly drown in options. So it would be better to have a separate input type for lists of fixed or varying lengths. Closest we currently have are the matrix input types, we have no list specialised input types.

Currently, the algebraic input type is able to check for the type of the input, i.e., you could make the validation check that it is a list (i.e. is it of the same type as the teacher's answer), but it does not check the length of that list. Maybe in the future input2 system, we could check for dimensions, but even then, I would imagine using a matrix input would be the way to go as that input type would remove the need for a separating comma or brackets. The current prototype even allows one to control any brackets around the matrix so that it can be made to look like something else than a matrix if need be.

maths / moodle-qtype_stack

Ways of handling Maxima timeouts #856