maths / moodle-qtype_stack

Stack question type for Moodle
GNU General Public License v3.0
142 stars 149 forks source link

Quiz attempt review sometimes hangs the system #706

Closed alcarola closed 3 years ago

alcarola commented 3 years ago

Thanks for the input on my other recent issue #704! I'll get to troubleshooting it soon. Now I unfortunately have a another, similar issue.

My colleague organized an electronic exam with STACK today, but when he clicks Review attempt in the quiz, the moodle server stops responding. The web browser only says "Waiting for stack.abo.fi", and in the end there is a Gateway Timeout error. I get the same experience when testing with an updated version of Chrome.

On the server side, in top, a maxima_opt_auto process maximizes CPU at around 98% and load average rises towards four.

Any suggestions how I could trouble shoot this?

Since the university has grading deadlines for teachers, this issue has higher priority than #704.

aharjula commented 3 years ago

I would ensure that the Maxima process timeout is sensible, i.e. small enough, so that the script doing the grading can run enough things to generate output before the gateway times out. With small enough time out it can break out from whatever is hoggin that particular maxima, and continue without sensibly grading that particular case. You might then try to check if the inputs received include something truly expensive to evaluate or if the question has logic that is expensive with particularly nasty input. I had some timeout issues back in #200 when that particular logic was less robust but one can easily construct inputs and logic that receive them that will run forever or just too long e.g. expanding student input and the input is something like (x+a)^6000, for those cases timeouts are critical.

You might also add some logging to the connector-class so that you would get output that says which input timed out it should help with tracking down the offending question and attempt. What you do to that attempt after that is another matter though, just deleting it is probably not an option if it is an exam, but it might be what needs to happen so someone may need to manually grade that students questions before it can be eliminated from the Moodle.

alcarola commented 3 years ago

Thanks for the fast reply!

Where do I set the timeout? I tried editing process.conf as follows:

This is the time that a process is allowed to take when starting up (ms).

startup.timeout = 1000

This is the time added to the lifetime of a process when it is taken to use

so that it wont be killed while in use (ms).

execution.timeout = 5000

This is the lifetime given to a process (ms).

maximum.lifetime = 5000

So a process should not live longer than 11 seconds. I recompiled the maximapool using ant, copied the war file into tomcat's webapps folder and rebooted the machine. Maximapool was unpacked in webapps, but still every maxima_opt_auto hogs all CPU for 30 seconds.

How do I add logging to the connector? I tried googling, but could not find anything I could use. I only found some very general technical instructions that assume a lot of background knowledge that I don't have (time to learn).

aharjula commented 3 years ago

MaximaPool timeout is nice (don't drop the maximum lifetime that low as that will not lead to a sensible pool function) and you can also tune the timeout at the Moodle end in the STACK settings as that is the one that MaximaPool will receive and try to enforce.

Every process hogs 30 seconds or just the one related to this particular use case? And that MaximaPool setup has worked before just fine? No one updated anything at all and the MaximaPool healtcheck page works just fine, STACKs also?

As to logging, I would write something that opens a file (maybe in append mode) and writes something sensible e.g. the command itself inside this little block of code: https://github.com/maths/moodle-qtype_stack/blob/master/stack/cas/connector.server.class.php#L116 That should give some idea about the types of commands leading to trouble.

alcarola commented 3 years ago

Ah, now I saw that it is in the STACK settings in Moodle that this timeout is set. I dropped it to 10 seconds and now indeed I got a STACK timeout instead of a Gateway timeout.

In fact the Review attempt view gives the information on what answer caused the timeout, so I do not need the info on logging the connector anymore; see the screenshot:

Review timeout message

alcarola commented 3 years ago

Ah, sorry, I did not see your answer before posting again and closing the issue... It seems like I have at some point set the STACK timeout to 30 seconds in Moodle. I now dropped it to the recommended 10 seconds again.

In process.conf, I now put a bit smaller values than I had before; now it is

Maximum lifetimes.

This is the time that a process is allowed to take when starting up (ms).

startup.timeout = 2000

This is the time added to the lifetime of a process when it is taken to use

so that it wont be killed while in use (ms).

execution.timeout = 10000

This is the lifetime given to a process (ms).

maximum.lifetime = 60000

aharjula commented 3 years ago

maximum.lifetime = 60000 is still a bit low to my taste but it should do. That setting is the time the pool keeps a process alive in the pool before it replaces it with a new one. Its primary purpose is to allow the pool to scale back after a load spike causes it to grow. In my opinion it is too soon to cut down the pool size after just one minute since the load that caused new processes to be prepared. The default if I remember correctly is 10 minutes that should be enough to keep the pool ready for the next actions of the recent users.

Anyway, good that the factorial was easy to find. Those truly lead to trouble so the design of the question, if they are included, is critical and in some cases I think it would be better to just forbid the whole operator to remove the risk of malicious input.

alcarola commented 3 years ago

Oh, thanks! I forwarded the insight that the factorial was the culprit to the teacher using the question.

I also increased the pool scale-back parameter to 10 minutes, and made a note to myself that this is the purpose of this particular parameter.