
LTI mass update via Mojo::IOLoop causes server response issues #1841

Open somiaj opened 1 year ago

somiaj commented 1 year ago

I have been testing out LTI mass update with Mojolicious on a full class with multiple assignments and noticed that the first request made after the mass update starts waits for the update to finish, yet any requests after that seem to process just fine.

I have tried looking into Mojo::IOLoop to see what is going on, but have not found any useful information as to why that first request would be blocked.

drgrice1 commented 1 year ago

With morbo that will happen. Morbo only runs one worker, so later requests have to wait until the first one finishes. That is why it is for development only.

With hypnotoad, that should not happen. I will look into it.

drgrice1 commented 1 year ago

Testing with a generic Mojolicious app and a timer that just sleeps for 10 seconds, I see the same behavior.

For morbo, as I said, the process is occupied, and no further requests can be processed until the timer completes. That is to be expected for a development server.

For hypnotoad, when the timer starts, the next request from the same client seems to go to the same worker. Sending another request seems to move to another worker, and all is good for that request. Note that a request from another browser seems to go to a different worker anyway and is fine.
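
For reference, a generic app of the kind described here might look roughly like this (an illustrative sketch, not the actual test code; the sleep stands in for the blocking work the LTI update does):

```perl
#!/usr/bin/env perl
use Mojolicious::Lite -signatures;
use Mojo::IOLoop;

# Schedule a blocking callback on the event loop, roughly mimicking how the
# LTI mass update is kicked off after a response has been sent.
get '/slow' => sub ($c) {
  Mojo::IOLoop->timer(0 => sub { sleep 10 });    # blocks this worker for 10 seconds
  $c->render(text => 'Timer scheduled');
};

get '/fast' => sub ($c) { $c->render(text => 'Hello') };

# Run with "morbo app.pl" or "hypnotoad app.pl" to compare the two servers.
app->start;
```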

In this generic app, I also tested forking the process inside the timer (using Mojo::IOLoop::Subprocess), and that fixes the problem for both morbo and hypnotoad. Both are able to continue to serve requests even with the same worker. However, this does not seem to be an option for webwork2 and the LTI mass_update method. The problem is that the database connection gets interrupted when a new request comes in, and everything explodes. The mass_update fails with database errors, and the new request dies. So it is far worse than the current situation.
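
A minimal sketch of that kind of Mojo::IOLoop::Subprocess offload (illustrative only, not the actual test code; the sleep again stands in for the real work):

```perl
use Mojo::IOLoop;

# Inside the same kind of timer callback, hand the blocking work off to a
# forked child with Mojo::IOLoop::Subprocess so the worker stays responsive.
Mojo::IOLoop->timer(0 => sub {
  Mojo::IOLoop->subprocess(
    sub {
      # Runs in the forked child process; blocking here does not stall the worker.
      sleep 10;
      return 'done';
    },
    sub {
      my ($subprocess, $err, @results) = @_;
      # Runs back in the parent event loop once the child exits.
      warn $err ? "subprocess failed: $err" : "subprocess finished: @results";
    }
  );
});
```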

For now, we are just going to have to live with how it is, I think. The user that triggers the mass update will either need to wait or click a link a second time. I suspect that in production, with more entropy in the system (more users making requests and such), this might actually be better, as worker processes will be rotated more frequently.

taniwallach commented 1 year ago

The current approach to triggering and running a mass update for LTI essentially hijacks the server thread (when the request is being shut down via cleanup_register) when the code decides to actually run a mass update. Since the update involves many single LTI transactions to the remote LMS, it is by nature quite time-consuming. However, there is no real need for the mass update to run all that quickly, or for any response to be shown "live" in response to the request which triggered it.

I recommend considering whether the "work" of a mass-update can be offloaded to really be run "externally" from the main event loop.

My first thought was that this could be handed off to an external script, run in a sub-process.

After some reading, it looks like this might be better suited to being handled via a task submitted to a Minion job queue. See: https://docs.mojolicious.org/Minion and https://docs.mojolicious.org/Mojolicious/Plugin/Minion.

Then all the time-consuming processing would be completely offloaded from the regular web server. Reports of successes/errors could be written to a new log file for the relevant course. The "external" code (Minion task or script) doing the actual mass update should open and close its own connection to the database (or at least use a connection not being used by one of the regular workers).
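
To make the idea concrete, a minimal sketch of what this could look like, assuming the SQLite backend and a hypothetical lti_mass_update task (none of these names exist in webwork2 today; the backend path and the helper subroutine are purely illustrative):

```perl
# In application startup: register the Minion plugin with an SQLite backend.
$app->plugin(Minion => {SQLite => 'sqlite:/opt/webwork/webwork2/DATA/minion.db'});

# Define the task that does the slow work outside of the web workers.
$app->minion->add_task(
  lti_mass_update => sub {
    my ($job, $courseID, $userID) = @_;
    # do_lti_mass_update is hypothetical; it would open its own database
    # connection and push grades to the LMS one LTI transaction at a time.
    do_lti_mass_update($job->app, $courseID, $userID);
    $job->finish("LTI mass update for $courseID completed");
  }
);

# Wherever the update is triggered: enqueue the job and return immediately.
$app->minion->enqueue(lti_mass_update => ['MTH101', undef]);
```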

The decision of when an actual mass_update should be triggered could be made either directly in the go method of lib/WeBWorK/ContentGenerator.pm or by running the initial portion of the current mass_update subroutine.


Sending emails in lib/WeBWorK/ContentGenerator/Instructor/SendMail.pm could also probably benefit from being split off to be handled as an "external" task and not kept inside the main process.


For future thought - the entire mass_update code seems to exist to overcome the problem of failed grade pass-backs. At present, each mass update sends the LMS all the scores, even those that have not changed since the server last successfully sent grades to the LMS. If the database recorded a boolean LMS_was_updated at the homework set / user level, which was set to true after a successful grade passback to the LMS (possibly after confirming with the LMS that the score can be retrieved) and cleared whenever a grade change was saved, then far fewer scores would actually need to be sent to the LMS when a mass update was triggered. We could provide a mechanism to allow staff to force a "full mass update", either by making it possible to clear the LMS_was_updated field course-wide or by providing a course management link to trigger a mass update which ignores the value of LMS_was_updated.
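
A rough sketch of the proposed skip logic (the lms_was_updated field and all of the helper subroutines here are hypothetical, not existing webwork2 schema or API):

```perl
# Illustrative only: send a score to the LMS only if it has changed since the
# last confirmed passback, and record the successful passback in the database.
for my $userID (@userIDs) {
    for my $setID (@setIDs) {
        my $record = get_user_set($userID, $setID);
        next if $record->{lms_was_updated};          # LMS already has this score
        if (send_score_to_lms($userID, $setID)) {    # single LTI grade passback
            $record->{lms_was_updated} = 1;          # cleared again whenever the grade changes
            save_user_set($record);
        }
    }
}
```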

somiaj commented 1 year ago

> For future thought - the entire mass_update code seems to exist to overcome the problem of failed grade pass-backs. At present, each mass update sends the LMS all the scores, even those that have not changed since the server last successfully sent grades to the LMS. If the database recorded a boolean LMS_was_updated at the homework set / user level, which was set to true after a successful grade passback to the LMS (possibly after confirming with the LMS that the score can be retrieved) and cleared whenever a grade change was saved, then far fewer scores would actually need to be sent to the LMS when a mass update was triggered. We could provide a mechanism to allow staff to force a "full mass update", either by making it possible to clear the LMS_was_updated field course-wide or by providing a course management link to trigger a mass update which ignores the value of LMS_was_updated.

I am currently working on a page from which an instructor can trigger a mass update of all grades, or just for a selected user and/or set (after modifying an assignment or manually grading one, I often want to update just a single set rather than all of them). Working on that is how I originally ran across this issue. I'll make a draft pull request soon so others can look at it.

It would also be nice to have a system that can simply be triggered, and that can be monitored to see whether it is running, so that new updates cannot be triggered until the current one has finished.
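
If the Minion approach suggested above were adopted, that kind of guard might look roughly like this (the lti_mass_update task name is hypothetical, and the stash key for the course is illustrative):

```perl
# Hedged sketch: only enqueue a new mass update if none is queued or running.
my $courseID = $c->stash('courseID');    # route capture name is illustrative

my $pending = $c->minion->jobs(
    {tasks => ['lti_mass_update'], states => ['inactive', 'active']})->next;

if ($pending) {
    $c->render(text => 'An LTI mass update is already queued or running.');
} else {
    $c->minion->enqueue(lti_mass_update => [$courseID]);
    $c->render(text => 'LTI mass update queued.');
}
```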

drgrice1 commented 1 year ago

I am familiar with Minion, and have already considered it. Unfortunately, Minion requires Postgres. That is the only implemented backend at this point. That is something that webwork2 cannot depend on.

taniwallach commented 1 year ago

@drgrice1 Sorry - I'm not familiar with it, just found it while trying to consider a better approach to how mass updates can be handled. However, I did recall seeing mention of a SQLite backend in a blog post at https://thecurioustechnoid.com/mojolicious-minion-a-high-performance-job-queue/ and Google found the module https://metacpan.org/pod/Minion::Backend::SQLite.

drgrice1 commented 1 year ago

Odd, now that I do a Google search for "Minion backends", I find that as well as https://metacpan.org/pod/Minion::Backend::mysql. The searches I have done before have all had "Mojolicious" in them, and I guess that was too narrow.

In any case, with one of those backends, we can probably use Minion. Note that this will add some additional setup though. Minion is run as a separate app from the Mojolicious app, and so another systemd service file (or a similar mechanism) will be needed to control it. This could perhaps be optional for those that need the functionality provided. When I get a chance, I will implement it.
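
For reference, such a service file might look roughly like the following (a sketch only, assuming the Minion plugin ends up being loaded by the webwork2 app; the paths, user, and unit name are illustrative):

```ini
# /etc/systemd/system/webwork2-minion.service (illustrative)
[Unit]
Description=WeBWorK2 Minion worker
After=network.target mariadb.service

[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/webwork/webwork2
ExecStart=/opt/webwork/webwork2/bin/webwork2 minion worker
Restart=on-failure

[Install]
WantedBy=multi-user.target
```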

taniwallach commented 1 year ago

I have been thinking more about the underlying problem, and not just what we can do to fix LTI mass update.

If I understand things correctly after looking at https://docs.mojolicious.org/Mojo/Server/Hypnotoad, it seems that the issue with LTI mass_update is that it is a blocking operation. Since Mojolicious/hypnotoad allow a worker to process several requests concurrently (the clients setting), when the code of one request gets into a long blocking operation, the worker is "stuck", both for the request being processed and for any later requests allocated to the same worker, until the relevant section of blocking code completes.

AFAIK this is quite different from the manner in which requests are assigned to Apache workers. In Apache, a worker would not have accepted a new request for processing until the prior one had been completed, which is why in the past we did not need to worry about latency for additional requests being caused by a small number of requests with long processing times, so long as other workers were available reasonably quickly.

If this conclusion is correct, any value of the clients setting over 1 will potentially allow "new requests" to be assigned to a worker which is caught up in a "long" blocking operation, which may create very noticeable lags in the responsiveness of the system even under moderate load, as some requests will end up assigned to workers busy on a long blocking request.

Given the fact that essentially all of the WeBWorK codebase was written under the assumption that the "worker" running the code was not going to attempt to process multiple requests concurrently, I don't think we can quickly redesign the WeBWorK code to be more supportive of concurrent handling of requests. As such, it would seem that WeBWorK as a hypnotoad app should probably be configured to use a very low value of clients (possibly just 1) and a higher value of workers to provide a mechanism to handle multiple requests in parallel.
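
For illustration, such a configuration might look roughly like this in conf/webwork2.mojolicious.dist.yml (the key names follow Mojo::Server::Hypnotoad; the values shown here are placeholders, not recommendations):

```yaml
hypnotoad:
  listen:
    - http://*:8080
  clients: 1             # one connection per worker, since the code is blocking
  workers: 10            # scale with available CPU and RAM
  heartbeat_timeout: 60  # allow slower requests to keep their worker alive
  graceful_timeout: 120
```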

However, it does seem that we should try to locate the most likely places for significant delays due to inherently slow operations (such as the network operations in LTI grade-passback) and try to refactor such parts of the codebase to either run the code as a background task (if the task is really slow) or to use asynchronous coding practices to allow concurrent request processing while the slow request is being processed.

My impression is still that LTI mass update is probably a case where handing off the work to a background task is the better solution.


I think the idea of using Minion for "slow tasks" could probably be of great benefit to some of the other "slow" things, as they could also trigger similar latency on additional requests assigned to the same worker.

@drgrice1 Your comments in conf/webwork2.mojolicious.dist.yml hint that relatively large timeout values were set to try to prevent failures of "long" operations (like assigning a set to all users). I suspect that in most cases, such "long" operations are not really things which truly need to complete immediately, and could be handed off to be processed as background tasks.

Other "slow" parts of the system I can think of:

  • scoring a course,
  • hardcopy (PDF) generation, especially for multiple students at once,
  • assigning a set to all users.

Any such slow operations which can be offloaded into external tasks would keep the main server more available to process "regular" requests. Use of high-priority queues would hopefully mean that we could be reasonably confident that the more urgent tasks would be run reasonably soon after being queued, while allowing the main server to quickly complete the processing of the "main" request without keeping a worker tied up while the slow process was running.

Minion supports using "spare" fast-lane high-priority queues (https://dev.to/kraih/high-priority-fast-lane-for-the-minion-job-queue-4711) for backends which implement the necessary changes, including the SQLite backend (https://github.com/Grinnz/Minion-Backend-SQLite/commit/cc67ef907b3a604adfc0970dd7a976685b085ee8). From what I can tell, the MySQL backend does not seem to support the min_priority feature. That might be a significant advantage to using the SQLite backend (unless we want to fork the MySQL backend and add the feature).


There is a Minion plugin for Mojolicious. Maybe that means that it does not need to be treated as a fully separate app. See: https://docs.mojolicious.org/Mojolicious/Plugin/Minion. However, from https://docs.mojolicious.org/Minion/Guide it seems that even with the plugin there is a need to take additional steps to start a Minion worker, so in any case there are certainly additional things to start up.

The ideas in https://github.com/openwebwork/webwork2/pull/1842 could take advantage of the option to set priorities for tasks sent to Minion, so that "smaller" LTI updates would get processed before bigger ones.


I'm not sure how using Minion will work with running WeBWorK in a Docker container, but I assume that docker-entrypoint.sh can probably be used to start up the Minion worker in the same container. Even if running multiple things in a single Docker container is not typically recommended, in this case it seems relatively reasonable to me.

drgrice1 commented 1 year ago

@taniwallach: A couple of quick notes on your comments.

Scoring and PDF generation do need an immediate response. So those won't work in an external task as you suggest. Also, they generally are not that slow.

Assigning a set to all users is not a problem in a general setting where you only have a few hundred users. It is only a problem when you have a few thousand users. An extreme case.

taniwallach commented 1 year ago

@drgrice1 - Thanks for considering the ideas and questions I am raising. My intention is to raise the issues which seem to arise from the change to Mojolicious/hypnotoad long before problems are likely to occur on production systems.


I think my main conclusion from the discussion thread and my reading is that we should take it for granted that webwork2 is currently an inherently blocking application, so we need to set clients=1 in the hypnotoad configuration and use a suitably large number of workers based on the system resources. That is also the recommendation given at https://github.com/mojolicious/mojo/wiki/Benchmarking.

As we discussed, it does seem that the types of operations most likely to run a risk of exceeding timeout limits are certain instructor operations. Thus, clients=1 should vastly reduce the risk of students encountering huge response time issues.

I created https://github.com/openwebwork/webwork2/pull/1848 as a result of this conclusion.

Before we release WW 2.18 based on Mojolicious and hypnotoad we should try to provide some guidance on how to tune the various hypnotoad settings.

  1. timeout-related settings based on course sizes or the expectation of slow operations,
  2. number of workers as a function of RAM available.

Even if we can only provide very rough guidelines at the start - it would be better than nothing, and we can include an explicit request for people to post feedback about their settings, resources, and performance in the forums.


We should try to be very aware of what types of requests can potentially block for a significant amount of time, and try to understand the implications of the request processing being "killed" before the expected processing is completed if the time limit (heartbeat_timeout + graceful_timeout?) is exceeded.

For example, in some testing I did on develop with a large course, I ran into cases where "assign set to all students" did not complete, the response page was not sent and some users did not get assigned the set. I worry that in such cases, the problem could even have been worse - the last user being assigned the set might have been assigned a "broken" set if the hypnotoad worker was killed in the middle of the various problem records being added to the database.

That motivated the work on https://github.com/openwebwork/webwork2/pull/1846 which seems to make a very significant difference to the speed of "assign set to all/multiple users".


Maybe we can find a mechanism whereby a hypnotoad worker would be able to issue a new heartbeat in the middle of processing one of the longer requests. This is well beyond what I currently understand about Mojo's asynchronous processing support, so I have no idea how we could "hack" in some type of "await" to let the Mojo event loop take over briefly. If that is possible, we could avoid very long requests being killed by triggering such "awaits" in suitable places.

Where possible, I do think that potentially "very slow things" should move into "external tasks", even if that significantly changes how things are done at present. I suspect that approach is more dependable and less disruptive to the server than just increasing the various timeout settings to be "large enough".

Based on the post at https://groups.google.com/g/mojolicious/c/BxnTz5Na3pI/m/alFlFnjHShcJ, hypnotoad tries to tell a worker to shut down gracefully using a SIGQUIT. One thing which might help prevent really bad things from happening would be to check for and handle SIGQUIT signals in such a manner that we could detect having received a SIGQUIT inside a potentially long loop and decide to stop processing at a "good" location.

I have not looked into signal handling in Perl before, but Google found sample code using something like local $SIG{QUIT} = sub { code_to_run; } to set up a somewhat scoped signal handler. See https://stackoverflow.com/a/18427691/18664912.
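
For illustration only (the per-user helper and the variables are hypothetical), a scoped handler that merely records the signal, which a long loop then checks at safe points:

```perl
# Sketch only: record a SIGQUIT instead of dying, and stop the long loop at
# a safe point. assign_set_to_user, @userIDs, and $setID are hypothetical.
my $quit_requested = 0;
{
    local $SIG{QUIT} = sub { $quit_requested = 1 };

    for my $userID (@userIDs) {
        last if $quit_requested;               # stop at a "good" location
        assign_set_to_user($setID, $userID);   # slow per-user work
    }
}
```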


The documentation and release notes will need to make it clear what sort of actions may run into trouble if the hypnotoad timeout values are too low. To whatever extent is possible - it might help if the relevant pages in the instructor tools included a clear warning that failures should be taken as an indication that tuning of the server timeout settings is probably necessary.


I don't think we should treat a course with "a few thousand users" as an extreme case which we can ignore for now. I certainly have several courses a semester with close to 1000 students in them, and would not want to end up running into significant problems running such courses on WW 2.18 when it is ready for production use.


I am aware that the current code flow for scoring and PDF generation expects an immediate response. However, they could probably be reworked to return a "file being prepared" response, with a later check to see when the file is ready.

Scoring is certainly an operation whose anticipated runtime will be roughly linear in the class size, so I believe its run-time can become significant in large courses. Maybe the database access for scoring can be improved to make it faster. I have not yet looked at how it is handled.

PDF generation is probably less of a problem when used by students. Staff can request hardcopy for multiple students in a single file, so the generation time can certainly become significant. That is something where an "await and allow a heartbeat" solution might be very helpful.


I think we have 2 motivations to take into account when deciding how to deal with these types of issues.

  1. Requests (in particular those which modify the database tables or which generate scoring data) should reliably run to completion.
    • Anything whose runtime will be linear in either the number of students or the number of assignments should be considered as potentially problematic.
  2. Typical student interactions with the system should not experience significant delays/latency when some small number of "slow requests" are being processed. (To be solved by using clients=1.)

In production (WW 2.17) I rarely see response times (wall time as the browser developer tools measure it) for opening problems or submitting answers of more than 2-3 seconds. Anything which can cause the response time for such a request to exceed about 6 or 7 seconds will be quite noticeable to students and pretty bad for the end user experience. We should try to avoid such occurring when the server has available CPU/RAM resources.

However, given the test results below, it seems that the manner in which hypnotoad will kill requests which are not completing "fast enough" means that some requests will not run to completion if they take too long. I saw that for large "assign set to all students" requests when the server was under stress, some of the students did not get the set assigned to them.

https://github.com/openwebwork/webwork2/pull/1846 should help reduce the risk of this, but essentially it just pushes the problem out so that it applies only to significantly larger courses.

I did some testing on my laptop via a Docker based WW server in a course with about 850 users (copied from a real course). I created 12 sets of 25 problems each. (Somewhat large, but there are some assignments of that length in the courses I have in production, and the large size was intended to test behavior under load).

A single "assign to all" of such a set in this course took 42-60 seconds (over several separate runs). When I tried the same "assign to all" using the WW 2.17 Docker image (on the same hardware), it took only 28-30 seconds.

When I tried to run 10 such requests in parallel, the first 6 took about 100 seconds each (in parallel) and the other 4 took about 160 seconds each. It could be that to some extent the database side started to become a bottleneck.

In a second round of trying to assign all users to the same 10 assignments (after unassigning all users) - some of the pages failed to load (no response was received) and for those sets - some users did not get the set assigned to them. All the pages which did load took over 2.4 minutes to load.

While the server was "busy" handling 10 parallel "assign to all" requests nothing else would respond, which is to be expected with workers set to 10 and all being busy.

Note: Using the code from https://github.com/openwebwork/webwork2/pull/1846 (as of the version there now) gets the runtime for the same tests to 13-15 seconds.

Scoring all the sets in that course took about 14 seconds.

drgrice1 commented 1 year ago

We probably are going to need to set clients to 1. However, it would be better to try and identify where things are blocking, and try to make them non-blocking if possible.

Using a SIGQUIT handler is not a good idea. In general any type of system signal handler is not a good idea. Trying to interfere with the way those are supposed to work usually just leads to more problems and system instability.

We should treat a course with a large number of users (like a thousand or more) as an extreme case (I never said a case to ignore though). You are thinking about your situation, and not the most common situations for WeBWorK installations. The configuration should be set for the most common situation, not the outlier case. Of course the documentation (in comments and install notes) should mention how to adapt these settings for those outlier cases. Note that the current documentation does not do a good job of covering the details of this for the current mod_perl2/apache deployment. Part of this is because it is a complicated issue. Largely it comes down to the administrator either having the knowledge or doing the research to set things up correctly for the particular installation. It may be that this is really a case of providing tips on where to find the information needed to correctly set up your server, rather than trying to repeat what is documented elsewhere.