turnkeylinux / tracker

TurnKey Linux Tracker
https://www.turnkeylinux.org

Canvas - issues sending emails - task runners crashing with "out of memory" errors when there is plenty of free memory #1979

Open JedMeister opened 1 month ago

JedMeister commented 1 month ago

Update: I've commented out the proposed "fix" as it does not seem to make any difference?! My colleague assures me that it was working (at least in part), but my testing suggests otherwise.

Our latest v18.x Canvas has a number of known issues that we're working to resolve. AFAICT all the issues we are investigating are related to the background task runners crashing.

The issues that have been reproduced and relate to the background task runners crashing are:

Other issues that have not been directly confirmed but appear to be related are:

As some background to the apparent cause of the issue: when any action is triggered in Canvas (e.g. sending an email, uploading a file, or making most changes in the UI), the action is added to a background queue. When operating correctly, a background service initiates a task runner process to action the next job on the queue.
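
If you want to see the queue for yourself, it can be inspected directly in the database. This is just a sketch: it assumes the standard inst-jobs delayed_jobs table and a database named canvas_production, and the column names may vary between inst-jobs versions - adjust to match your install.

sudo -u postgres psql canvas_production \
    -c "SELECT id, tag, attempts, run_at, locked_by FROM delayed_jobs ORDER BY run_at LIMIT 20;"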

In our current Canvas release, the background service is running OK, but the individual task runners are crashing without completing their tasks, leaving the jobs stuck in the queue. The task runners die with an error message to the effect of "out of memory" - even though there is plenty of free system memory.
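
The crash messages can be found in the delayed jobs log and the systemd journal. The log path below is an assumption based on the default Canvas layout under /var/www/canvas - adjust if yours differs:

tail -n 50 /var/www/canvas/log/delayed_job.log
journalctl -u canvas_init --since "1 hour ago"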

We thought that we had developed a solution which seemed to resolve the confirmed issues, e.g. emails started being sent. However, after testing the "fix" on multiple servers over numerous reboots, it became clear that it merely changed the nature of the front end error(s) and reduced the incidence of the task runner crashes. It didn't actually stop them occurring altogether. Intermittent task runner crashes (with the same memory error message) were still occurring.

The "fix" we developed/discovered was applying an undocumented DB migration. As noted, it appears not to be a complete fix, but it can be applied (as root) with the following commands:

# stop the Canvas background job service and the web server
systemctl stop canvas_init
systemctl stop apache2

# install the switchman_inst_jobs DB migrations
cd /var/www/canvas
RAILS_ENV=production bundle exec rake switchman_inst_jobs:install:migrations

# start everything back up
systemctl start canvas_init
systemctl start apache2
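
To confirm the migrations were actually picked up, the standard Rails migration status task can be used (anything listed as "up" has been applied):

cd /var/www/canvas
RAILS_ENV=production bundle exec rake db:migrate:status | tail -n 20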

The issue (at least after the "fix" has been applied) is intermittent in at least some cases and appears to be some sort of race condition. Unfortunately, because the issue is intermittent and the only specific error message I've seen appears to be a red herring, it's particularly difficult to isolate the cause.

I have asked one of my colleagues to investigate the issue further, but so far we have made no progress. I plan to rebuild our Canvas server from scratch and carefully document the issue on a fresh server ASAP. After confirming that it is nothing we're overlooking on our end, I will lodge a bug report upstream.

JedMeister commented 1 month ago

I am still having issues sending emails, but I have got the background task runners running reliably. The issue was that the delayed job runners were exceeding their configured per-worker memory limit and crashing - which explains the "out of memory" errors despite plenty of free system memory.

To resolve that, edit the /var/www/canvas/config/delayed_jobs.yml config file and update the value of worker_max_memory_usage to 1073741824, i.e. 1GiB (the default is 536870912, i.e. 512MiB). Be careful not to change the leading whitespace on that (or any other) line, as YAML files are whitespace sensitive.

Once you're done, the updated line should look like this:

  worker_max_memory_usage: 1073741824
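
A quick way to verify the change (including that the indentation survived) is:

grep -n 'worker_max_memory_usage' /var/www/canvas/config/delayed_jobs.yml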

Then restart the services (restarting Apache is likely not required, but it's best to do it anyway after Canvas config changes):

systemctl restart canvas_init apache2
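
To check that the workers have come back up (and stay up), something like this should do - note that the process name match is a guess based on how the runners show up in my process list:

systemctl status canvas_init
ps -ef | grep '[d]elayed'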

Unfortunately the emails still don't seem to be sending though?! :( I'll continue on this tomorrow...

JedMeister commented 1 month ago

FWIW I have confirmed that the host can send emails successfully. Canvas itself is not sending the emails.
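
For reference, the sort of host-level check I mean is below - it assumes a working sendmail binary (e.g. from Postfix) on the host; substitute a real recipient address:

printf "Subject: TurnKey mail test\n\nTest body\n" | sendmail admin@example.com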

The background tasks are running - and being successfully processed. There are no errors noted in the delayed_jobs log or in the UI error log. No email jobs are showing in the jobs queue.
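
One more place worth checking is the inst-jobs failed_jobs table, in case the email jobs are failing rather than never being queued. As before, the database name is an assumption and the tag pattern is a guess at how mailer jobs are tagged - adjust to suit:

sudo -u postgres psql canvas_production \
    -c "SELECT id, tag, last_error FROM failed_jobs WHERE tag ILIKE '%mail%' ORDER BY id DESC LIMIT 10;"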