sighmon / mjml-rails

MJML + ERb/Haml/Slim view template
https://mjml.io

SIGKILL (signal 9) on many concurrent emails #96

Open DonGiulio opened 2 years ago

DonGiulio commented 2 years ago

Hello,

In our production environment on platform.sh we are getting lots of errors like (process status: pid 96294 SIGKILL (signal 9)).

I checked the source of the gem and tracked the issue down to something involving Open3.

Running mjml many times concurrently makes it really slow (it takes many minutes to complete), and I believe this causes Open3 to kill the tasks for some reason.

Here's how I managed to reproduce the issue on a Linux-based platform.sh PaaS environment:

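require "open3"
require "fileutils"

pids = []
times = {}
infiles = []
40.times do |i|
  infile = "/tmp/in_#{i}.mjml"
  command = "/app/node_modules/.bin/mjml -r #{infile} -o /tmp/out_#{i}.html"
  FileUtils.cp("/tmp/in.mjml", infile)
  infiles << infile
  puts "forking #{i}"
  times[i] = Time.now
  # Process.fork returns the child's pid only in the parent, so record it here.
  # Arrays filled inside the child are not visible to the parent.
  pid = Process.fork do
    result = Open3.capture3(command)
  ensure
    puts "exited #{i}, result: #{result}, took #{Time.now - times[i]}"
  end
  pids << pid
end

puts "parent, pid #{Process.pid}, waiting on child pids #{pids}"
# waitall returns an array of [pid, Process::Status] pairs.
statuses = Process.waitall
p statuses
# Remove the input files only after every child has finished with them.
FileUtils.rm_f(infiles)
puts "parent exiting"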

It hangs for a few minutes on the Process.waitall call and then outputs something like:

[[15687, #<Process::Status: pid 15687 SIGKILL (signal 9)>],
 [15700, #<Process::Status: pid 15700 exit 0>],
 [15714, #<Process::Status: pid 15714 SIGKILL (signal 9)>],
 [15774, #<Process::Status: pid 15774 SIGKILL (signal 9)>],
 [15817, #<Process::Status: pid 15817 SIGKILL (signal 9)>],
 [15895, #<Process::Status: pid 15895 SIGKILL (signal 9)>],
 [15903, #<Process::Status: pid 15903 SIGKILL (signal 9)>],
 [15942, #<Process::Status: pid 15942 SIGKILL (signal 9)>],
 [15961, #<Process::Status: pid 15961 SIGKILL (signal 9)>],
 [15969, #<Process::Status: pid 15969 SIGKILL (signal 9)>],
 [15977, #<Process::Status: pid 15977 SIGKILL (signal 9)>],
 [16013, #<Process::Status: pid 16013 SIGKILL (signal 9)>],
 [16027, #<Process::Status: pid 16027 SIGKILL (signal 9)>],
 [16037, #<Process::Status: pid 16037 SIGKILL (signal 9)>],
 [16044, #<Process::Status: pid 16044 SIGKILL (signal 9)>],
 [16196, #<Process::Status: pid 16196 SIGKILL (signal 9)>],
 [16215, #<Process::Status: pid 16215 SIGKILL (signal 9)>],
 [16020, #<Process::Status: pid 16020 SIGKILL (signal 9)>],
 [16034, #<Process::Status: pid 16034 SIGKILL (signal 9)>],
 [16160, #<Process::Status: pid 16160 SIGKILL (signal 9)>],
 [15882, #<Process::Status: pid 15882 SIGKILL (signal 9)>],
 [15922, #<Process::Status: pid 15922 SIGKILL (signal 9)>],
 [15934, #<Process::Status: pid 15934 SIGKILL (signal 9)>],
 [16000, #<Process::Status: pid 16000 SIGKILL (signal 9)>],
 [16176, #<Process::Status: pid 16176 SIGKILL (signal 9)>],
 [15675, #<Process::Status: pid 15675 SIGKILL (signal 9)>],
 [15718, #<Process::Status: pid 15718 SIGKILL (signal 9)>],
 [15731, #<Process::Status: pid 15731 SIGKILL (signal 9)>],
 [15744, #<Process::Status: pid 15744 SIGKILL (signal 9)>],
 [15756, #<Process::Status: pid 15756 SIGKILL (signal 9)>],
 [15791, #<Process::Status: pid 15791 SIGKILL (signal 9)>],
 [15804, #<Process::Status: pid 15804 SIGKILL (signal 9)>],
 [15826, #<Process::Status: pid 15826 SIGKILL (signal 9)>],
 [15843, #<Process::Status: pid 15843 SIGKILL (signal 9)>],
 [15852, #<Process::Status: pid 15852 SIGKILL (signal 9)>],
 [15864, #<Process::Status: pid 15864 SIGKILL (signal 9)>],
 [16105, #<Process::Status: pid 16105 SIGKILL (signal 9)>],
 [16132, #<Process::Status: pid 16132 SIGKILL (signal 9)>],
 [16142, #<Process::Status: pid 16142 SIGKILL (signal 9)>],
 [16099, #<Process::Status: pid 16099 SIGKILL (signal 9)>]]

I initially thought this was caused by a timeout, but I also got the same batch of SIGKILLs just a few seconds after starting the pool.
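For anyone trying to tell a signal kill apart from a normal exit programmatically, here is a minimal sketch using only Ruby's standard Process::Status methods. The `describe` helper is a hypothetical name for illustration; it does not explain who sends the signal, only how to read it from the status.

```ruby
# Hypothetical helper: summarise how a child process ended.
def describe(status)
  if status.signaled?
    "killed by signal #{status.termsig} (#{Signal.signame(status.termsig)})"
  else
    "exited with code #{status.exitstatus}"
  end
end

# Fork a child that sleeps forever, kill it with SIGKILL, then inspect it.
pid = Process.fork { sleep }
Process.kill("KILL", pid)
_, status = Process.wait2(pid)

puts describe(status)
```

The same check could be applied to each pair returned by Process.waitall in the script above.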

I believe this reproduces the issue we are having in prod, where trying to send several emails at the same time (from Sidekiq workers) causes them to fail with SIGKILL.
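If the slowdown really comes from too many mjml processes running at once, one thing to try is capping how many run in parallel. This is only a sketch under that assumption: `run_bounded` and its `limit:` parameter are hypothetical names, not part of mjml-rails, and I have not verified that this avoids the SIGKILLs on platform.sh.

```ruby
require "open3"

# Hypothetical helper: run shell commands with at most `limit` running at once.
# Returns [command, Process::Status] pairs in the same order as the input.
def run_bounded(commands, limit:)
  queue = Queue.new
  commands.each_with_index { |cmd, i| queue << [cmd, i] }
  queue.close # once closed and drained, pop returns nil and workers stop

  results = Array.new(commands.size)
  workers = Array.new(limit) do
    Thread.new do
      while (job = queue.pop)
        cmd, i = job
        _out, _err, status = Open3.capture3(cmd)
        results[i] = [cmd, status]
      end
    end
  end
  workers.each(&:join)
  results
end

# Stand-in commands; in the repro these would be the 40 mjml invocations.
results = run_bounded(["echo a", "echo b", "echo c"], limit: 2)
results.each { |cmd, status| puts "#{cmd}: #{status}" }
```

Each worker thread blocks inside Open3.capture3 while its child runs, so no more than `limit` children exist at any time.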

sighmon commented 2 years ago

@DonGiulio Is this any help? https://github.com/sighmon/mjml-rails/pull/95/files

DonGiulio commented 2 years ago

Hello, sorry for taking so long to respond.

I tried using the gem from the GitHub pull request.

With it, I had to wait a very long time (over half an hour) for the HTML files to be ready.

In the end, the tasks were still killed (all 40 in this run):

=> [[1351, #<Process::Status: pid 1351 SIGKILL (signal 9)>],
 [1363, #<Process::Status: pid 1363 SIGKILL (signal 9)>],
 [1379, #<Process::Status: pid 1379 SIGKILL (signal 9)>],
 [1387, #<Process::Status: pid 1387 SIGKILL (signal 9)>],
 [1459, #<Process::Status: pid 1459 SIGKILL (signal 9)>],
 [1470, #<Process::Status: pid 1470 SIGKILL (signal 9)>],
 [1487, #<Process::Status: pid 1487 SIGKILL (signal 9)>],
 [1494, #<Process::Status: pid 1494 SIGKILL (signal 9)>],
 [1500, #<Process::Status: pid 1500 SIGKILL (signal 9)>],
 [1567, #<Process::Status: pid 1567 SIGKILL (signal 9)>],
 [1582, #<Process::Status: pid 1582 SIGKILL (signal 9)>],
 [1586, #<Process::Status: pid 1586 SIGKILL (signal 9)>],
 [1632, #<Process::Status: pid 1632 SIGKILL (signal 9)>],
 [1638, #<Process::Status: pid 1638 SIGKILL (signal 9)>],
 [1746, #<Process::Status: pid 1746 SIGKILL (signal 9)>],
 [1398, #<Process::Status: pid 1398 SIGKILL (signal 9)>],
 [1341, #<Process::Status: pid 1341 SIGKILL (signal 9)>],
 [1764, #<Process::Status: pid 1764 SIGKILL (signal 9)>],
 [1513, #<Process::Status: pid 1513 SIGKILL (signal 9)>],
 [1554, #<Process::Status: pid 1554 SIGKILL (signal 9)>],
 [1641, #<Process::Status: pid 1641 SIGKILL (signal 9)>],
 [1739, #<Process::Status: pid 1739 SIGKILL (signal 9)>],
 [1447, #<Process::Status: pid 1447 SIGKILL (signal 9)>],
 [1721, #<Process::Status: pid 1721 SIGKILL (signal 9)>],
 [1543, #<Process::Status: pid 1543 SIGKILL (signal 9)>],
 [1317, #<Process::Status: pid 1317 SIGKILL (signal 9)>],
 [1709, #<Process::Status: pid 1709 SIGKILL (signal 9)>],
 [1699, #<Process::Status: pid 1699 SIGKILL (signal 9)>],
 [1592, #<Process::Status: pid 1592 SIGKILL (signal 9)>],
 [1609, #<Process::Status: pid 1609 SIGKILL (signal 9)>],
 [1310, #<Process::Status: pid 1310 SIGKILL (signal 9)>],
 [1435, #<Process::Status: pid 1435 SIGKILL (signal 9)>],
 [1524, #<Process::Status: pid 1524 SIGKILL (signal 9)>],
 [1656, #<Process::Status: pid 1656 SIGKILL (signal 9)>],
 [1686, #<Process::Status: pid 1686 SIGKILL (signal 9)>],
 [1296, #<Process::Status: pid 1296 SIGKILL (signal 9)>],
 [1416, #<Process::Status: pid 1416 SIGKILL (signal 9)>],
 [1307, #<Process::Status: pid 1307 SIGKILL (signal 9)>],
 [1427, #<Process::Status: pid 1427 SIGKILL (signal 9)>],
 [1674, #<Process::Status: pid 1674 SIGKILL (signal 9)>]]

However, it seems that all the HTML files were created:

 ls -1 /tmp/*.html  | wc -l
40

and they were all identical.

sighmon commented 2 years ago

@DonGiulio If you can't get anywhere with your current approach, can you pre-generate MJML > HTML templates in background tasks and then use those templates to be populated for realtime sending?
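A rough sketch of what that could look like: compile each template once, cache the HTML on disk, and only substitute per-recipient values at send time. Everything named here is an assumption for illustration: `cached_html` and `populate` are hypothetical helpers, not mjml-rails APIs, the `{{name}}` placeholder syntax is just a convention chosen for this example, and the block body stands in for the real (slow) mjml call.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: compile a template once and cache the HTML on disk.
# The block does the expensive work (for example, shelling out to mjml).
def cached_html(name, cache_dir:)
  path = File.join(cache_dir, "#{name}.html")
  return File.read(path) if File.exist?(path)

  html = yield
  FileUtils.mkdir_p(cache_dir)
  File.write(path, html)
  html
end

# Hypothetical helper: replace {{key}} placeholders with per-recipient values.
def populate(html, values)
  html.gsub(/\{\{(\w+)\}\}/) { values.fetch(Regexp.last_match(1).to_sym).to_s }
end

cache_dir = Dir.mktmpdir
compiles = 0
2.times do
  html = cached_html("welcome", cache_dir: cache_dir) do
    compiles += 1
    "<p>Hello {{name}}</p>" # stand-in for real mjml output
  end
  puts populate(html, name: "Ada")
end
puts "compiled #{compiles} time(s)"
```

With this shape, the expensive compile step runs only when the cache is empty, so concurrent senders would read a file instead of each starting their own mjml process. Real values would also need HTML escaping before substitution.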