Closed (Minauras closed this 1 month ago)
I'd start by checking whether it's actually parallel that is hanging or Postgres, so

    Parallel.map(items) do |item|
      puts 'a'
      x = stuff # the real work for this item
      puts 'b'
      x
    end

or something like that. Then you can tell whether it's the DB work hanging or parallel itself.
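If plain puts markers are hard to line up across several processes, a timestamped variant can make the gap easier to spot. This is only a minimal sketch using the standard library: `mark` and `START` are hypothetical names that do not come from parallel, and it assumes a monotonic clock is available (as on Linux and macOS).

```ruby
# Hypothetical helper: prints a marker with the pid and the seconds since START.
START = Process.clock_gettime(Process::CLOCK_MONOTONIC)

def mark(label, out: $stdout)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - START
  out.puts format("[pid %d +%.3fs] %s", Process.pid, elapsed, label)
  out.flush # flush so lines from forked workers are not held back in a buffer
  elapsed
end

mark("a")
mark("b")
```

Calling `mark` in place of each puts should show which pid printed which marker and how far apart they were.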
Thanks for the answer, I did something like

    Parallel.map(x, in_processes: 5) do |y|
      puts "a"
      ret = my_function(y)
      puts "b"
      ret
    end
    puts "c"

The two-minute slowdown happens between "b" and "c", so I assume it's parallel hanging? Is there a way to see what parallel might be doing? Parallel doesn't have a 2min timeout or something, right?
The only thing it does there is send the data through a pipe ... which might hang for some weird Unix reason 😞

Try in_threads: 5 and see if you can reproduce (threads don't use pipes, so that should work ...). I hope that will show it's the "sending through the pipe" part. Next you can do

    puts "b"
    puts Marshal.dump(ret).size
    puts "c"

to see if the issue is serialization and whether the data is maybe very big.
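To check size and serialization time in one go, something like the following could be dropped in next to the puts lines. It is a sketch under assumptions: `marshal_cost` is a hypothetical helper, and the hash is only a stand-in for whatever the worker really returns.

```ruby
require "benchmark"

# Hypothetical helper: returns [bytes, seconds] for marshalling one value.
def marshal_cost(value)
  payload = nil
  seconds = Benchmark.realtime { payload = Marshal.dump(value) }
  [payload.bytesize, seconds]
end

ret = { id: 1, status: "ok" } # stand-in for a worker's return value
bytes, seconds = marshal_cost(ret)
puts format("%d bytes, serialized in %.6f s", bytes, seconds)
```

A small byte count together with a negligible time would suggest that serialization itself is not where the two minutes go.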
Thanks a lot!

in_threads doesn't work for me; it throws an error:

    block in wait_poll: could not obtain a database connection within 5.000 seconds (waited 5.003 seconds) (ActiveRecord::ConnectionTimeoutError)

(I have an ActiveRecord::Base.connection_pool.with_connection block in the process, but I checked and it doesn't take any time to exit that block.)

I also tried .each instead of .map and ignoring the output, but the same 2min timeout stayed at the same stage.

puts Marshal.dump(ret).size returns 59 in less than a second; my return values are just small hashes. So it seems it wouldn't be an issue with the pipe or with the serialization?
yeah the .each still sends things over the pipe, so that could still be the issue
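For readers unfamiliar with what "sending things over the pipe" means here, the general pattern looks roughly like this. It is a standard-library illustration of a fork-and-pipe round trip on a Unix-like system, not the gem's actual code; `run_in_child` is a hypothetical name.

```ruby
# Hypothetical helper: runs the block in a forked child and returns its result,
# which travels back to the parent as a Marshal payload over a pipe.
def run_in_child
  reader, writer = IO.pipe
  $stdout.flush # keep buffered parent output out of the child
  pid = Process.fork do
    reader.close                        # the child only writes
    writer.write(Marshal.dump(yield))   # serialize the block's result
    writer.close
    exit!(0)                            # leave the child right away
  end
  writer.close                          # the parent only reads
  data = reader.read                    # blocks until every write end is closed
  reader.close
  Process.wait(pid)                     # reap the child
  Marshal.load(data)
end

p run_in_child { { answer: 42 } }
```

The parent's read only returns once the child has closed its end of the pipe, which is why a child that is slow to exit can look like the parent hanging.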
Next I'd run bundle open parallel and start dropping some puts around Worker.work (line 73) and wait (line 91) to see if they show reading from the pipe being stuck. Maybe also line 601, inside the ensure, to see if closing the pipe hangs.
ideally find a way to make the hanging part shareable, but that might be hard 😞
Hi, updating this, thanks for the help!
I wasn't able to investigate in the gem itself, as it's not possible in my environment. However, I found that the issue does not come from the parallel gem.
I replaced

    Parallel.each(x, in_processes: 5) do |y|
      puts "a"
      ret = my_function(y)
      puts "b"
      ret
    end
    puts "c"

with

    x.each do |y|
      Process.fork do
        puts "a"
        ret = my_function(y)
        puts "b"
      end
    end
    Process.waitall
    puts "c"
And the issue was the same, 2 minutes of hanging between "b" and "c", so the issue is reproducible without parallel.
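To see whether every child lingers or only some of them, the fork version can be extended to record when each child is actually reaped. This is a sketch under assumptions: `time_child_exits` is a hypothetical helper, and the sleep in the example only stands in for the real work.

```ruby
# Hypothetical helper: forks `count` children, runs the block in each, and
# reports how long after the fork each child was reaped by the parent.
def time_child_exits(count)
  $stdout.flush # avoid children re-printing buffered parent output
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  pids = Array.new(count) { |i| Process.fork { yield i } }

  pids.map do
    pid, status = Process.wait2 # blocks until any one child has exited
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    { pid: pid, exitstatus: status.exitstatus, seconds: elapsed }
  end
end

time_child_exits(3) { |i| sleep(0.1 * i) }.each do |row|
  puts format("pid %d exited with %d after %.2f s",
              row[:pid], row[:exitstatus], row[:seconds])
end
```

Because the children exit normally here, any exit-time work still runs, so a long delay between a child's "b" and its reported exit time would point at process shutdown and not at the computation.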
Then, since seemingly nothing was happening while the program was hanging, I tried exiting the subprocesses early, after "b":
    x.each do |y|
      Process.fork do
        puts "a"
        ret = my_function(y)
        puts "b"
        abort
      end
    end
    Process.waitall
    puts "c"
This changes nothing. However, when using exit! instead of abort:

    x.each do |y|
      Process.fork do
        puts "a"
        ret = my_function(y)
        puts "b"
        exit!
      end
    end
    Process.waitall
    puts "c"

the hanging disappears.
Apparently the difference between exit! and abort is that exit! skips at_exit callbacks, so it might have something to do with at_exit. However, I tried what was described in this thread and found that no at_exit callback was registered by any gem.
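The difference in at_exit handling can be checked in isolation with a small standard-library script. This is a sketch, not a diagnosis of the original hang: `at_exit_output` is a hypothetical helper, and the handler here is registered by the script itself, not by any gem.

```ruby
# Hypothetical helper: forks a child that registers an at_exit handler,
# terminates it via `how`, and returns what the handler wrote to the pipe.
def at_exit_output(how)
  reader, writer = IO.pipe
  $stdout.flush # keep buffered parent output out of the child
  pid = Process.fork do
    reader.close
    at_exit do
      writer.write("at_exit ran")
      writer.close
    end
    case how
    when :exit_bang then exit!(0) # leaves immediately, handlers are skipped
    when :abort     then abort    # exit status 1, handlers still run
    else                 exit(0)  # normal exit, handlers run
    end
  end
  writer.close
  output = reader.read # returns once the child's end of the pipe is closed
  reader.close
  Process.wait(pid)
  output
end

%i[exit abort exit_bang].each do |how|
  puts "#{how}: #{at_exit_output(how).inspect}"
end
```

The exit and abort cases should report the handler's message, while the exit! case should come back empty, which matches the behaviour described above.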
This problem seems to go beyond what I'm capable of debugging, so I'm happy to settle for exit! as a workaround, though I still have no idea what the underlying issue is.
Thanks a lot for the help!
I have a script that periodically runs some heavy computation in 5 processes, and I'm measuring how long each run takes to complete.
The computation involves querying from a Postgres DB and a Little Table DB and some processing of that data, if that's important.
Sometimes, the run takes ~15s to complete, but sometimes it takes ~2min15s to complete.
When looking at the logs, I find that the processes don't do more computation during those runs, but after they're done computing and have exited their function, Parallel.map hangs for 2min, seemingly doing nothing, before letting the rest of the script run.
What could be happening here? Any idea as to what I should investigate? Unfortunately I cannot share a reproducible example. Thanks for any help!