Closed pierotofy closed 6 years ago
Thanks for the minimal example!
From what I understand, the fact that signals are not handled when using `Pool` is due to a Python bug. Similar problems and workarounds are reported here and here.
We should implement one of the workarounds since, currently, the only way of stopping the processes is sending a SIGQUIT (`^\`).
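For reference, the workaround usually suggested for this class of bug is to have the workers ignore SIGINT so that only the parent reacts to Ctrl-C and can terminate the pool. A minimal sketch (not OpenSfM code, just the general pattern):

```python
import signal
from multiprocessing import Pool


def init_worker():
    # Worker processes ignore SIGINT so only the parent
    # receives Ctrl-C and can tear the pool down cleanly.
    signal.signal(signal.SIGINT, signal.SIG_IGN)


def square(x):
    return x * x


if __name__ == "__main__":
    pool = Pool(processes=2, initializer=init_worker)
    try:
        results = pool.map(square, [1, 2, 3])
        print(results)
    except KeyboardInterrupt:
        pool.terminate()  # kill workers immediately on Ctrl-C
    else:
        pool.close()
    pool.join()
```

Note this only helps with SIGINT; it does nothing for workers killed by the OS.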
Apart from that, we were not terminating the Pools after using them in dense matching, which could leave a number of old processes hanging around. That should not happen anymore with current master.
@paulinus I've looked into the issue further. Unfortunately the workarounds you linked would only deal with SIGINT; there's no way to catch SIGKILL (which is what the OS sends when running out of memory).
My suggestion is to replace multiprocessing with loky (https://github.com/tomMoral/loky). If we were using Python 3 we could just use concurrent.futures, but since OpenSfM targets Python 2 loky seems like a good way to go:
Same minimal example with loky:
```python
import time
import threading
from subprocess import check_output

from loky import ProcessPoolExecutor


def get_pid(name):
    return list(map(int, check_output(["pidof", name]).split()))


class Killer(threading.Thread):
    def run(self):
        print("Wait 3 seconds...")
        time.sleep(3)
        pids = get_pid('python')
        print(pids)
        print("Found " + str(len(pids)) + " python processes")
        print("Killing middle process")
        check_output(["kill", "-9", str(pids[int(len(pids) / 3)])])
        print("Processes killed, now waiting for termination...")


def f(x):
    time.sleep(6)
    return x * x


if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=3) as executor:
        print("Started")
        Killer().start()
        results = executor.map(f, [1, 2, 3])
        print(list(results))
    print("Ended")
```
```
Started
Wait 3 seconds...
[20322, 20321, 20320, 20318, 20317, 3042]
Found 6 python processes
Killing middle process
Processes killed, now waiting for termination...
Traceback (most recent call last):
  File "test.py", line 37, in <module>
    print(list(results))
  File "/usr/lib/python3.6/site-packages/loky/process_executor.py", line 769, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/lib/python3.6/site-packages/loky/_base.py", line 584, in result_iterator
    yield future.result()
  File "/usr/lib/python3.6/site-packages/loky/_base.py", line 431, in result
    return self.__get_result()
  File "/usr/lib/python3.6/site-packages/loky/_base.py", line 382, in __get_result
    raise self._exception
loky.process_executor.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
```
I can make a pull request for this, just wanted to hear your thoughts before going forward.
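To illustrate what the graceful-failure path could look like, here is a sketch using Python 3's stdlib `concurrent.futures` (loky mirrors this API). The function names are hypothetical, and worker death is simulated with `os._exit` instead of an actual out-of-memory kill:

```python
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def crash(_):
    # Simulate the OS OOM killer: the worker dies abruptly,
    # without raising a catchable Python exception.
    os._exit(1)


def run_with_oom_report():
    try:
        with ProcessPoolExecutor(max_workers=2) as executor:
            list(executor.map(crash, range(4)))
        return "completed"
    except BrokenProcessPool:
        # Translate the pool failure into an actionable message
        # instead of hanging or dumping a raw traceback.
        return "a worker process died unexpectedly (possibly out of memory)"


if __name__ == "__main__":
    print(run_with_oom_report())
```

The key point is that the exception is actually raised, so the parent can report the failure and exit instead of hanging forever.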
Good find @pierotofy. I didn't know loky but it looks very good.
I'm worried about adding dependencies, especially because the project seems quite new. That being said, the change required to use it is small and easy to revert if loky ever goes unsupported.
On Mac, I'm getting RuntimeWarnings like this one when running the above example:

```
/Users/paulinus/Library/Python/2.7/lib/python/site-packages/loky/backend/semlock.py:217: RuntimeWarning: semaphore are broken on OSX, release might increase its maximal value
  "increase its maximal value", RuntimeWarning)
```
On Python 3: we need to keep supporting Python 2, but it would be great to start supporting Python 3 soon too.
Thanks for pointing out the runtime warning; I'll do more testing on OSX prior to making changes.
Btw, this is a substitute for `pidof`, which is missing on OSX:
```ruby
#!/usr/bin/env ruby
target = ARGV[0] || ""
ids = []
`ps axc`.split("\n").each do |line|
  pid, tt, stat, time, command = line.chomp.split(" ", 5)
  if target == command
    ids << pid
  end
end
exit 1 if ids.empty?
puts ids.join(" ")
```
In short, that warning means that we won't be able to cancel jobs:

> On OSX, all the jobs will be en-queued at once, as the size of the Queue is not enforced. The size of the Queue is limited in ProcessPoolExecutor in order to allow you to cancel jobs. On OSX, you won't be able to cancel jobs because as soon as you submit them, they will be put in queue and thus considered running. The concurrent.futures API does not permit to kill running Futures. If you have no need to cancel jobs, you can ignore this Warning.

https://github.com/tomMoral/loky/issues/103
So the warning can be ignored. Unfortunately there's no way to catch it, since `warnings.catch_warnings()` is not thread-safe (https://stackoverflow.com/questions/12654267/python-3-catching-warnings-during-multiprocessing), so I was thinking of making a fork of loky and manually commenting out that warning.
EDIT not a problem here
Good, closing this since @pierotofy implemented multiprocessing using loky. Thanks!
This is related to https://github.com/OpenDroneMap/WebODM/issues/147 and several other reports that we have encountered from OpenDroneMap users.
When processing large datasets on computers with a limited amount of RAM, while using a value of `processes` > 1 (forcing the use of `Pool`, such as in https://github.com/mapillary/OpenSfM/blob/d22b94c6a46f5be88857290363943f084be4a7f9/opensfm/dense.py#L54), the process hangs indefinitely. This can be reproduced sporadically (not reliably) with real-world inputs, but can be clearly simulated with this example. In this example, the `Killer` class simulates an out-of-memory error by sending a KILL signal to one of the subprocesses. Notice that the `Ended` string is never printed and the process never terminates. I haven't investigated possible solutions, but ideally OpenSfM should report to the user that the process ran out of memory and exit gracefully.