mapillary / OpenSfM

Open source Structure-from-Motion pipeline
https://www.opensfm.org/
BSD 2-Clause "Simplified" License

Reconstruction process hangs indefinitely #190

Closed pierotofy closed 6 years ago

pierotofy commented 7 years ago

This is related to https://github.com/OpenDroneMap/WebODM/issues/147 and several other reports that we have encountered with OpenDroneMap users.

When processing large datasets on computers with a limited amount of RAM, and with a value of processes > 1 (which forces the use of Pool, as in https://github.com/mapillary/OpenSfM/blob/d22b94c6a46f5be88857290363943f084be4a7f9/opensfm/dense.py#L54), the process hangs indefinitely. This can be reproduced sporadically (not reliably) with real-world inputs, but can be clearly simulated with the example below. There, the Killer class simulates an out-of-memory condition by sending a KILL signal to one of the subprocesses.

Notice that the Ended string is never printed and the process never terminates.

from multiprocessing import Pool
import time
import threading
from subprocess import check_output

def get_pid(name):
    return list(map(int, check_output(["pidof", name]).split()))

class Killer(threading.Thread):
    """Simulates the OOM killer by SIGKILLing one of the pool workers."""

    def run(self):
        print("Wait 3 seconds...")
        time.sleep(3)
        pids = get_pid('python')
        print(pids)
        print("Found " + str(len(pids)) + " python processes")
        print("Killing middle process")

        check_output(["kill", "-9", str(pids[int(len(pids) / 2)])])

        print("Processes killed, now waiting for termination...")

def f(x):
    time.sleep(6)
    return x * x

if __name__ == '__main__':
    p = Pool(3)
    print("Started")
    Killer().start()
    print(p.map(f, [1, 2, 3]))  # hangs forever once a worker is killed
    print("Ended")              # never reached

I haven't investigated possible solutions, but ideally opensfm should report to the user that the process ran out of memory and exit gracefully.

paulinus commented 7 years ago

Thanks for the minimal example!

From what I understand, the fact that signals are not handled when using Pool is due to a Python bug. Similar problems and workarounds are reported here and here.

We should implement one of those workarounds since, currently, the only way to stop the processes is SIGQUIT (^\).
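The SIGINT workaround usually takes the following shape (a sketch, not OpenSfM code): workers install SIG_IGN for SIGINT so that Ctrl-C reaches only the parent, and the parent uses map_async with a timeout, which, unlike a bare pool.map, stays interruptible:

```python
import signal
import time
from multiprocessing import Pool

def init_worker():
    # Workers ignore SIGINT so that Ctrl-C is delivered only to the parent.
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def f(x):
    time.sleep(0.1)
    return x * x

def run(values):
    pool = Pool(3, initializer=init_worker)
    try:
        # map_async().get(timeout) keeps the parent responsive to SIGINT,
        # unlike a bare pool.map(), which blocks uninterruptibly.
        return pool.map_async(f, values).get(timeout=60)
    except KeyboardInterrupt:
        pool.terminate()
        raise
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    print(run([1, 2, 3]))
```

On KeyboardInterrupt the parent terminates the pool explicitly instead of leaving orphaned workers behind.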

Apart from that, we were not terminating the Pools after using them in dense matching, which could leave a number of old processes hanging around. This should not happen anymore with the current master.

pierotofy commented 7 years ago

@paulinus I've looked into the issue further. Unfortunately the workarounds you linked only deal with SIGINT; there's no way to catch SIGKILL (which is what the OS sends when running out of memory).
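That SIGKILL cannot be caught is enforced by the kernel itself: any attempt to install a handler for it is rejected. A minimal demonstration (Unix only; the helper function is illustrative, not part of any library):

```python
import signal

def can_handle(signum):
    # The kernel refuses handlers for SIGKILL (and SIGSTOP); CPython
    # surfaces that refusal as an OSError (RuntimeError on Python 2).
    try:
        signal.signal(signum, lambda s, f: None)
    except (OSError, RuntimeError, ValueError):
        return False
    return True

print(can_handle(signal.SIGTERM))  # True: SIGTERM handlers are allowed
print(can_handle(signal.SIGKILL))  # False: SIGKILL cannot be handled
```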

My suggestion is to replace multiprocessing with loky (https://github.com/tomMoral/loky). If we were using Python 3 we could just use concurrent.futures, but since OpenSfM targets Python 2 loky seems like a good way to go:

Same minimal example with loky:

import time
import threading
from subprocess import check_output
from loky import ProcessPoolExecutor

def get_pid(name):
    return list(map(int, check_output(["pidof", name]).split()))

class Killer(threading.Thread):
    """Simulates the OOM killer by SIGKILLing one of the pool workers."""

    def run(self):
        print("Wait 3 seconds...")
        time.sleep(3)
        pids = get_pid('python')
        print(pids)
        print("Found " + str(len(pids)) + " python processes")
        print("Killing process")

        check_output(["kill", "-9", str(pids[int(len(pids) / 3)])])

        print("Processes killed, now waiting for termination...")

def f(x):
    time.sleep(6)
    return x * x

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=3) as executor:
        print("Started")
        Killer().start()
        results = executor.map(f, [1, 2, 3])
        print(list(results))  # raises BrokenProcessPool instead of hanging
        print("Ended")
Started
Wait 3 seconds...
[20322, 20321, 20320, 20318, 20317, 3042]
Found 6 python processes
Killing middle process
Processes killed, now waiting for termination...
Traceback (most recent call last):
  File "test.py", line 37, in <module>
    print(list(results))
  File "/usr/lib/python3.6/site-packages/loky/process_executor.py", line 769, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/lib/python3.6/site-packages/loky/_base.py", line 584, in result_iterator
    yield future.result()
  File "/usr/lib/python3.6/site-packages/loky/_base.py", line 431, in result
    return self.__get_result()
  File "/usr/lib/python3.6/site-packages/loky/_base.py", line 382, in __get_result
    raise self._exception
loky.process_executor.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

I can make a pull request for this, just wanted to hear your thoughts before going forward.
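Given the BrokenProcessPool exception above, the graceful-exit behavior the issue asks for reduces to catching it and reporting. A sketch using the stdlib concurrent.futures (Python 3), whose API loky mirrors; here the worker SIGKILLs itself to stand in for the OOM killer, and the error message is illustrative, not OpenSfM's:

```python
import os
import signal
import sys
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def crash(_):
    # Stand-in for the OOM killer: SIGKILL this worker process.
    os.kill(os.getpid(), signal.SIGKILL)

def run():
    try:
        with ProcessPoolExecutor(max_workers=2) as executor:
            list(executor.map(crash, [1, 2]))
    except BrokenProcessPool:
        # Report the dead worker and exit instead of hanging forever.
        print("error: a worker process died unexpectedly "
              "(the system may have run out of memory)")
        return 1
    return 0

if __name__ == '__main__':
    sys.exit(run())
```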

paulinus commented 7 years ago

Good find @pierotofy. I didn't know loky but it looks very good.

I'm worried about adding dependencies, especially because the project seems quite new. That being said, the change required to use it is small and easy to revert if loky ever goes unsupported.

On Mac, I'm getting RuntimeWarnings like this one:

/Users/paulinus/Library/Python/2.7/lib/python/site-packages/loky/backend/semlock.py:217: RuntimeWarning: semaphore are broken on OSX, release might increase its maximal value
  "increase its maximal value", RuntimeWarning)

when running the above example.

On Python 3: we need to keep supporting Python 2, but it would be great to start supporting Python 3 soon, too.

pierotofy commented 7 years ago

Thanks for pointing out the runtime warning; I'll do more testing on OSX before making changes.

pierotofy commented 7 years ago

Btw, this is a substitute for pidof, which is missing on OSX:

#!/usr/bin/env ruby

target = ARGV[0] || ""
ids = []
(`ps axc`).split("\n").each do |line|
  pid, tt, stat, time, command = line.chomp.split(" ", 5)
  ids << pid if target == command
end

exit 1 if ids.empty?
puts ids.join(" ")

pierotofy commented 6 years ago

In short, that warning means that we won't be able to cancel jobs.

On OSX, all the jobs will be en-queued at once, as the size of the Queue is not enforced. The size of the Queue is limited in ProcessPoolExecutor in order to allow you to cancel jobs. On OSX, you won't be able to cancel jobs because as soon as you submit them, they will be put in queue and thus considered running. The concurrent.futures API does not permit to kill running Futures. If you have no need to cancel jobs, you can ignore this Warning.

https://github.com/tomMoral/loky/issues/103

So the warning can be ignored. Unfortunately there's no way to catch the warning, since warnings.catch_warnings() is not thread-safe (https://stackoverflow.com/questions/12654267/python-3-catching-warnings-during-multiprocessing), so I was thinking of forking loky and manually commenting out that warning.

dakotabenjamin commented 6 years ago

EDIT not a problem here

paulinus commented 6 years ago

Good, closing this since @pierotofy implemented multiprocessing using loky. Thanks!