uqfoundation / pathos

parallel graph management and execution in heterogeneous computing
http://pathos.rtfd.io

ProcessPool Inaccurate Traceback #238

Closed tinducvo closed 2 years ago

tinducvo commented 2 years ago

I was trying to get a proper traceback, and I think I encountered an error.

In this code, I trigger a division-by-zero error in a worker and print it out in the parent process.

# from pathos.pools import ThreadPool as ProcessPool
from pathos.pools import ProcessPool as ProcessPool

def function(x):
    return 1/x

def errorReturnWrapper(func,
                       printTraceback = False):
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except KeyboardInterrupt:
            print("Worker got keyboard interrupt somehow")
        except Exception as e:
            import traceback
            error = str(e)
            tb = traceback.format_exc()
            return type(e)(f"{e}\n\nOriginal {tb}")
    return wrapper

if __name__ == "__main__":
    pool = ProcessPool(nodes = 1)
    results = pool.map(errorReturnWrapper(function), [0,1,2,3,4])
    print(results[0])

The error is as expected with ThreadPool:

division by zero

Original Traceback (most recent call last):
  File "**redacted**\exception.py", line 11, in wrapper
    return func(*args, **kwargs)
  File "**redacted**\exception.py", line 5, in function
    return 1/x
ZeroDivisionError: division by zero

But offset by 1 line with ProcessPool:



Original Traceback (most recent call last):
  File "**redacted**\exception.py", line 12, in wrapper
    except KeyboardInterrupt:
  File "**redacted**\exception.py", line -13, in function
ZeroDivisionError: division by zero

mmckerns commented 2 years ago

Can you tell me which versions of Python, pathos, multiprocess, etc you are seeing the behavior in?

I'm not seeing the error in python 3.7, 3.8, or 3.9:

$ python3.7 div0err.py 
division by zero

Original Traceback (most recent call last):
  File "div0err.py", line 11, in wrapper
    return func(*args, **kwargs)
  File "div0err.py", line 5, in function
    return 1/x
ZeroDivisionError: division by zero
$ python3.7 -c "import pathos; print(pathos.__version__)" 
0.3.0.dev0
$ python3.7 -c "import multiprocess as mp; print(mp.__version__)" 
0.70.14.dev0
$ python3.7 -c "import dill; print(dill.__version__)" 
0.3.6.dev0

However, in Python 3.10, I do see an error:

$ python3.10 div0err.py 
division by zero

Original Traceback (most recent call last):
  File "/Users/mmckerns/div0err.py", line 12, in wrapper
    except KeyboardInterrupt:
  File "/Users/mmckerns/div0err.py", line 100, in function
ZeroDivisionError: division by zero

I still see the error in Python 3.10 when going directly through multiprocess (from multiprocess import Pool as ProcessPool), so this looks like a bug in the Python 3.10 code in multiprocess.

tinducvo commented 2 years ago

Thanks for the prompt reply!

Python: 3.10.2, pathos: 0.2.8, multiprocess: 0.70.12.2, dill: 0.3.4

mmckerns commented 2 years ago

Hmm... if I add a print to check the original traceback directly (before it is sent to multiprocess), I still see the error:

# from pathos.pools import ThreadPool as ProcessPool
# from pathos.pools import ProcessPool as ProcessPool
from multiprocess import Pool as ProcessPool

def function(x):
    return 1/x

def errorReturnWrapper(func,
                       printTraceback = False):
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except KeyboardInterrupt:
            print("Worker got keyboard interrupt somehow")
        except Exception as e:
            import traceback
            error = str(e)
            tb = traceback.format_exc()
            print(tb)
            return type(e)(f"{e}\n\nOriginal {tb}")
    return wrapper

if __name__ == "__main__":
    pool = ProcessPool(1)#nodes = 1)
    results = pool.map(errorReturnWrapper(function), [0,1,2,3,4])
    print(results[0])

$ python div0err.py
Traceback (most recent call last):
  File "/Users/mmckerns/div0err.py", line 13, in wrapper
    except KeyboardInterrupt:
  File "/Users/mmckerns/div0err.py", line 132, in function
ZeroDivisionError: division by zero

division by zero

Original Traceback (most recent call last):
  File "/Users/mmckerns/div0err.py", line 13, in wrapper
    except KeyboardInterrupt:
  File "/Users/mmckerns/div0err.py", line 132, in function
ZeroDivisionError: division by zero

That's interesting: the wrong line numbers already show up in the output of tb = traceback.format_exc() inside the worker, before the result is handed back through multiprocess.
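
Rather than comparing formatted tracebacks by eye, the reported line numbers can be checked programmatically. The following is a minimal sketch using only the standard library; the helper name `frame_lines` is hypothetical and not part of pathos, multiprocess, or dill. It assumes the `return` statement sits one line below the `def`, as in the script above.

```python
import traceback


def function(x):
    return 1 / x  # the line the innermost frame should point at


def frame_lines(func, *args):
    """Call func and return [(function_name, lineno), ...] for the frames
    of any exception it raises, with the innermost frame last."""
    try:
        func(*args)
    except Exception as exc:
        summary = traceback.extract_tb(exc.__traceback__)
        return [(frame.name, frame.lineno) for frame in summary]
    return []


if __name__ == "__main__":
    frames = frame_lines(function, 0)
    # "return 1 / x" is one line below "def function"
    expected = function.__code__.co_firstlineno + 1
    print("innermost frame:", frames[-1], "expected line:", expected)
```

Running the same check on a function after it has passed through the serializer under test would show whether the offset is introduced there.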

mmckerns commented 2 years ago

Can you try modifying your code to throw a ZeroDivisionError with multiprocessing (not multiprocess) and return the traceback as a string? You'd need to make sure the function that you are sending will serialize.
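
Whether a function will serialize with the standard library can be checked before handing it to a pool. This is a sketch under the assumption that multiprocessing sends callables with plain pickle, which stores functions by name; the helper names `can_pickle`, `make_wrapper`, and `traceback_as_string` are made up for illustration.

```python
import pickle
import traceback


def function(x):
    return 1 / x


def traceback_as_string(x):
    """Defined at module level, so pickle can refer to it by name."""
    try:
        return function(x)
    except Exception:
        return traceback.format_exc()


def make_wrapper(func):
    def wrapper(*args, **kwargs):  # nested: pickle cannot look it up by name
        return func(*args, **kwargs)
    return wrapper


def can_pickle(obj):
    """Return True if obj survives a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print("module-level function:", can_pickle(traceback_as_string))
    print("nested wrapper:", can_pickle(make_wrapper(function)))
    print(traceback_as_string(0))
```

A function that passes this check and returns `traceback.format_exc()` as a string is the kind of candidate to send through `multiprocessing.Pool.map`.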

tinducvo commented 2 years ago

I just tried it. With multiprocessing, the wrapped function cannot be serialized, because the nested wrapper is a local function that pickle cannot handle:

from multiprocessing import Pool
import traceback

def function(x):
    return 1/x

def error_return_wrapper(func):
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            return str(e)
    return wrapper

if __name__ == '__main__':
    mapped_function = error_return_wrapper(function)
    with Pool(5) as p:
        print(p.map(mapped_function, [1, 2, 3]))

tinducvo commented 2 years ago

It works if I don't use the wrapper with multiprocessing:

from multiprocessing import Pool
import traceback

def function(x):
    return 1/x

def error_return_wrapper(func):
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            return str(e)
    return wrapper

def modified_function(x):
    try:
        return 1/x
    except Exception as e:
        import traceback
        error = str(e)
        tb = traceback.format_exc()
        return type(e)(f"{e}\n\nOriginal {tb}")

if __name__ == '__main__':
    # mapped_function = error_return_wrapper(function)
    mapped_function = modified_function
    with Pool(5) as p:
        print(p.map(mapped_function, [0, 1, 2, 3]))

I also tried the version without the wrapper using pathos (and hence multiprocess). That still had the offset:

# from pathos.pools import ThreadPool as ProcessPool
from pathos.pools import ProcessPool as ProcessPool

def function(x):
    return 1/x

def errorReturnWrapper(func,
                       printTraceback = False):
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except KeyboardInterrupt:
            print("Worker got keyboard interrupt somehow")
        except Exception as e:
            import traceback
            error = str(e)
            tb = traceback.format_exc()
            return type(e)(f"{e}\n\nOriginal {tb}")
    return wrapper

def modified_function(x):
    try:
        return 1/x
    except Exception as e:
        import traceback
        error = str(e)
        tb = traceback.format_exc()
        return type(e)(f"{e}\n\nOriginal {tb}")

if __name__ == "__main__":
    pool = ProcessPool(nodes = 1)
    results = pool.map(modified_function, [0,1,2,3,4])
    print(results[0])

mmckerns commented 2 years ago

Thanks. I'm also seeing the same results with the versions I've quoted above.

mmckerns commented 2 years ago

If you disable dill in multiprocess (and use pickle instead), making no other changes, then the error goes away. So I think the issue is rooted in pickling the traceback in Python 3.10. It seems that this is either a dill bug or a bug in multiprocessing (and hence multiprocess) that dill uncovers. I'm guessing the former.

mmckerns commented 2 years ago

Indeed, this is a dill bug.

import traceback
import dill
#dill.settings['recurse'] = True
#dill.detect.trace(True)

def function(x):
    return 1/x

def modified_function(x):
    try:
        return 1/x
    except Exception as e:
        import traceback
        error = str(e)
        tb = traceback.format_exc()
        return type(e)(f"{e}\n\nOriginal {tb}")

if __name__ == '__main__':
    mapped_function = dill.loads(dill.dumps(modified_function))
    map_results = map(mapped_function, [0, 1, 2, 3])
    results = [dill.loads(dill.dumps(i)) for i in map_results]
    print(results[0])

Results in:

$ python3.10 div0err3.py 
division by zero

Original Traceback (most recent call last):
  File "/Users/mmckerns/div0err3.py", line 12, in modified_function
    except Exception as e:
ZeroDivisionError: division by zero

$ python3.9 div0err3.py 
division by zero

Original Traceback (most recent call last):
  File "/Users/mmckerns/div0err3.py", line 11, in modified_function
    return 1/x
ZeroDivisionError: division by zero

mmckerns commented 2 years ago

The error is still present with this modification... so it appears that the issue is actually in serializing a function:

if __name__ == '__main__':
    mapped_function = dill.loads(dill.dumps(modified_function))
    results = list(map(mapped_function, [0, 1, 2, 3]))
    print(results[0])

I'll create an issue in dill and refer back to here.
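
Since the offset appears once a function has been serialized and rebuilt, one way to test a serializer is to compare the line information of the original and the rebuilt function directly. The sketch below is an assumption-laden illustration, not dill's implementation: it uses the standard library's `marshal` as a stand-in for by-value serialization of the code object, and the helper names `line_starts` and `rebuild_by_value` are hypothetical.

```python
import dis
import marshal
import types


def modified_function(x):
    try:
        return 1 / x
    except Exception as exc:
        # Report the line the traceback attributes the error to.
        return exc.__traceback__.tb_lineno


def line_starts(func):
    """List of (bytecode offset, source line) pairs for func's code object."""
    return list(dis.findlinestarts(func.__code__))


def rebuild_by_value(func):
    """Serialize the code object itself and build a new function from it.

    marshal is only a stdlib stand-in here; to test dill, replace the body
    with: return dill.loads(dill.dumps(func))
    """
    code = marshal.loads(marshal.dumps(func.__code__))
    return types.FunctionType(code, func.__globals__, func.__name__)


if __name__ == "__main__":
    clone = rebuild_by_value(modified_function)
    print("line tables match:", line_starts(clone) == line_starts(modified_function))
    print("reported lines:", modified_function(0), clone(0))
```

If the line tables differ after the round trip, the serializer is changing the code object's line information, which would be consistent with the offset tracebacks shown above.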

mmckerns commented 2 years ago

Closing, as the related issue in dill is fixed.