src-d / modelforge

Python library to share machine learning models easily and reliably.
Apache License 2.0

Problems with pickling loaded model containing numpy.ndarray #74

Closed: irinakhismatullina closed this 5 years ago

irinakhismatullina commented 5 years ago

I faced problems with multiprocessing inside a loaded modelforge model. This happens in both lazy and non-lazy load modes.

Here is a code sample reproducing the error:

from multiprocessing import Pool
import pickle
import traceback
import numpy
from modelforge import Model

class NumpyArray(Model):
    NAME = "numpy_array"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.array = numpy.random.normal(size=(4, 4))

    def _generate_tree(self):
        # Serialize everything except the attributes inherited from the base Model
        tree = self.__dict__.copy()
        for key in vars(Model()):
            del tree[key]
        return tree

    def _load_tree(self, tree):
        self.__dict__.update(tree)

    def pickle_test(self, path: str):
        with open(path, "wb") as out:
            pickle.dump(self, out)

    def mult(self, coeff: float):
        return self.array * coeff

    def multithread_test(self):
        # pool.map pickles the bound method self.mult, and with it the whole instance
        coeffs = numpy.random.normal(size=16)
        with Pool(4) as pool:
            results = pool.map(self.mult, coeffs)
        return sum(results)

Test non-lazy mode:

arr_obj = NumpyArray()
arr_obj.save("numpy_array.asdf")

new_arr_obj = NumpyArray()
new_arr_obj.load("numpy_array.asdf", lazy=False)
new_arr_obj.pickle_test("array.pkl")

Here is the output:

TypeError Traceback (most recent call last)
<ipython-input-148-3f72419a5166> in <module>()
----> 1 new_arr_obj.pickle_test("array.pkl")

<ipython-input-142-d11f51c103b9> in pickle_test(self, path)
     24     def pickle_test(self, path: str):
     25         with open(path, "wb") as out:
---> 26             pickle.dump(self, out)
     27 
     28     def mult(self, coeff: float):

TypeError: cannot serialize '_io.BufferedReader' object

The same happens with the multiprocessing test:

new_arr_obj.multithread_test()

The output is:

TypeError Traceback (most recent call last)
<ipython-input-149-e6fc3a006712> in <module>()
      4 new_arr_obj = NumpyArray()
      5 new_arr_obj.load("numpy_array.asdf", lazy=False)
----> 6 new_arr_obj.multithread_test()

<ipython-input-142-d11f51c103b9> in multithread_test(self)
     32         coeffs = numpy.random.normal(size=16)
     33         with Pool(4) as pool:
---> 34             results = pool.map(self.mult, coeffs)
     35         return sum(results)
     36 

/usr/lib/python3.5/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    258         in a list that is returned.
    259         '''
--> 260         return self._map_async(func, iterable, mapstar, chunksize).get()
    261 
    262     def starmap(self, func, iterable, chunksize=None):

/usr/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
    606             return self._value
    607         else:
--> 608             raise self._value
    609 
    610     def _set(self, i, obj):

/usr/lib/python3.5/multiprocessing/pool.py in _handle_tasks(taskqueue, put, outqueue, pool, cache)
    383                         break
    384                     try:
--> 385                         put(task)
    386                     except Exception as e:
    387                         job, ind = task[:2]

/usr/lib/python3.5/multiprocessing/connection.py in send(self, obj)
    204         self._check_closed()
    205         self._check_writable()
--> 206         self._send_bytes(ForkingPickler.dumps(obj))
    207 
    208     def recv_bytes(self, maxlength=None):

/usr/lib/python3.5/multiprocessing/reduction.py in dumps(cls, obj, protocol)
     48     def dumps(cls, obj, protocol=None):
     49         buf = io.BytesIO()
---> 50         cls(buf, protocol).dump(obj)
     51         return buf.getbuffer()
     52 

TypeError: cannot serialize '_io.BufferedReader' object

The same happens in lazy mode. Calling these functions on the original class instance works fine.
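
For anyone debugging the same error, one generic way to locate the attribute that still holds the open file handle is to try pickling each entry of the instance's __dict__ separately. A plain-Python diagnostic sketch (not modelforge API; the helper name is made up):

import pickle

def find_unpicklable(obj):
    """Return the names of obj's attributes that fail to pickle."""
    bad = []
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception:
            bad.append(name)
    return bad

find_unpicklable(new_arr_obj) should then report whichever attribute still references the '_io.BufferedReader' of the open .asdf file.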

This fixes the problem locally (it can also be done inside _load_tree()):

new_arr_obj.array = numpy.array(new_arr_obj.array)
new_arr_obj.multithread_test()
new_arr_obj.pickle_test("array.pkl")

It passes, but it looks like non-lazy loading of numpy arrays is meant to work out of the box.
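
If it has to be fixed on the model side, a minimal sketch of that workaround inside _load_tree() could look like the following (illustrative only, assuming non-lazy mode where the tree values are already ndarrays; it copies each array into a fresh in-memory one so nothing references the open file):

    def _load_tree(self, tree):
        # Copy every ndarray out of the asdf-backed storage before storing it
        materialized = {
            key: numpy.array(value) if isinstance(value, numpy.ndarray) else value
            for key, value in tree.items()
        }
        self.__dict__.update(materialized)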

vmarkovtsev commented 5 years ago

On it.

vmarkovtsev commented 5 years ago

@irinakhismatullina What are your local versions of modelforge and asdf?

I cannot reproduce it.
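
For reference, a quick way to collect the versions in question (plain pkg_resources, nothing modelforge-specific):

import sys
import pkg_resources

print("python", sys.version)
for pkg in ("modelforge", "asdf", "numpy"):
    print(pkg, pkg_resources.get_distribution(pkg).version)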

irinakhismatullina commented 5 years ago

modelforge==0.9.3
asdf==2.3.1

vmarkovtsev commented 5 years ago

So @irinakhismatullina, I cannot reproduce it in Travis either: https://travis-ci.org/src-d/modelforge/builds/490553316

It looks like you will have to debug this yourself: reproduce a working environment (in Docker) and then compare the two behaviors.

vmarkovtsev commented 5 years ago

@irinakhismatullina Ping

irinakhismatullina commented 5 years ago

I haven't looked into it since then; in my model I used the workaround described above. Since you couldn't reproduce the bug, the issue can probably be closed.

vmarkovtsev commented 5 years ago

OK, py 3.5 support is dropped anyway.