njsmith / pysrilm

An extremely simple Python wrapper for the SRI Language Modeling toolkit
BSD 2-Clause "Simplified" License

LM object is not serializeable #6

Closed junkmechanic closed 9 years ago

junkmechanic commented 9 years ago

Hi. First of all, thanks for your work.

This is easy to reproduce. Try to unpickle a pickled LM object or try to deepcopy it. You will notice this common pattern in the error trace:

  File "srilm.pyx", line 102, in srilm.LM.__cinit__ (srilm.cpp:1378)
TypeError: __cinit__() takes at least 1 positional argument (0 given)

For example, the above error occurs during the deserialization of the LM object after it has been passed through a multiprocessing.Connection object. Currently, I work around the issue by initializing the object on the receiving end.
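The same failure mode can be imitated without SRILM installed. This is only a sketch under stated assumptions: `FakeLM` is a hypothetical pure-Python stand-in, not pysrilm code. Its `__new__` requires a path, much like `LM.__cinit__` requires one positional argument. `copy.deepcopy` (and unpickling, which uses the same reduce machinery) tries to recreate the object without passing that argument:

```python
import copy


class FakeLM:
    """Hypothetical stand-in for srilm.LM: creating an instance requires a path."""

    def __new__(cls, path):
        self = super().__new__(cls)
        self.path = path
        return self


def try_deepcopy(obj):
    """Return the TypeError raised by copy.deepcopy, or None if the copy succeeds."""
    try:
        copy.deepcopy(obj)
    except TypeError as exc:
        return exc
    return None


# deepcopy rebuilds the object by calling FakeLM.__new__(FakeLM) with no path.
error = try_deepcopy(FakeLM("model.arpa"))
print(type(error).__name__, error)
```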

Just wanted to point this out in case you had overlooked it.

Cheers!

njsmith commented 9 years ago

Supporting pickle for LM objects would require taking your in-memory data structure, and then writing it out into a new ARPA format file in memory as a giant string, and then multiprocessing would copy that multi-gigabyte string into a new process, where it would then have to be loaded again by SRILM. I'm pretty sure this is not what anyone actually wants :-).

If you'd like to submit a PR so that pickle fails earlier (when pickling instead of when unpickling), I'll merge it.
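One way such a change could look, sketched in plain Python rather than in the actual Cython source: define `__reduce__` so that it raises immediately. `FakeLM` and the error message are hypothetical. In pysrilm the method would presumably live on the `LM` class in `srilm.pyx`, but that is an assumption about how a PR might be written, not something taken from the repository.

```python
import pickle


class FakeLM:
    """Hypothetical stand-in for the LM wrapper; not pysrilm's actual code."""

    def __init__(self, path):
        self.path = path

    def __reduce__(self):
        # pickle and copy call this while serializing, so the failure shows up
        # at pickling time instead of later, during unpickling.
        raise TypeError(
            "LM objects cannot be pickled; load the model file in each process instead"
        )


def pickling_error(obj):
    """Return the TypeError raised by pickle.dumps, or None if pickling succeeds."""
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        return exc
    return None


print(pickling_error(FakeLM("model.arpa")))
```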

For your actual problem: just opening the file in each process is the right general solution.

If you want to access it from multiple processes at once, and don't want to have multiple copies of it in memory, and are on OS X or Linux, then there is also a trick you can use: before you spawn your workers, load up the LM in the parent process, and save it in a global variable. Because on Unix multiprocessing by default spawns workers using 'fork', when they start up they'll find the global variable is still there. And due to the copy-on-write tricks the kernel uses to implement this, each process's copy of the LM is stored in the same physical memory until you write to it. But LM objects are read-only, so you will never write to it, and in effect all the processes end up sharing one copy. This lets you load 10 copies of a 10 gigabyte model on a machine with 12 gigabytes of RAM.
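A minimal sketch of the fork trick described above, assuming a Unix system. `FakeLM`, its `score` method, and `score_in_workers` are hypothetical names used so the example runs without SRILM; with pysrilm the parent would construct `srilm.LM(path)` instead. The fork start method is requested explicitly because newer Python versions do not always default to it (macOS has defaulted to spawn since Python 3.8).

```python
import multiprocessing as mp


class FakeLM:
    """Hypothetical stand-in for srilm.LM so the sketch runs without SRILM."""

    def __init__(self, path):
        self.path = path

    def score(self, word):
        # Placeholder scoring; a real model would return a log-probability.
        return -float(len(word))


_LM = None  # filled in by the parent before any worker is forked


def _worker(conn, words):
    # _LM was inherited through fork(); nothing was pickled or reloaded.
    conn.send([_LM.score(w) for w in words])
    conn.close()


def score_in_workers(path, batches):
    """Load the model once in the parent, then fork one worker per batch."""
    global _LM
    _LM = FakeLM(path)  # with pysrilm this would be srilm.LM(path)
    ctx = mp.get_context("fork")  # fork is not available on Windows
    jobs = []
    for words in batches:
        recv_conn, send_conn = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=_worker, args=(send_conn, words))
        proc.start()
        send_conn.close()  # the parent keeps only the receiving end
        jobs.append((proc, recv_conn))
    results = []
    for proc, recv_conn in jobs:
        results.append(recv_conn.recv())
        proc.join()
    return results


if __name__ == "__main__":
    print(score_in_workers("model.arpa", [["a", "bb"], ["ccc"]]))
```

Only the word lists and the resulting floats cross process boundaries here; the model object itself is never serialized, which is why the pickling limitation does not get in the way.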

junkmechanic commented 9 years ago

Oh. I see your point. Closing issue.