tmbdev / clstm

A small C++ implementation of LSTM networks, focused on OCR.
Apache License 2.0

New high-level Python API #94

Open jbaiter opened 8 years ago

jbaiter commented 8 years ago

Since the original Python bindings are not working anymore and are unlikely to be fixed/maintained in the future, I created new high-level bindings using Cython. The module is compatible with both Python 2 and 3 and can be installed by running pip install . in the root directory of the repository.

For both training and prediction, images loaded via PIL/Pillow can be used, as well as numpy arrays.
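Since numpy arrays are accepted directly, a minimal sketch of preparing a grayscale line image is shown below. The float32/[0, 1] convention here is an assumption for illustration, not taken from the bindings' source:

```python
import numpy as np

def to_float_line(img):
    """Convert an 8-bit grayscale line image (PIL image or uint8 array)
    to a float32 array in [0, 1].

    Illustration only: the exact dtype/range pyclstm expects is an
    assumption here. np.asarray also works on PIL images, since they
    expose the array interface.
    """
    arr = np.asarray(img, dtype=np.float32)
    return arr / 255.0

line = np.full((48, 200), 128, dtype=np.uint8)
floats = to_float_line(line)
```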

Currently only the OCR functionality is exposed, but I plan on adding a wrapper around ClstmText in the future.

The API documentation can be found at https://jbaiter.github.io/clstm.

An example of how the training and prediction API is used can be found in run_uw3_500.py. This script is very close to what the run-uw3-500 application does, only through Python, so it can be used to compare performance. In my tests I found that the performance of the Python and C++ versions is pretty much indistinguishable.

kba commented 8 years ago

Very nice, thanks for sharing!

I failed to get it to run in Debian Jessie with either Python2/3 but that is probably an include path problem. Cython either refused to import shared_ptr from libcpp.memory or segfaulted :|

It installed fine with Python3/Python2 in Arch Linux. Python 2 is working fine; for Python 3, run_uw3_500.py throws one of those pesky bytes/str errors:

Traceback (most recent call last):
  File "run_uw3_500.py", line 53, in <module>
    ocr.save("./model.clstm")
  File "pyclstm.pyx", line 112, in pyclstm.ClstmOcr.save (pyclstm.cpp:2464)
    cpdef save(self, str fname):
  File "pyclstm.pyx", line 118, in pyclstm.ClstmOcr.save (pyclstm.cpp:2346)
    cdef bint rv = self._ocr.maybe_save(fname)
  File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string (pyclstm.cpp:3585)
TypeError: expected bytes, str found
jbaiter commented 8 years ago

> I failed to get it to run in Debian Jessie with either Python2/3 but that is probably an include path problem. Cython either refused to import shared_ptr from libcpp.memory or segfaulted :|

Did you install Cython from the Jessie package? I just tried it with a pip installed Cython on Jessie and it works fine with both 2 and 3 (after fixing the bytes/str bug).

edit: Can confirm, this is due to Jessie shipping with Cython 0.21.1. Smart pointers like shared_ptr were only added in 0.23 :-/ I updated the requirements accordingly.
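A hypothetical guard one could add to setup.py to turn the confusing compile error into a clear message (names here are illustrative, not from the repository):

```python
# libcpp.memory.shared_ptr only exists in Cython >= 0.23, so fail
# early with a clear message instead of a cryptic compile error.
def version_tuple(version):
    """Parse '0.21.1' -> (0, 21, 1) for simple tuple comparison."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())

MIN_CYTHON = (0, 23)

def check_cython(version):
    if version_tuple(version) < MIN_CYTHON:
        raise RuntimeError(
            "Cython >= 0.23 is required for libcpp.memory.shared_ptr, "
            "found %s" % version)

check_cython("0.23.4")  # fine; check_cython("0.21.1") would raise
```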

kba commented 8 years ago

I tried the version shipped in Jessie stable first, then pip install, but it seemed to fall back to the Jessie bundled path at some point. As I said, I guess it's just a path issue.

I'll try the fixed bytes/str commit later.

kba commented 8 years ago

After removing cython3, it works with Python3 in Jessie, no more unicode/str/bytes related exceptions :tada: It's weird that /usr/lib cython takes precedence over /usr/local/lib or $HOME/.local/lib but apparently that's either an issue with Debian or my setup.

sudo pip2 install Cython; sudo pip2 install . works fine, but python2 run_uw3_500.py immediately segfaults.

jbaiter commented 8 years ago

Hm, that's weird :-) Can you make a core dump and check out the trace with gdb?

$ ulimit -c unlimited
$ python run_uw3_500.py
$ gdb $(which python) core
# Then enter `bt` to get the backtrace
jbaiter commented 8 years ago

The segfault was due to a compatibility problem with older versions of Pillow: Jessie ships 2.6.1, while I used 3.4.2 for development. Pillow 2.9.0 added the width and height attributes, which I used to differentiate between Pillow.Image and numpy.ndarray in the image loading logic. Since images loaded with 2.6.1 had neither attribute, they were interpreted as numpy arrays. Funnily enough, all the interfaces on Pillow.Image I accessed during image loading were also present on numpy.ndarray, but returned different things, which led to segfaults pretty deep into the stack.
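A more robust check avoids version-specific attributes entirely. One sketch is inspecting the module of the object's class (the actual fix in pyclstm may differ):

```python
import numpy as np

def is_pil_image(obj):
    """Distinguish PIL images from numpy arrays without duck-typing on
    version-specific attributes like `width`/`height` (added in
    Pillow 2.9.0).

    Sketch only: checking the class's module works for any Pillow
    version, since all image classes live under the PIL package.
    """
    return type(obj).__module__.startswith("PIL.")
```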

mittagessen commented 8 years ago

Awesome. Are you already working on exposing the lower-level INetwork interface? If not, I'll put something together, as I'm currently working on a new training subcommand for kraken and the old swig bindings are not complete enough for that purpose.

jbaiter commented 8 years ago

Nope, I played around with it for a while, but gave up on it pretty quickly. My main aim was to make accessing the high-level OCR stuff from clstmhl.h available from Python, which is what >90% of all clstm users are currently using (via the CLI). I don't know if it's really worth the effort, since there are already a number of really good ML libraries with LSTM support available for Python. Why do you need access to the lower-level APIs?

mittagessen commented 8 years ago

My main need is access to the output matrix for running a slightly modified label extractor that produces bounding boxes: the label locations are just the points of maximum value in the softmax layer within a thresholded region. Explicit codec access is also rather useful.
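The extraction described above can be sketched as follows. This is an illustrative reconstruction, assuming a (timesteps x classes) softmax output with the CTC blank in column 0, not kraken's actual code:

```python
import numpy as np

def label_positions(outputs, threshold=0.5):
    """Find label positions in a (timesteps x classes) softmax output.

    Within each run of timesteps where some non-blank class exceeds
    `threshold`, take the timestep with the maximum activation as the
    label location. Column 0 is assumed to be the CTC blank.
    """
    non_blank = outputs[:, 1:].max(axis=1)
    active = non_blank > threshold
    positions = []
    start = None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            region = non_blank[start:t]
            positions.append(start + int(region.argmax()))
            start = None
    if start is not None:        # region still open at the last timestep
        region = non_blank[start:]
        positions.append(start + int(region.argmax()))
    return positions
```

Turning each thresholded region's start/end timesteps into pixel coordinates would then give the bounding boxes.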

I'd quite like to switch to a more widely used ML library, but I haven't found one yet that avoids incredibly annoying serialization (pickles, pickles everywhere, though that is somewhat easy to fix) and, more importantly, has reasonably performant model instantiation. With CLSTM I'm able to instantiate/deserialize models instantaneously, while tensorflow and theano always run compilation (and, by default, optimization) steps which take at least a minute even on a modern machine. As far as I know this is also rather inherent in their design, so there's no way around it.

amitdo commented 8 years ago

@mittagessen what about this one: https://github.com/baidu-research/warp-ctc ?

amitdo commented 8 years ago

warp-ctc used with LSTM https://github.com/dmlc/mxnet/tree/master/example/warpctc

mittagessen commented 8 years ago

I had a short look at mxnet as it seemed promising and I prefer its interface to theano's; initialization still takes quite a bit of time and warp-ctc is prone to crashes (so no drop-in replacement), although I'll probably work more with it for the layout analysis thingy once I get around to it.

mittagessen commented 8 years ago

Sorry for spamming, but there's one major reason for using the lower-level interface. By preloading the entire training set into memory and doing all the normalization, encoding, etc. once, I've just now decreased training time by ~2/3. While I'm fairly sure the main reason is just having everything in main memory, rerunning the codec and line normalization over and over again seems needlessly wasteful.
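The preloading idea can be sketched like this. `normalize` and `encode` are stand-ins for clstm's line normalization and codec encoding, not real pyclstm calls:

```python
import numpy as np

def normalize(line):
    """Placeholder for clstm's line normalization."""
    return np.asarray(line, dtype=np.float32) / 255.0

def encode(text):
    """Placeholder for the codec's text-to-label encoding."""
    return [ord(c) for c in text]

def preload(dataset):
    """Run the expensive per-line preprocessing exactly once, so that
    training epochs iterate over cached (image, labels) pairs instead
    of re-normalizing and re-encoding every pass."""
    return [(normalize(img), encode(txt)) for img, txt in dataset]

dataset = [(np.full((4, 8), 255, dtype=np.uint8), "ab")]
cache = preload(dataset)
# Training loops then reuse `cache` for every epoch.
```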

jbaiter commented 8 years ago

That's a really good point. I'll see what I can do about exposing the lower-level interfaces :-)

wanghaisheng commented 8 years ago

@mittagessen

> By preloading the entire training set into memory and doing all the normalization, encoding, etc. once, I've just now decreased training time by ~2/3.

How? Can I do that through your kraken training API?

mittagessen commented 8 years ago

@wanghaisheng: You can't really, as the old swig interface is broken, so it isn't currently possible to instantiate a network. What is working (since yesterday night) is continuing to train a model, using the separate_derivs branch and some minor bug fixes to the swig interface. Wait a few days until we've sorted out some of the parallel development.

jbaiter commented 8 years ago

@mittagessen I've started work on exposing the INetwork interface, but am now stuck on creating wrappers around the Eigen tensor types (Eigen::Tensor<T, N>, Eigen::TensorMap<T>). It would be great if we could create an adapter so we can instantiate those types from numpy arrays (and vice versa) without having to copy the data. There's eigency, which claims to offer just that, but it's only for the regular Eigen types, not the (still officially unsupported) tensor types used by clstm :-/ Any ideas?

amitdo commented 8 years ago

@jbaiter What about basing your cython binding on the older matrix based code?

mittagessen commented 8 years ago

The eigency code for eigen->numpy is just:

cimport cython
cimport numpy as np
import numpy as np
from numpy.lib.stride_tricks import as_strided

@cython.boundscheck(False)
cdef np.ndarray[float, ndim=2] ndarray_float_C(float *data, long rows, long cols, long row_stride, long col_stride):
    # Wrap the raw float* in a typed memoryview, then re-stride it into
    # the requested layout without copying the underlying data.
    cdef float[:,:] mem_view = <float[:rows,:cols]>data
    dtype = 'float'
    cdef int itemsize = np.dtype(dtype).itemsize
    return as_strided(np.asarray(mem_view, dtype=dtype, order="C"), strides=[row_stride*itemsize, col_stride*itemsize])

for a bazillion combinations of orders and data types, and while I haven't looked at the memory layout of a tensor object, it should work for second-order tensors without adaptation (ugly, but workable for now).

The other way around is in eigency_cpp.h and will probably work for 2nd order tensors, too. For higher orders I'd have to take a look at how strides are implemented in both ndarray and eigen tensors.
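The numpy side of this zero-copy wrapping can be demonstrated in pure Python. This sketch re-strides a flat float32 buffer into a 2-D view, analogous to what the eigency-style Cython code does with a raw float* from an Eigen object (strides here are in elements, converted to bytes via itemsize):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Reinterpret a flat float32 buffer as a 2-D array via explicit strides.
flat = np.arange(6, dtype=np.float32)
itemsize = flat.itemsize
rows, cols = 2, 3
row_stride, col_stride = 3, 1  # a column-major buffer would swap these
view = as_strided(flat, shape=(rows, cols),
                  strides=(row_stride * itemsize, col_stride * itemsize))
view[0, 0] = 42.0  # writes through to the underlying buffer: no copy
```

For higher-order tensors the same idea applies, with one stride per dimension.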

kba commented 6 years ago

I've merged this with current master in cython-2017 branch, so as not to interfere with any changes you may not have pushed.