jbaiter opened 8 years ago
Very nice, thanks for sharing!
I failed to get it to run on Debian Jessie with either Python 2 or 3, but that is probably an include path problem. Cython either refused to import `shared_ptr` from `libcpp.memory` or segfaulted :|
It installed fine with Python 3/Python 2 on Arch Linux. Python 2 is working fine; for Python 3, `run_uw3_500.py` throws one of those pesky bytes/str errors:
```
Traceback (most recent call last):
  File "run_uw3_500.py", line 53, in <module>
    ocr.save("./model.clstm")
  File "pyclstm.pyx", line 112, in pyclstm.ClstmOcr.save (pyclstm.cpp:2464)
    cpdef save(self, str fname):
  File "pyclstm.pyx", line 118, in pyclstm.ClstmOcr.save (pyclstm.cpp:2346)
    cdef bint rv = self._ocr.maybe_save(fname)
  File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string (pyclstm.cpp:3585)
TypeError: expected bytes, str found
```
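The error comes from passing a Python 3 `str` where Cython's automatic `std::string` conversion expects `bytes`. A minimal sketch of the kind of fix needed (the helper name is hypothetical, not part of pyclstm):

```python
def ensure_bytes(fname, encoding="utf-8"):
    """Return `fname` as bytes so it can convert to a C++ std::string.

    On Python 3, paths are usually `str` and must be encoded first;
    `bytes` input is passed through unchanged.
    """
    if isinstance(fname, bytes):
        return fname
    return fname.encode(encoding)
```

Inside `pyclstm.pyx`, `save` would then pass `ensure_bytes(fname)` on to `maybe_save` (or declare the argument as `bytes` and encode at the call site).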
> I failed to get it to run on Debian Jessie with either Python 2/3 but that is probably an include path problem. Cython either refused to import `shared_ptr` from `libcpp.memory` or segfaulted :|
Did you install Cython from the Jessie package? I just tried it with a pip-installed Cython on Jessie and it works fine with both 2 and 3 (after fixing the bytes/str bug).
edit: Can confirm, this is due to Jessie shipping with Cython 0.21.1. Smart pointers like `shared_ptr` were only added in 0.23 :-/ I updated the requirements accordingly.
I tried the version shipped in Jessie stable first, then a pip install, but it seemed to fall back to the Jessie-bundled path at some point. As I said, I guess it's just a path issue.
I'll try the fixed bytes/str commit later.
After removing `cython3`, it works with Python 3 on Jessie, no more unicode/str/bytes-related exceptions :tada: It's weird that the `/usr/lib` Cython takes precedence over `/usr/local/lib` or `$HOME/.local/lib`, but apparently that's either an issue with Debian or my setup.
`sudo pip2 install Cython; sudo pip2 install .` works fine, but `python2 run_uw3_500.py` immediately segfaults.
Hm, that's weird :-) Can you make a core dump and check out the trace with gdb?
```sh
$ ulimit -c unlimited
$ python run_uw3_500.py
$ gdb $(which python) core
# Then enter `bt` to get the backtrace
```
The segfault was due to a compatibility problem with older versions of Pillow: Jessie uses 2.6.1, while I used 3.4.2 for development. 2.9.0 added the `width` and `height` attributes, which I used to differentiate between `Pillow.Image` and `numpy.ndarray` in the image loading logic. Since images loaded with 2.6.1 had neither of these attributes, they were interpreted as numpy arrays. Funnily, all the interfaces on `Pillow.Image` that I accessed during image loading were also present on `numpy.ndarray`, but returned different things, which led to segfaults pretty deep into the stack.
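A version-robust way to make that distinction is to test for `numpy.ndarray` directly instead of probing for attributes that only newer Pillow releases have. A minimal sketch (the helper name is hypothetical, not pyclstm's actual code):

```python
import numpy as np

def to_float_array(img):
    """Convert an input line image to a float32 array scaled to [0, 1].

    Testing for ndarray first works on every numpy/Pillow combination,
    whereas probing for `width`/`height` (only added in Pillow 2.9)
    silently misclassifies images under older Pillow releases.
    """
    if isinstance(img, np.ndarray):
        arr = img.astype(np.float32)
    else:
        # Anything else is assumed to be a PIL-style image;
        # `size` is (width, height) in every Pillow version.
        w, h = img.size
        arr = np.asarray(img, dtype=np.float32).reshape(h, w)
    return arr / arr.max() if arr.max() > 0 else arr
```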
Awesome. Are you already working on interfacing the lower-level `INetwork` interface? If not, I'll put something together, as I'm currently working on a new training subcommand for kraken and the old SWIG bindings are not complete enough for that purpose.
Nope, I played around with it for a while, but gave up on it pretty quickly. My main aim was to make the high-level OCR stuff from `clstmhl.h` accessible from Python, which is what >90% of all clstm users are currently using (via the CLI). I don't know if it's really worth the effort, since there are already a number of really good ML libraries with LSTM support available for Python.
Why do you need access to the lower-level APIs?
My main need is access to the output matrix for running a slightly modified label extractor that produces bounding boxes, as the label locations are just the position of the maximum value in the softmax layer within a thresholded region. Explicit codec access is also rather useful.
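The extraction described above can be sketched roughly like this, assuming an output matrix of shape `(timesteps, n_classes)` with class 0 as the blank (the function name and threshold are illustrative, not clstm's or kraken's API):

```python
import numpy as np

def label_positions(outputs, threshold=0.5):
    """Extract (label, position) pairs from a CTC-style output matrix.

    `outputs` has shape (timesteps, n_classes); class 0 is taken to be
    the blank. A label's position is the timestep of the maximum
    activation inside each contiguous above-threshold region.
    """
    best = outputs.max(axis=1)       # peak activation per timestep
    labels = outputs.argmax(axis=1)  # winning class per timestep
    active = (best > threshold) & (labels != 0)

    results = []
    start = None
    # The trailing False forces the last open region to be closed.
    for t, on in enumerate(np.append(active, False)):
        if on and start is None:
            start = t
        elif not on and start is not None:
            seg = slice(start, t)
            peak = start + int(best[seg].argmax())
            results.append((int(labels[peak]), peak))
            start = None
    return results
```

Turning each region's start/end timesteps into pixel coordinates then only needs the network's temporal subsampling factor.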
I'd quite like to switch to a more widely used ML library, but I haven't found one yet that doesn't use incredibly annoying serialization (pickles, pickles everywhere, though that's somewhat easy to fix) and, more importantly, has reasonably performant model instantiation. With CLSTM I'm able to instantiate/deserialize models instantaneously, while TensorFlow and Theano always run compilation (and, by default, optimization) steps which take at least a minute even on a modern machine. As far as I know this is also rather inherent in their design, so there's no way around it.
@mittagessen what about this one: https://github.com/baidu-research/warp-ctc ?
warp-ctc used with LSTM https://github.com/dmlc/mxnet/tree/master/example/warpctc
I had a short look at mxnet as it seemed promising, and I prefer its interface to Theano's; initialization still takes quite a bit of time, and warp-ctc is prone to crashes (so it's no drop-in replacement), although I'll probably work more with it for the layout analysis thingy once I get around to it.
Sorry for spamming, but there's one major reason for using the lower-level interface: by preloading the entire training set into memory and doing all the normalization, encoding, etc. once, I've just now decreased training time by ~2/3. While I'm fairly sure the main reason is just having everything in main memory, rerunning the codec and line normalization over and over again seems needlessly wasteful.
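The idea is simply to run the expensive per-line steps once and reuse the results across epochs; a hypothetical sketch (none of these names correspond to kraken's or clstm's actual API):

```python
import numpy as np

def preload(samples, normalize, encode):
    """Run line normalization and codec encoding exactly once per sample.

    `samples` is an iterable of (image, text) pairs; `normalize` and
    `encode` stand in for the line normalizer and the codec. The
    returned list can be shuffled and iterated over for many epochs
    without repeating the preprocessing.
    """
    return [(normalize(img), encode(txt)) for img, txt in samples]

def train_epochs(cached, train_step, epochs=10, rng=None):
    """Feed the cached pairs to `train_step` in a fresh random order
    each epoch; no normalization or encoding happens here anymore."""
    rng = rng or np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(len(cached)):
            line, target = cached[i]
            train_step(line, target)
```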
That's a really good point. I'll see what I can do about exposing the lower-level interfaces :-)
@mittagessen
> By preloading the entire training set into memory and doing all the normalization, encoding, etc. once, I've just now decreased training time by ~2/3.

How? Can I achieve that through your kraken training API?
@wanghaisheng: You can't really, as the old SWIG interface is broken, so it isn't quite possible to instantiate a network. What is working (since yesterday night) is continuing to train a model with the separate_derivs branch and some minor bug fixes to the SWIG interface. Wait a few days until we've sorted out some of the parallel development.
@mittagessen I've started work on exposing the `INetwork` interface, but am now stuck on creating wrappers around the Eigen tensor types (`Eigen::Tensor<T, N>`, `Eigen::TensorMap<T>`). It would be great if we could create an adapter so we can instantiate those types from numpy arrays (and vice versa) without having to copy the data. There's eigency, which claims to offer just that, but it's only for the regular Eigen types, not the (still officially unsupported) tensor types used by clstm :-/ Any ideas?
@jbaiter What about basing your Cython binding on the older matrix-based code?
The eigency code for eigen->numpy is just:

```cython
# Imports needed to make the snippet self-contained:
cimport cython
cimport numpy as np
import numpy as np
from numpy.lib.stride_tricks import as_strided

@cython.boundscheck(False)
cdef np.ndarray[float, ndim=2] ndarray_float_C(float *data, long rows, long cols,
                                               long row_stride, long col_stride):
    cdef float[:, :] mem_view = <float[:rows, :cols]>data
    dtype = 'float'
    cdef int itemsize = np.dtype(dtype).itemsize
    return as_strided(np.asarray(mem_view, dtype=dtype, order="C"),
                      strides=[row_stride * itemsize, col_stride * itemsize])
```
for a bazillion combinations of orders and data types, and while I haven't looked at the memory layout of a tensor object, it should work for 2nd-order tensors without adaptation (ugly, but workable for now).
The other way around is in `eigency_cpp.h` and will probably work for 2nd-order tensors, too. For higher orders I'd have to take a look at how strides are implemented in both ndarray and Eigen tensors.
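On the strides question: Eigen tensors default to column-major storage while numpy arrays default to row-major, so a zero-copy view mostly comes down to choosing the right strides. A small numpy illustration of the same trick used above (the helper is illustrative, not eigency code):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def view_colmajor(flat, rows, cols):
    """Interpret a flat buffer as a column-major (Eigen-default) matrix
    without copying the data, purely by swapping the strides."""
    itemsize = flat.itemsize
    # Column-major: moving down a column is contiguous (stride=itemsize),
    # moving across a row jumps a whole column (stride=rows*itemsize).
    return as_strided(flat, shape=(rows, cols),
                      strides=(itemsize, rows * itemsize))
```

Generalizing to higher orders is the same computation: the stride for dimension k is the product of the sizes of all faster-varying dimensions, times the item size.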
I've merged this with current master in the `cython-2017` branch, so as not to interfere with any changes you may not have pushed.
Since the original Python bindings are not working anymore and are unlikely to be fixed/maintained in the future, I created new high-level bindings using Cython. The module is compatible with both Python 2 and 3 and can be installed by running `pip install .` in the root directory of the repository. For both training and prediction, images loaded via PIL/Pillow can be used, as well as numpy arrays.
Currently only the OCR functionality is exposed, but I plan on adding a wrapper around `ClstmText` in the future. The API documentation can be found at https://jbaiter.github.io/clstm.
An example of how the training and prediction API is used can be found in `run_uw3_500.py`. This script is very close to what the `run-uw3-500` application does, only through Python, so it can be used to compare performance. In my tests I found that the performance of the Python and C++ versions is pretty much indistinguishable.