rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Update to PyO3 0.7 #56

Closed rth closed 5 years ago

rth commented 5 years ago

This updates to the latest PyO3, which allows using lifetimes in pymethods. As a result, tokenization in Python is a bit faster because it avoids string copies.
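To illustrate the general idea (not the code in this PR), here is a minimal pure-Rust sketch with no PyO3 involved. The whitespace tokenizer and the names `tokenize_owned` and `tokenize_borrowed` are hypothetical, chosen only to contrast copying each token into a `String` with returning slices tied to the input by a lifetime.

```rust
/// Owned variant: each token is copied into a newly allocated `String`.
fn tokenize_owned(text: &str) -> Vec<String> {
    text.split_whitespace().map(|tok| tok.to_string()).collect()
}

/// Borrowed variant: each token is a slice into `text`, so nothing is copied.
/// The lifetime `'a` ties the tokens to the input they point into.
fn tokenize_borrowed<'a>(text: &'a str) -> Vec<&'a str> {
    text.split_whitespace().collect()
}

fn main() {
    let doc = String::from("simple NLP in Rust");
    let owned = tokenize_owned(&doc);
    let borrowed = tokenize_borrowed(&doc);

    // Both variants produce the same tokens.
    assert_eq!(owned, borrowed);

    // The borrowed tokens point into `doc` itself rather than into new buffers.
    let start = doc.as_ptr() as usize;
    let end = start + doc.len();
    for tok in &borrowed {
        let p = tok.as_ptr() as usize;
        assert!(p >= start && p < end);
    }
    println!("{:?}", borrowed);
}
```

The borrowed variant only compiles when the input outlives the returned tokens, which is the guarantee that lifetimes in pymethods are meant to carry across the binding layer.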

On master,

python3.7 benchmarks/bench_tokenizers.py
 Tokenizing 19924 documents
         Python re.findall(r'\b\w\w+\b', ...): 2.93s [31.0 MB/s, 2450 kWPS]
                RegexpTokenizer(r'\b\w\w+\b'): 1.96s [46.5 MB/s, 3671 kWPS]
   UnicodeSegmentTokenizer(word_bounds=False): 2.97s [30.7 MB/s, 2269 kWPS]
    UnicodeSegmentTokenizer(word_bounds=True): 3.58s [25.4 MB/s, 3182 kWPS]
                         VTextTokenizer('en'): 4.11s [22.1 MB/s, 2467 kWPS]
                        CharacterTokenizer(4): 7.73s [11.8 MB/s, 5927 kWPS]

after this PR,

# Tokenizing 19924 documents
         Python re.findall(r'\b\w\w+\b', ...): 2.92s [31.2 MB/s, 2460 kWPS]
                RegexpTokenizer(r'\b\w\w+\b'): 1.40s [64.8 MB/s, 5119 kWPS]
   UnicodeSegmentTokenizer(word_bounds=False): 2.48s [36.8 MB/s, 2721 kWPS]
    UnicodeSegmentTokenizer(word_bounds=True): 2.65s [34.3 MB/s, 4292 kWPS]
                         VTextTokenizer('en'): 3.32s [27.4 MB/s, 3053 kWPS]
                        CharacterTokenizer(4): 4.47s [20.4 MB/s, 10252 kWPS]
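
For a quick read of the two runs, the wall-clock speedup per tokenizer can be computed from the timings above. This is a small sketch: the `speedup` helper is a hypothetical name, and the numbers are copied by hand from the benchmark output.

```rust
/// Wall-clock speedup: how many times faster the "after" run is.
fn speedup(before_s: f64, after_s: f64) -> f64 {
    before_s / after_s
}

fn main() {
    // (label, seconds on master, seconds after this PR), copied from the output above.
    let timings = [
        ("Python re.findall(r'\\b\\w\\w+\\b', ...)", 2.93, 2.92),
        ("RegexpTokenizer(r'\\b\\w\\w+\\b')", 1.96, 1.40),
        ("UnicodeSegmentTokenizer(word_bounds=False)", 2.97, 2.48),
        ("UnicodeSegmentTokenizer(word_bounds=True)", 3.58, 2.65),
        ("VTextTokenizer('en')", 4.11, 3.32),
        ("CharacterTokenizer(4)", 7.73, 4.47),
    ];

    for (name, before, after) in timings.iter() {
        println!("{:>45}: {:.2}x", name, speedup(*before, *after));
    }
}
```

The `re.findall` baseline is essentially unchanged between the two runs, while `RegexpTokenizer` goes from 1.96s to 1.40s, a factor of 1.4.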

rth commented 5 years ago

Hmm, no, actually: creating a PyList from Vec<&str> works but segfaults on Windows (probably due to the use of unsafe in PyO3 and the fact that the lifetimes are not right). Reverting the change to tokenizers, unfortunately, though it should be possible to optimize this further.

Edit: or rather it seems to be a regression in pyo3 as vectorization tests segfault.

rth commented 5 years ago

Managed to reproduce the error on Windows. It's unrelated to tokenizers,

```
tests/test_vectorize.py::test_count_vectorizer thread '' panicked at 'An error occurred while initializing class SliceBox', C:\Users\Administrator\.cargo\registry\src\github.com-1ecc6299db9ec823\pyo3-0.7.0\src\type_object.rs:260:17
stack backtrace:
   0: std::sys::windows::backtrace::set_frames
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\sys\windows\backtrace\mod.rs:94
   1: std::sys::windows::backtrace::unwind_backtrace
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\sys\windows\backtrace\mod.rs:81
   2: std::sys_common::backtrace::_print
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\sys_common\backtrace.rs:70
   3: std::sys_common::backtrace::print
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\sys_common\backtrace.rs:58
   4: std::panicking::default_hook::{{closure}}
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\panicking.rs:200
   5: std::panicking::default_hook
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\panicking.rs:215
   6: std::panicking::rust_panic_with_hook
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\panicking.rs:478
   7: std::panicking::continue_panic_fmt
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\panicking.rs:385
   8: std::panicking::begin_panic_fmt
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\panicking.rs:340
   9: ::init_type::{{closure}}
  10: >::new
  11: >::from_boxed_slice
  12: , D> as numpy::convert::IntoPyArray>::into_pyarray
  13: ::init_type::{{closure}}
  14: ::init_type::{{closure}}
  15: PyMethodDef_RawFastCallKeywords
  16: PyMethodDef_RawFastCallKeywords
  17: PyEval_EvalFrameDefault
  18: PyMethodDef_RawFastCallKeywords
  19: PyEval_EvalFrameDefault
  20: PyFunction_FastCallDict
  21: PySlice_New
  22: PyEval_EvalFrameDefault
  23: PyEval_EvalCodeWithName
  24: PyFunction_FastCallDict
  25: PySlice_New
  26: PyEval_EvalFrameDefault
  27: PyEval_EvalCodeWithName
  28: PyMethodDef_RawFastCallKeywords
  29: PyEval_EvalFrameDefault
  30: PyEval_EvalCodeWithName
  31: PyMethodDef_RawFastCallKeywords
  32: PyEval_EvalFrameDefault
  33: PyMethodDef_RawFastCallKeywords
  34: PyEval_EvalFrameDefault
  35: PyEval_EvalCodeWithName
  36: PyFunction_FastCallDict
  37: PyObject_Call_Prepend
  38: PyType_FromSpecWithBases
  39: PyObject_FastCallKeywords
  40: PyMethodDef_RawFastCallKeywords
  41: PyEval_EvalFrameDefault
  42: PyMethodDef_RawFastCallKeywords
  43: PyEval_EvalFrameDefault
  44: PyFunction_FastCallDict
  45: PySlice_New
  46: PyEval_EvalFrameDefault
  47: PyEval_EvalCodeWithName
  48: PyMethodDef_RawFastCallKeywords
  49: PyEval_EvalFrameDefault
  50: PyEval_EvalCodeWithName
  51: PyMethodDef_RawFastCallKeywords
  52: PyEval_EvalFrameDefault
  53: PyMethodDef_RawFastCallKeywords
  54: PyEval_EvalFrameDefault
  55: PyEval_EvalCodeWithName
  56: PyFunction_FastCallDict
  57: PyObject_Call_Prepend
  58: PyType_FromSpecWithBases
  59: PySlice_New
  60: PyEval_EvalFrameDefault
  61: PyEval_EvalCodeWithName
  62: PyMethodDef_RawFastCallKeywords
  63: PyEval_EvalFrameDefault
  64: PyEval_EvalCodeWithName
  65: PyMethodDef_RawFastCallKeywords
  66: PyEval_EvalFrameDefault
  67: PyEval_EvalCodeWithName
  68: PyFunction_FastCallDict
  69: PySlice_New
  70: PyEval_EvalFrameDefault
  71: PyEval_EvalCodeWithName
  72: PyMethodDef_RawFastCallKeywords
  73: PyEval_EvalFrameDefault
  74: PyEval_EvalCodeWithName
  75: PyMethodDef_RawFastCallKeywords
  76: PyEval_EvalFrameDefault
  77: PyFunction_FastCallDict
  78: PySlice_New
  79: PyEval_EvalFrameDefault
  80: PyEval_EvalCodeWithName
  81: PyMethodDef_RawFastCallKeywords
  82: PyEval_EvalFrameDefault
  83: PyEval_EvalCodeWithName
  84: PyMethodDef_RawFastCallKeywords
  85: PyEval_EvalFrameDefault
  86: PyMethodDef_RawFastCallKeywords
  87: PyEval_EvalFrameDefault
  88: PyEval_EvalCodeWithName
  89: PyFunction_FastCallDict
  90: PyObject_Call_Prepend
  91: PyType_FromSpecWithBases
  92: PyObject_FastCallKeywords
  93: PyMethodDef_RawFastCallKeywords
  94: PyEval_EvalFrameDefault
  95: PyFunction_FastCallDict
  96: PySlice_New
  97: PyEval_EvalFrameDefault
  98: PyEval_EvalCodeWithName
  99: PyMethodDef_RawFastCallKeywords
Windows fatal exception: code 0xc000001d
```

and only happens when building a wheel (as opposed to installing in development mode).

rth commented 5 years ago

Using the latest Rust nightly (nightly-2019-02-28 was used before) appears to resolve the previous rust-numpy error. Merging.