Instantiating RuleBasedBreakIterator from compiled rules or UDataMemory

robertBrnnn commented 5 years ago

Hi, I'm currently using PyICU to create RuleBasedBreakIterators from source rules. However, it has become quite slow now that there are a large number of rules: it takes roughly three minutes to instantiate an instance, whereas when there were fewer rules it took only a few seconds.

Is there a way to instantiate from compiled rules or UDataMemory as mentioned in http://icu-project.org/apiref/icu4c/classicu_1_1RuleBasedBreakIterator.html ?

I'd greatly appreciate any guidance. Thanks, Rob

ovalhub commented 5 years ago

Yes, in fact, there is. At line 1076, in iterators.cpp, you can see that there is an overload wrapper RuleBasedBreakIterator(path, name) that calls udata_open(path, NULL, name) and then constructs a RuleBasedBreakIterator with the returned UDataMemory * object.

Please, let me know if this works for you. If it doesn't, please send me a data file that I can use to test this with as I don't know how to produce one myself.

Thanks !

Andi..

robertBrnnn commented 5 years ago

Hi Andi, I wasn't sure how to create a UDataMemory myself either, but I eventually managed to get one built. PyICU returned an illegal argument error when I tried to load it via the overload wrapper you mention above.

I've created my own Cython version for libicu instead, which can load an instance from either binary rules or source rules. I wasn't able to get UDataMemory working in my Cython version either; I think it might be due to how libicu determines source file encoding.

If you want to provide an efficient way to load rules in PyICU, I'd recommend creating a wrapper for the binary rules constructor in libicu instead of the UDataMemory one; I found it much more straightforward.

I don't think it's necessary to keep this issue open anymore, so I'll close it. Thanks for your help! Rob.

ovalhub commented 5 years ago

I have now added support for this overload (which appeared in ICU 4.8). Please see the test_BreakIterator.testCreateInstanceFromBinaryRules() test for how to use it. The buffer you pass in is kept alive by the wrapper of the iterator instance you're creating, as the ICU object itself doesn't take ownership of it. I think the reason I didn't support this originally is that bytes and string args overlap in PyICU; I resolved the ambiguity by trying the bytes version first, then falling through to the text rules (bytes converted to unicode). It could also be because I didn't notice this overload after it appeared in ICU 4.8. Please try it out and let me know if it works for your use case.
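
For readers landing here later, a minimal sketch of how the binary rules overload might be used to avoid recompiling a large rule set on every start. It assumes a PyICU build that includes this change and that also wraps getBinaryRules() (an assumption, not confirmed in this thread). The helper names load_or_compile, compile_with_pyicu and make_iterator are hypothetical. The caching logic is kept separate from PyICU so it can be exercised on its own.

```python
import hashlib
import os
import tempfile


def load_or_compile(source_rules, cache_dir, compile_fn):
    """Return compiled rule bytes, compiling only when no cached copy exists.

    compile_fn is any callable that maps rule source text to bytes.
    """
    # Key the cache on the rule text so edited rules are recompiled.
    digest = hashlib.sha256(source_rules.encode("utf-8")).hexdigest()
    cache_path = os.path.join(cache_dir, digest + ".brk")
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()
    data = compile_fn(source_rules)
    # Write to a temporary file first so a crash cannot leave a partial cache.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp_path, cache_path)
    return data


def compile_with_pyicu(source_rules):
    # Assumption: the installed PyICU wraps getBinaryRules() and returns bytes.
    from icu import RuleBasedBreakIterator
    return RuleBasedBreakIterator(source_rules).getBinaryRules()


def make_iterator(source_rules, cache_dir):
    # Slow path (compiling from source) runs only on a cache miss.
    from icu import RuleBasedBreakIterator
    data = load_or_compile(source_rules, cache_dir, compile_with_pyicu)
    # Per the comment above, a bytes argument is tried as binary rules first,
    # and the wrapper keeps the buffer alive for the iterator's lifetime.
    return RuleBasedBreakIterator(data)
```

One caveat worth checking against the ICU documentation: compiled rules are generally tied to the ICU version that produced them, so a real cache would probably want the ICU version in the key as well.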

ovalhub commented 5 years ago

available in trunk

ovalhub / pyicu

Instantiating RuleBasedBreakIterator from compiled rules or UDataMemory #104