segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

Performance of Rust crate #30

Closed by s3bk 3 years ago

s3bk commented 3 years ago

I am getting fairly poor performance in release mode on the CPU: 2 kB/s.

Is there a guide on using the GPU?

bminixhofer commented 3 years ago

Hi, which bindings to nnsplit are you using? Unfortunately, the Rust bindings can be quite slow because batching in tract does not work if the length dimension is dynamic too (there can only be one dynamic dimension). The Python bindings should be very fast, though, given a sufficiently large batch size.
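To make the shape problem concrete: a batch of byte-encoded texts forms a 2-D input of shape (batch, length), and both numbers change from call to call. Below is a minimal sketch in plain Rust. It makes no tract or nnsplit calls; `batch_shape` is a hypothetical helper, and padding every text to the longest one in the batch is an assumption made for illustration.

```rust
/// Shape (batch, length) that a batch of texts would occupy as a 2-D input
/// if every text were padded to the longest one in the batch.
/// Hypothetical helper for illustration only, not part of nnsplit or tract.
fn batch_shape(texts: &[&str]) -> (usize, usize) {
    // Length in bytes of the longest text; 0 for an empty batch.
    let longest = texts.iter().map(|t| t.len()).max().unwrap_or(0);
    (texts.len(), longest)
}
```

Calling it on two different batches generally gives two different shapes in both positions, which is the "more than one dynamic dimension" situation described above.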

s3bk commented 3 years ago

Hello, I am indeed using it from Rust. What would be needed to get a fixed length dimension in Rust?

bminixhofer commented 3 years ago

Having a fixed length is not possible, or at least not feasible. It would mean specifying ahead of time that you are, e.g., always using 100 bytes as input, so you would have to pad input that is shorter and truncate input that is longer. Padding shorter texts is possible (but slows things down too), whereas truncating longer texts loses information.
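The trade-off can be sketched in a few lines of plain Rust. This is only an illustration of padding and truncating to a fixed byte length, not how nnsplit prepares its input; `to_fixed_len` and the choice of padding byte are hypothetical.

```rust
/// Force `input` to exactly `len` bytes.
/// Returns the fixed-size buffer and the number of input bytes that were cut off.
/// Hypothetical helper for illustration only, not part of nnsplit.
fn to_fixed_len(input: &[u8], len: usize, pad: u8) -> (Vec<u8>, usize) {
    // Keep at most `len` bytes of the input.
    let kept = input.len().min(len);
    let mut out = Vec::with_capacity(len);
    out.extend_from_slice(&input[..kept]);
    // Shorter input: fill the remainder with the padding byte (wasted work for the model).
    out.resize(len, pad);
    // Longer input: everything past `len` is dropped (lost information).
    (out, input.len() - kept)
}
```

With `len = 5`, the input `b"Hi."` gains two padding bytes and loses nothing, while `b"Hello world."` is cut to `b"Hello"` and loses 7 bytes.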

The proper fix is for tract to be able to optimize models across more than one dimension, which is tracked here: https://github.com/sonos/tract/issues/313

The easiest way to make things faster right now is to use the Python bindings (they are written in Rust too, so you might be able to make https://github.com/bminixhofer/nnsplit/tree/master/bindings/python/src usable from your Rust code). Alternatively, you can use the srx crate, which generates lower-quality splits but is much, much faster.

s3bk commented 3 years ago

Thanks for pointing me at srx. Much, much faster is indeed what I am looking for. Maybe I can do something about https://github.com/sonos/tract/issues/313 … after reading into it.

bminixhofer commented 3 years ago

That would be great! FYI, https://github.com/languagetool-org/languagetool/blob/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx is the most useful SRX file for sentence segmentation that I know of.

I'll leave this open for now; performance of the Rust backend is indeed an issue (and the reason I made srx ;) ).

bminixhofer commented 3 years ago

There's been significant progress in tract and batched input is working now: https://github.com/sonos/tract/issues/383

Once it is released, I'll update it in nnsplit. It should lead to a >20x speedup of the Rust and JavaScript bindings for batched input.

s3bk commented 3 years ago

Nice, I need to check it out.

bminixhofer commented 3 years ago

I finally got around to updating this. The Rust bindings should be at least 10x faster as of v0.5.8 for sufficiently large input.