snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License
757 stars 173 forks source link

Java performance improvements from Apache Lucene #195

Closed rmuir closed 7 months ago

rmuir commented 7 months ago

This is long overdue...

The changes mean that users will need java 7 at a minimum, but since java 7 version is long EOL, I don't think it will cause anyone grief.

Users can still use setCurrent(String) ... stem() ... getCurrent(), but this adds support for a higher-performance approach: setCurrent(char[], int), getCurrentBuffer(), stem() ... getCurrentBuffer()/getCurrentBufferLength()

This avoids many per-word object allocations:

When indexing documents, all these per-word allocations put too much pressure on garbage collection and bottleneck performance.

ojwb commented 7 months ago

Thanks, mostly this looks good but I've spotted some issues and noted them above (mostly minor - the incorrect implementation of len and size really needs addressing though).

It would also be helpful to update java/org/tartarus/snowball/TestApp.java to demonstrate the new best practice instead of creating a new String for each input word - could you do that? (I could try but I don't write much Java so the result would probably be suboptimal.)

rmuir commented 7 months ago

I updated the TestApp in edaa2fa02a05de3b1107afd0a3eceb90e8f9448b with some very timing data...

I will followup for your other comments one-by-one, thank you for looking into this.

ojwb commented 7 months ago

Thanks for the updates - I'm happy to merge now (and I can then tweak so methodObject only gets generated when required) unless you've anything else you're about to add?

rmuir commented 7 months ago

Yes if you could help with the conditional emit of methodObject that would be very helpful, I think it might be beyond my abilities at the moment. I can followup with separate PRs for any other things, but this one is the big one! Thank you again for your help.

ojwb commented 7 months ago

Yes if you could help with the conditional emit of methodObject that would be very helpful, I think it might be beyond my abilities at the moment.

Done in 34f3612e5e8c48975243bc2e87561abdac5aa9bb.

I can followup with separate PRs for any other things, but this one is the big one! Thank you again for your help.

It'd be good to aim to have Lucene able to use an unmodified version of Snowball - not just because then all Java users can benefit from improvements, but also it's caused confusion for multiple users over the years who've generated Java code for a new stemmer with upstream Snowball then tried to add it to their Lucene source tree.

ojwb commented 7 months ago

I will also document the new Java 7 requirement (I agree it's not problematic but we should state version requirements up front).