Closed rmuir closed 7 months ago
Thanks, mostly this looks good but I've spotted some issues and noted them above (mostly minor - the incorrect implementation of len
and size
really needs addressing though).
It would also be helpful to update java/org/tartarus/snowball/TestApp.java
to demonstrate the new best practice instead of creating a new String
for each input word - could you do that? (I could try but I don't write much Java so the result would probably be suboptimal.)
I updated the TestApp in edaa2fa02a05de3b1107afd0a3eceb90e8f9448b with some very timing data...
I will followup for your other comments one-by-one, thank you for looking into this.
Thanks for the updates - I'm happy to merge now (and I can then tweak so methodObject
only gets generated when required) unless you've anything else you're about to add?
Yes if you could help with the conditional emit of methodObject
that would be very helpful, I think it might be beyond my abilities at the moment. I can followup with separate PRs for any other things, but this one is the big one! Thank you again for your help.
Yes if you could help with the conditional emit of
methodObject
that would be very helpful, I think it might be beyond my abilities at the moment.
Done in 34f3612e5e8c48975243bc2e87561abdac5aa9bb.
I can followup with separate PRs for any other things, but this one is the big one! Thank you again for your help.
It'd be good to aim to have Lucene able to use an unmodified version of Snowball - not just because then all Java users can benefit from improvements, but also it's caused confusion for multiple users over the years who've generated Java code for a new stemmer with upstream Snowball then tried to add it to their Lucene source tree.
I will also document the new Java 7 requirement (I agree it's not problematic but we should state version requirements up front).
This is long overdue...
char[]
to reduce allocations and overhead.MethodHandle.invokeExact()
instead of reflection.The changes mean that users will need java 7 at a minimum, but since java 7 version is long EOL, I don't think it will cause anyone grief.
Users can still use
setCurrent(String) ... stem() ... getCurrent()
, but this adds support for a higher-performance approach:setCurrent(char[], int), getCurrentBuffer(), stem() ... getCurrentBuffer()/getCurrentBufferLength()
This avoids many per-word object allocations:
String
for every input word.byte[]/char[]
of thatString
they were forced to create.char[]
for that StringBuilder.String
when they retrieve the result.When indexing documents, all these per-word allocations put too much pressure on garbage collection and bottleneck performance.