Arabic Stemmer improvement for Better Search Accuracy [LUCENE-8028]

mikemccand commented 7 years ago

Hi, this is Ayah, a bidi developer on the Globalization Team at IBM Egypt. We are responsible for Arabic support in IBM products and services, and since many of those services use Lucene, we found that its Arabic stemmer needs major improvement. We implemented the following two papers, https://dl.acm.org/citation.cfm?id=1921657 and http://waset.org/publications/10005688/arabic-light-stemmer-for-better-search-accuracy, to improve Lucene's Arabic stemmer, and would like to open a pull request so you can integrate it into Lucene.

Legacy Jira details

LUCENE-8028 by Ayah Shamandi on Oct 31 2017, updated Dec 06 2017

mikemccand commented 7 years ago

Hi, we should add it as an option! It is ok to have multiple stemmers (choices).

I think we should be conservative about changing the default: at least for the second paper (which isn't paywalled, so I could quickly look), this appears to incorporate a dictionary-based approach (domain-dependent, and typically performing less well on average than rule-based approaches because of out-of-vocabulary (OOV) words), and I don't yet see any standard IR experiments confirming the improvement.

[Legacy Jira: Robert Muir (@rmuir) on Oct 31 2017]
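
The out-of-vocabulary concern raised above can be illustrated with a minimal sketch. It is not Lucene code and not the algorithm from either paper; the class name `OovSketch`, both methods, and the one-entry lexicon in the usage note are hypothetical and chosen only to show the effect.

```java
import java.util.Map;

// Hypothetical illustration only; not Lucene's API and not code from the papers.
class OovSketch {
    // Dictionary-based: look the word up; a word missing from the lexicon
    // (out of vocabulary) comes back unchanged.
    static String dictionaryStem(Map<String, String> lexicon, String word) {
        return lexicon.getOrDefault(word, word);
    }

    // Rule-based: strip a leading definite article whether or not the word
    // is known, as long as at least two characters remain.
    static String ruleStem(String word) {
        String al = "\u0627\u0644"; // the Arabic definite article "al"
        if (word.startsWith(al) && word.length() - al.length() >= 2) {
            return word.substring(al.length());
        }
        return word;
    }
}
```

With a lexicon that only knows "al-kitab" (the book), both methods reduce that word to "kitab", but for "al-bayt" (the house) only `ruleStem` removes the article; `dictionaryStem` returns the word untouched, so queries for the two forms would not match.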

mikemccand commented 7 years ago

So you mean that I can start implementing it, right?

[Legacy Jira: Ayah Shamandi on Nov 08 2017]

mikemccand commented 6 years ago

GitHub user Ashamandi opened a pull request:

https://github.com/apache/lucene-solr/pull/274

[LUCENE-8028] Arabic stemmer enhancement 

Hello, this PR adds an enhancement to the Arabic stemmer: https://issues.apache.org/jira/browse/LUCENE-8028

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ACGC/lucene-solr master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucene-solr/pull/274.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #274

commit 9c5d4f2b5584d26627d04414d8048c56b899210e Author: Ashamandi <ashamand@eg.ibm.com> Date: 2017-11-15T09:19:16Z

Arabic Stemmer enhancement

commit 667f3f548692c6ea1b3918a5bc9393e25d7f0c39 Author: Ashamandi <ashamand@eg.ibm.com> Date: 2017-11-15T12:54:07Z

tune code style

[Legacy Jira: ASF GitHub Bot on Nov 15 2017]

mikemccand commented 6 years ago

Can we instead factor out this stemmer into its own stemmer file? I don't think we should mix together two stemmers in the same file with conditionals. See for example the German package (or many other languages) where there are multiple stemmers.

Also, let's avoid modifying the analyzer for now. The analyzer just represents defaults, and we shouldn't add conditional options to it. Instead, as a start, we should just add the new stemmer and make it easy for people to instantiate it, e.g. in CustomAnalyzer.

[Legacy Jira: Robert Muir (@rmuir) on Nov 15 2017]
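
As a rough sketch of the "one stemmer per class" idea suggested above, the following keeps each algorithm in its own class so callers choose one explicitly instead of toggling a flag inside a shared class. Every name here (`WordStemmer`, `PrefixOnlyStemmer`, `PrefixSuffixStemmer`) is hypothetical, the affix rules are deliberately tiny, and none of it is Lucene's API or the code from the pull request.

```java
// Hypothetical names throughout; not Lucene's API or the pull request's code.
interface WordStemmer {
    String stem(String word);
}

// First algorithm, in its own class: strips only the definite article prefix.
class PrefixOnlyStemmer implements WordStemmer {
    private static final String AL = "\u0627\u0644"; // definite article "al"

    @Override
    public String stem(String word) {
        // Keep at least two characters so very short words are left alone.
        if (word.startsWith(AL) && word.length() - AL.length() >= 2) {
            return word.substring(AL.length());
        }
        return word;
    }
}

// Second algorithm, also in its own class: prefix stripping plus one
// plural suffix. No conditionals are shared with the first class.
class PrefixSuffixStemmer implements WordStemmer {
    private static final String AT = "\u0627\u062A"; // plural suffix "at"
    private final WordStemmer prefixes = new PrefixOnlyStemmer();

    @Override
    public String stem(String word) {
        String s = prefixes.stem(word);
        if (s.endsWith(AT) && s.length() - AT.length() >= 2) {
            return s.substring(0, s.length() - AT.length());
        }
        return s;
    }
}
```

A caller would pick an implementation directly, for example `WordStemmer stemmer = new PrefixSuffixStemmer();`, which leaves default behavior unchanged for anyone who does not opt in.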

mikemccand commented 6 years ago

Also, could we avoid names such as "high accuracy"/"smart"/"fast" for the new stemmer? They will confuse users. If there is no better name, then perhaps name it after the authors of the algorithm instead.

Finally, is it possible to have some simple unit tests? We need these to be able to maintain the code going forward.

[Legacy Jira: Robert Muir (@rmuir) on Nov 15 2017]

mikemccand commented 6 years ago

I added the Arabic light stemmer in separate files as you suggested, added unit tests for it, and removed the analyzer changes.

[Legacy Jira: Ayah Shamandi on Nov 22 2017]

mikemccand commented 6 years ago

GitHub user Ashamandi commented on the issue:

https://github.com/apache/lucene-solr/pull/274

I have addressed your comments; please review. Thank you!

[Legacy Jira: ASF GitHub Bot on Nov 22 2017]

mikemccand commented 6 years ago

@rcmuir, do you have any comments, please?

[Legacy Jira: Ayah Shamandi on Nov 26 2017]

mikemccand commented 6 years ago

Hello, sorry for the slow response! (holiday times here).

I took a glance and it is shaping up well! I saw some funky formatting we may want to tackle; perhaps there were stray tabs, or the indentation was just off? And can you try running ant precommit from the top of your git checkout? It takes a few minutes, but it runs code analysis checks such as javadocs and the like, and it may find issues.

I will carve out some time in the next few days to take a deeper look. Thanks for putting in the hard work.

[Legacy Jira: Robert Muir (@rmuir) on Nov 26 2017]

mikemccand commented 6 years ago

@rcmuir, thank you for your help. I ran 'ant precommit' ~~it was really helpful~~, replaced all tabs with spaces, and added a new commit. Once you have time, please check it.

[Legacy Jira: Ayah Shamandi on Nov 27 2017]

mikemccand commented 6 years ago

[~rcmuir], have you had an opportunity to review?

[Legacy Jira: Ayah Shamandi on Dec 06 2017]
