rahulbot closed this issue 7 years ago
Only Latin, or Arabic script support too?
The article says: "Presently, the Romanized orthography, called boko, is used, since its introduction by the British at the beginning of the twentieth century" (Newman 2000; Jaggar 2001).
Also, does anyone have access to Springer? I'd like to read through "Stemming Hausa text: using affix-stripping rules and reference look-up" (2016). Skipping stemming altogether doesn't seem like an option, because the preface says that Hausa is highly inflectional.
Never mind, found a full article.
Emailed a couple of authors from papers that I've found on Hausa stemming, hopefully they'll reply with reference stemmer implementations.
Terrific!
Here's the above article:
FYI, you can find the full text for most journal papers by searching for the quoted article name on scholar.google.com. I can also download stuff through Harvard if there is no public version, but the first thing I do is look on scholar.google.com because the Harvard interface is more difficult.
-hal
On Wed, Nov 16, 2016 at 7:53 AM, Linas Valiukas notifications@github.com wrote:
Emailed a couple of authors from papers that I've found on Hausa stemming, hopefully they'll reply with reference stemmer implementations.
Hal Roberts Fellow Berkman Klein Center for Internet & Society Harvard University
Borrowing the idea from https://github.com/berkmancenter/mediacloud/issues/73#issuecomment-261213650, we might be able to use Hunspell to do some Hausa stemming. However, I have only found a Hunspell dictionary for the Ghana dialect of the Hausa language. It is part of this Firefox extension (https://addons.mozilla.org/en-US/firefox/addon/hausa-spelling-dictionary/) and doesn't even have an explicit license set (https://fedoraproject.org/wiki/OpenOffice.org/LinguisticComponents).
Do we have a native Hausa speaker around that we could consult about this?
Or maybe I should go without any kind of stemming and be done with it?
You should email Fernando and Ethan and ask them about getting access to a Hausa speaker to help you with these questions.
Also, I think asking folks to add wildcards to their search queries is fine. We just need stemming for filtering the resulting word counts.
-hal
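A minimal sketch of what "stemming for filtering the resulting word counts" could look like: counts for different surface forms are merged under one stem. The names `toy_stem` and `count_by_stem` are hypothetical, and `toy_stem` is a stand-in rule for illustration only; it is not a Hausa stemming rule and not Media Cloud's actual code.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: strips one trailing 's'. NOT a Hausa rule."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def count_by_stem(words: Iterable[str], stem: Callable[[str], str]) -> Dict[str, int]:
    """Lowercase each word, stem it, and sum the counts per stem."""
    counts: Counter = Counter()
    for word in words:
        counts[stem(word.lower())] += 1
    return dict(counts)


# Surface forms that share a stem end up in one bucket.
counts = count_by_stem(["Report", "reports", "news", "report"], toy_stem)
```

Because the stemmer is passed in as a function, a real Hausa stemmer could be swapped in later without touching the counting logic.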
On Thu, Nov 17, 2016 at 4:59 AM, Linas Valiukas notifications@github.com wrote:
Borrowing the idea from #73 (comment) https://github.com/berkmancenter/mediacloud/issues/73#issuecomment-261213650, we might be able to use Hunspell to do some Hausa stemming. However, I have only found a Hunspell dictionary for the Ghana dialect of the Hausa language. It is a part of this Firefox extension https://addons.mozilla.org/en-US/firefox/addon/hausa-spelling-dictionary/ and doesn't even have an explicit license set to it https://fedoraproject.org/wiki/OpenOffice.org/LinguisticComponents.
Do we have a native Hausa speaker around that we could consult about this?
Or maybe I should go without any kind of stemming and be done with it?
Andrew Bimba, one of the co-authors of "Stemming Hausa text: using affix-stripping rules and reference look-up" (2016), replied to me agreeing to help with the stemmer. Woo!
Emailed to Fernando & Ethan too.
Woo, we now have a GPL 3-licensed stemmer by Andrew Bimba et al!
I'll ask him whether we can publish his stemmer on GitHub as an open-source module. It will take me some time to do some code cleanup too.
Cleaned up the code a little, covered it with unit tests (extracted from the very same code), and posted it to GitHub (will move the repo to berkmancenter/) and PyPI's testing site. Had lunch with Andrew too.
Planning to integrate the stemmer via Inline::Python (which now has memory leaks patched).
Done, deployed. Language code is ha.
(Note: if you want the test suite to pass on local dev machines, do pip install hausastemmer.)
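For anyone wiring the package into a local script, here is a hedged sketch. It assumes the installed `hausastemmer` module exposes a `stem(word)` function; check the package README before relying on that. The helpers `load_hausa_stemmer` and `stem_tokens` are hypothetical names, and the code falls back to leaving words unchanged when the package is missing.

```python
from typing import Callable, Iterable, List


def load_hausa_stemmer() -> Callable[[str], str]:
    """Return a stemming function, or an identity fallback if the package is absent."""
    try:
        import hausastemmer  # assumption: the module provides stem(word)
    except ImportError:
        return lambda word: word
    return hausastemmer.stem


def stem_tokens(tokens: Iterable[str], stem: Callable[[str], str]) -> List[str]:
    """Stem each token; keep the original token if the stemmer returns nothing."""
    return [stem(token) or token for token in tokens]


stemmer = load_hausa_stemmer()
```

The fallback keeps scripts usable on machines where the pip install has not been run, at the cost of silently skipping stemming there.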
Research folks say we'll soon be adding Hausa sources. Can you please add this preliminary short list of Hausa stopwords: https://github.com/stopwords-iso/stopwords-ha/blob/master/raw/gh-stopwords-json-ha.txt
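A rough sketch of how such a stopword list could be applied to word counts. It assumes the list is plain text with one word per line, which I have not verified for the linked file. The sample entries below are placeholders, not copied from that list, and `parse_stopwords` and `drop_stopwords` are hypothetical helper names.

```python
from typing import Dict, Iterable, Set


def parse_stopwords(lines: Iterable[str]) -> Set[str]:
    """Assumed format: one word per line; blank lines and '#' comments are skipped."""
    words = set()
    for line in lines:
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words


def drop_stopwords(counts: Dict[str, int], stopwords: Set[str]) -> Dict[str, int]:
    """Return a copy of the word counts without any stopword entries."""
    return {w: n for w, n in counts.items() if w.lower() not in stopwords}


# Placeholder entries for illustration only.
sample_lines = ["# placeholder entries", "da", "", "a "]
stopwords = parse_stopwords(sample_lines)
filtered = drop_stopwords({"da": 10, "labarai": 4, "a": 7}, stopwords)
```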