mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

Add Hausa language support #75

Closed: rahulbot closed this issue 7 years ago

rahulbot commented 8 years ago

Research folks say we'll soon be adding Hausa sources. Can you please add this preliminary short list of Hausa stopwords: https://github.com/stopwords-iso/stopwords-ha/blob/master/raw/gh-stopwords-json-ha.txt
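A minimal sketch of how such a stopword list might be applied once it is loaded. The four stopwords and the sample phrase below are placeholders for illustration, not copied from the linked file, and the one-word-per-line format and the function names are assumptions rather than Media Cloud's actual code:

```python
import re

# Placeholder stopwords for illustration only; a real deployment would load
# the full list from the linked file (assumed here to be one word per line).
SAMPLE_STOPWORDS_TEXT = """
da
a
ne
ce
"""


def parse_stopwords(text):
    """Parse a one-word-per-line stopword list into a set of lowercase words."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def tokenize(sentence):
    """Lowercase a sentence and split it on runs of non-word characters."""
    # The \W class is Unicode-aware in Python 3, so hooked letters used in
    # boko orthography (for example ɗ and ƙ) stay inside their words.
    return [token for token in re.split(r"\W+", sentence.lower()) if token]


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


stopwords = parse_stopwords(SAMPLE_STOPWORDS_TEXT)
tokens = tokenize("Ruwa da abinci ne")
print(remove_stopwords(tokens, stopwords))  # prints ['ruwa', 'abinci']
```

Keeping the list in a set makes each lookup constant-time, which matters when filtering word counts over many stories.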

pypt commented 8 years ago

Only Latin, or Arabic script support too?

The article says: "Presently, the Romanized orthography, called boko, is used, since its introduction by the British at the beginning of the twentieth century (Newman 2000; Jaggar 2001)."

pypt commented 8 years ago

Also, does anyone have access to Springer? I'd like to read through "Stemming Hausa text: using affix-stripping rules and reference look-up" (2016). Skipping stemming altogether doesn't seem like an option, because the preface says that Hausa is highly inflectional.

Never mind, found the full article.

pypt commented 8 years ago

Emailed a couple of authors of papers that I've found on Hausa stemming; hopefully they'll reply with reference stemmer implementations.

hroberts commented 8 years ago

Terrific!

Here's the above article:

https://www.researchgate.net/profile/Norisma_Idris/publication/280776915_Stemming_Hausa_text_using_affix-stripping_rules_and_reference_look-up/links/565fef5b08ae4988a7befc7f.pdf

FYI, you can find the full text for most journal papers by searching for the quoted article name on scholar.google.com. I can also download stuff through Harvard if there is no public version, but the first thing I do is look on scholar.google.com because the Harvard interface is more difficult.

-hal


Hal Roberts Fellow Berkman Klein Center for Internet & Society Harvard University

pypt commented 8 years ago

Borrowing the idea from https://github.com/berkmancenter/mediacloud/issues/73#issuecomment-261213650, we might be able to use Hunspell to do some Hausa stemming. However, I have only found a Hunspell dictionary for the Ghana dialect of the Hausa language. It is part of this Firefox extension (https://addons.mozilla.org/en-US/firefox/addon/hausa-spelling-dictionary/) and doesn't even have an explicit license set (https://fedoraproject.org/wiki/OpenOffice.org/LinguisticComponents).

Do we have a native Hausa speaker around that we could consult about this?

Or maybe I should go without any kind of stemming and be done with it?

hroberts commented 8 years ago

You should email Fernando and Ethan and ask them about getting access to a Hausa speaker to help you with these questions.

Also, I think asking folks to add wildcards to their search queries is fine. We just need stemming for filtering the resulting word counts.

-hal
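A minimal sketch of what "stemming for filtering the resulting word counts" could look like: counts are grouped by stem, and each stem keeps a tally of its surface forms. The `toy_stem` function is a hypothetical stand-in that strips a trailing "s"; it is not a Hausa rule, and the function names are illustrative rather than Media Cloud's actual code.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    # Stand-in stemmer for demonstration only: strips a trailing "s".
    # This is NOT a Hausa rule; a real stemmer would be plugged in here.
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def count_by_stem(tokens, stem):
    """Return (totals, forms): a count per stem and, for each stem,
    a Counter of the surface forms that mapped to it."""
    totals = Counter()
    forms = defaultdict(Counter)
    for token in tokens:
        key = stem(token)
        totals[key] += 1
        forms[key][token] += 1
    return totals, forms


tokens = ["source", "sources", "story", "sources", "source"]
totals, forms = count_by_stem(tokens, toy_stem)
for key, total in totals.most_common():
    # Show each stem with its total and the surface forms behind it
    print(key, total, dict(forms[key]))
```

Passing the stemmer in as a callable keeps the counting logic language-independent, so a per-language stemmer can be swapped in without touching it.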


pypt commented 7 years ago

Andrew Bimba, one of the co-authors of Stemming Hausa text: using affix-stripping rules and reference look-up (2016), replied to me agreeing to help with the stemmer. Woo!

Emailed Fernando & Ethan too.

pypt commented 7 years ago

Woo, we now have a GPL 3-licensed stemmer by Andrew Bimba et al!

I'll ask him whether we can publish his stemmer on GitHub as an open-source module. It will take me some time to do some code cleanup too.

pypt commented 7 years ago

Cleaned up the code a little, covered it with unit tests (extracted from the very same code), and posted it to GitHub (will move the repo to berkmancenter/) and PyPI's testing site. Had lunch with Andrew too.

Planning to integrate the stemmer via Inline::Python (which now has memory leaks patched).

pypt commented 7 years ago

Done, deployed. Language code is ha.

pypt commented 7 years ago

(Note: if you want the test suite to pass on local dev machines, do pip install hausastemmer.)
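A hedged sketch of how code or tests might cope when the package is not installed. It assumes the package is importable as `hausastemmer` and exposes a callable named `stem`; both names are assumptions here, and `module_available` and `load_stemmer` are hypothetical helpers, not part of Media Cloud:

```python
import importlib
import importlib.util


def module_available(name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


def load_stemmer(module_name="hausastemmer", func_name="stem"):
    """Return the module's stemming function, or an identity fallback.

    The default attribute name "stem" is an assumption about the package.
    """
    if module_available(module_name):
        module = importlib.import_module(module_name)
        func = getattr(module, func_name, None)
        if callable(func):
            return func
    # Fallback: leave words unchanged when the stemmer is unavailable
    return lambda word: word


stem = load_stemmer()
print("hausastemmer installed:", module_available("hausastemmer"))
```

A test suite could use `module_available("hausastemmer")` to skip Hausa-specific tests with a clear message instead of failing on import.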