mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

Hebrew support? #80

Closed rahulbot closed 6 years ago

rahulbot commented 7 years ago

What kind of support do we have right now for Hebrew sources?

I've been having a conversation on Twitter with folks about stemming and stop-wording Hebrew. @habeanf suggested his package (in Go), but that seems overly complex for us to interface with :-( Also, it says in big gold letters "DO NOT USE FOR PRODUCTION".

habeanf commented 7 years ago

For many languages the stemming and stop-wording concepts do not carry over well from English. What you're looking for is morphological analysis and disambiguation (perhaps lemmatization as well).

FYI, yap (the parser in Go) can also be applied to ~40 languages, but its bread and butter is morphologically rich languages like Hebrew.

It is indeed not ready for production yet, but for your purposes it might be relevant. We can provide Python bindings to the relevant Go functions, but I suspect programming language integration isn't a real issue. We can always configure it to run on your server as a process that you can call out to.

hroberts commented 7 years ago

We have no stemming or stopwords for Hebrew. Would love to add them, but off the top of my head, I don't think our limited use of Hebrew is worth a process sitting around using 4GB of memory. Any way you can chop that way down?

-hal

-- Hal Roberts Fellow Berkman Klein Center for Internet & Society Harvard University

habeanf commented 7 years ago

Research is ongoing to replace the current 4GB model (a structured perceptron with engineered features) with a deep neural network, but that could take a few months. In the meantime, yap could be more relevant as an external process/service running on another server. 4GB of memory isn't that hard to come by these days, although it is a lot for just one language. Could mediacloud be set up to query a given outside service? That way, if someone like Mushon or me wanted to run a local Israeli copy, we could manage this service ourselves.
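As a rough illustration of what "querying an outside service" could look like from the Media Cloud side, here is a minimal Python sketch. Everything specific in it is an assumption made for the example: the URL, the JSON request and response shapes, and the function names are hypothetical and do not describe yap's or Media Cloud's actual interfaces.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint: neither the URL nor the JSON shape comes from yap or Media Cloud.
DEFAULT_SERVICE_URL = "http://localhost:8000/lemmatize"


def build_request(text, service_url=DEFAULT_SERVICE_URL):
    """Build a JSON POST request that carries the raw sentence text."""
    body = json.dumps({"text": text}).encode("utf-8")
    return Request(
        service_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def lemmatize(text, service_url=DEFAULT_SERVICE_URL, send=None, timeout=10.0):
    """Send text to the external service and return its list of lemmas.

    `send` takes a Request and returns the response body as bytes. It defaults
    to a real HTTP call, but can be swapped for a stub in tests.
    """
    if send is None:
        def send(req):
            with urlopen(req, timeout=timeout) as resp:
                return resp.read()

    payload = json.loads(send(build_request(text, service_url)).decode("utf-8"))
    # Assumed response shape: {"lemmas": ["...", "..."]}
    return list(payload["lemmas"])
```

Keeping the service URL configurable is what would let a third party host the memory-heavy model on their own server, while the main installation only needs the address.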

Meanwhile, if you intend to apply mediacloud.org to languages with morphology, I strongly suggest consulting with a linguist about stemming and stopwords. These are problematic concepts that work well for English and a handful of other languages but will be a source of problems down the line for more complex languages. I think lemmas are a more applicable term than stems, but that's just semantics. For stopwords, a simple list in Hebrew (and many other languages) will not suffice.
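A small Python sketch of why a plain stopword list falls short for Hebrew: function words such as "and" (ו) are written as prefixes attached to the next word, so the same stopword shows up under many surface forms. The three-word list and the helper name are made up for illustration and are not a real Hebrew stopword resource.

```python
# Tiny illustrative stopword list (hypothetical): "of", "on", and the object marker.
HEBREW_STOPWORDS = {"של", "על", "את"}


def naive_stopword_filter(tokens):
    """Drop tokens that exactly match an entry in the stopword list."""
    return [t for t in tokens if t not in HEBREW_STOPWORDS]


# "ושל" is "and" (ו) attached to "of" (של); "הבית" is "the" (ה) attached to "house" (בית).
tokens = ["ושל", "של", "הבית"]
filtered = naive_stopword_filter(tokens)
# The bare form "של" is removed, but the prefixed form "ושל" slips through.
```

Blindly stripping a leading ו is not a fix either, because some words begin with that letter as part of the word itself (for example ורד, "rose"), which is why morphological analysis and disambiguation are needed.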

rahulbot commented 6 years ago

Closing because we don't have resources to implement this well for now. If a specific research project or collaborator comes up that requires Hebrew support then we'll revisit this.