mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

Add Japanese language support #113

Closed. pypt closed this issue 7 years ago

pypt commented 7 years ago

Needed:

Coordinate with Mika Kanaya & Rahul.

pypt commented 7 years ago

It looks like the go-to solution for Japanese processing is kuromoji, used by Lucene and other projects.

kuromoji is in Java, so it would take 2-3 days to hook it up to the Python code (via a light REST service, PyJNIus, JPype, etc.). Also add some overhead for having to work with a language that I can't read.

@hroberts, should I proceed?

pypt commented 7 years ago

Sent email to Mika, waiting for confirmation.

pypt commented 7 years ago

Mika's email:

=Tokenization, stemming, tagging -- MeCab
For sentence tokenization and other functions, MeCab seems to do the job fast, and it works with Python. It also comes with a default system dictionary called mecab-ipadic.
mecab: https://pypi.python.org/pypi/mecab-python/0.996
mecab-ipadic: mecab-ipadic-2.7.0-20070801.tar.gz (http://taku910.github.io/mecab/#download)

However, we also need to download an additional dictionary called mecab-ipadic-neologd, a customized system dictionary for MeCab. It includes many neologisms (new words) extracted from various language resources on the Web, and it contains proper nouns, which the default dictionary does not. This dictionary is also updated regularly. When we analyze web pages, it's better to use this system dictionary together with the default one (ipadic). https://github.com/neologd/mecab-ipadic-neologd

=Dictionary updates
mecab-ipadic-neologd is updated twice a week, every Monday and Thursday Japan time, so it would be ideal to update our copy of the dictionary on Tuesdays and Fridays US time. The links below include a Dockerfile from someone who has already built such an update method: http://qiita.com/matsulib/items/5249b5f3e832f0311806 (git: https://github.com/matsulib/mecab-service)

=Character code
Please note that MeCab's default encoding is EUC-JP; it needs to be changed to UTF-8, and the same goes for the dictionaries.

=Stop words
The morphological analyzer above tags each word, and all we need to do is eliminate the tags we do not need. I cannot write the Python code, but I can give you the ID numbers of the tags we do need. The words seem to be categorized into 68 tags, and we only want these: 36, 38, 40, 41, 42, 43, 44, 45, 46, 47. I will do more research on how they should be defined, i.e. whether to write them as Japanese words or as ID numbers.
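For reference, here's a minimal sketch of what that POS-based filtering could look like from Python (assuming the mecab-python bindings and a UTF-8 ipadic build; the posid whitelist is just the set of IDs from the email above and still needs to be verified against ipadic's pos-id.def):

```python
import MeCab

# IDs from the email above; to be verified against ipadic's pos-id.def
KEEP_POS_IDS = {36, 38, 40, 41, 42, 43, 44, 45, 46, 47}

def japanese_tokens(text):
    # Add '-d /path/to/mecab-ipadic-neologd' here to also use the neologism dictionary
    tagger = MeCab.Tagger()
    tagger.parse('')  # work around an old mecab-python reference-counting quirk
    node = tagger.parseToNode(text)
    tokens = []
    while node:
        if node.posid in KEEP_POS_IDS:
            tokens.append(node.surface)
        node = node.next
    return tokens

print(japanese_tokens('イギリスのEU離脱が正式に決定した。'))
```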

pypt commented 7 years ago

Mecab's memory requirements:

(screenshot: the memory-requirements table from the mecab-ipadic-neologd README)

(https://github.com/neologd/mecab-ipadic-neologd#memory-requirements)

I wonder how much memory it will use up in production.

pypt commented 7 years ago

https://github.com/berkmancenter/mediacloud/tree/japanese_support

pypt commented 7 years ago

Basic guesswork-based implementation done, testing and integrating now.

pypt commented 7 years ago

Wanted to finish up initial Japanese support Real Quick (c) and then go do the auth API (#131); apparently there was more work than I had expected, but now it's more or less ready.

@hroberts, heads-up: I had to remove language from wc/list and change the ::Solr::WordCounts implementation to work with Japanese and other languages in which 1) single-character words exist, and 2) words are not split by whitespace. See the 4789999 commit message for details. As noted in the commit message, the performance of count_stems() remains the same; we'll just need to add more English stopwords to reduce the noise in the word cloud.
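Rough sketch of the idea (the actual code is Perl in ::Solr::WordCounts and the ::Language modules; the class and method names below are only illustrative): count_stems() now defers sentence-to-word splitting and stopwording entirely to the language module instead of splitting on whitespace and dropping short words itself.

```python
from collections import Counter

class EnglishLanguage:
    # Illustrative stopword set; the real lists live next to the language modules
    stop_words = {'the', 'a', 'of', 'is', 'mr'}

    def split_sentence_to_words(self, sentence):
        return [w.lower() for w in sentence.split()]

class JapaneseLanguage:
    stop_words = {'これ', 'それ', 'の'}

    def split_sentence_to_words(self, sentence):
        # Would call a MeCab-based tokenizer instead of splitting on whitespace
        raise NotImplementedError

def count_stems(sentences, lang):
    """Count words across sentences, deferring all tokenization to the language module."""
    counts = Counter()
    for sentence in sentences:
        for word in lang.split_sentence_to_words(sentence):
            if word and word not in lang.stop_words:
                counts[word] += 1
    return counts

print(count_stems(['Obama met the Prime Minister', 'The minister spoke'], EnglishLanguage()))
```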

Sample wc/list?q=sentence:離脱 response for Japanese: https://gist.github.com/pypt/42d83f4885f80b5201ff1d3999d749fc (sample dataset with asahi.com as a sole media source)

Sample wc/list?q=sentence:obama response for English: https://gist.github.com/pypt/71ba9793f62a25d2007ac65376b64509 (sample "Obama controversy" dataset that I have)

pypt commented 7 years ago

Naturally, there are still some ways to improve (e.g. some strange full-width percentage signs and periods are treated as tokens), but I would like to show it to Mika first.

hroberts commented 7 years ago

Great to see this progress. It will be exciting to see if we can get useful results in an ideographic language.

We can't allow words shorter than 3 characters for non-ideographic languages. That will cause all kinds of artifacts like we see in the obama results above, as well as surfacing pure noise in many cases. If we try to stopword our way out of the problem, we'll end up gradually adding all two-character combinations.

We need to figure out a way in the code to identify the ideographic languages and make an exception to the 1-character rule for them. That will probably require keeping all of the short words initially and then filtering them out of any sentences that are not detected as an ideographic language. This is not perfect, since we misidentify the language of a fair number of individual sentences, but it will cause fewer problems than adding 1- and 2-character words back to the other languages.
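Something like this rough sketch is what I have in mind (illustrative only, not actual Media Cloud code): treat a sentence as ideographic if most of its characters fall into the kana/CJK Unicode blocks, and only keep the short words for those sentences.

```python
import re

# Hiragana, katakana, and the common CJK ideograph blocks
CJK_RE = re.compile(r'[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]')

def looks_ideographic(sentence, threshold=0.5):
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return False
    cjk = sum(1 for c in chars if CJK_RE.match(c))
    return cjk / len(chars) >= threshold

def filter_short_words(words, sentence, min_length=3):
    # Keep short words only for sentences detected as ideographic
    if looks_ideographic(sentence):
        return words
    return [w for w in words if len(w) >= min_length]
```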

A design consideration for the Japanese language work should be that the vast majority of our research is done in non-ideographic languages, so a hard rule should be that nothing in the Japanese language support decreases the quality of our results for other languages.


pypt commented 7 years ago

What if the user wants to research a topic in which UN (United Nations), AA (Alcoholics Anonymous) or some other abbreviation is prominent? Will we continue to secretly hide those abbreviations from them?

We actually have all those one-, two-, and three-character stopwords in the "tiny" and "short" stopword lists. wc/list removes only the "long" stopwords, which is why you see artifacts like "can't" and "mr" in the output. Merging those stopword lists will definitely help with, if not solve, the problem.

pypt commented 7 years ago

Sorry, I might have come off too blunt in my last comment.

My point was that the current implementation (trimming very short words) is usually called a "naive" approach in the literature (which I read five years ago while doing the GSoC project for Media Cloud), as it cuts out too many terms, some of which you would want to keep (e.g. abbreviations like UN, US, USA, etc.). We simply need to merge the three stopword lists that we already have into one, and maybe add a few more words to the list. Sure, the word cloud will change afterwards, but for the better, as it will then contain terms that we have been missing (e.g. "USA"). For example, someone searching for "Israel" right now won't even see "UN" ("Human Rights Council") in the word cloud.
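The merge itself is trivial; a toy sketch (the file names here are made up, the real lists sit with the per-language code):

```python
def load_stopwords(path):
    with open(path, encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip() and not line.startswith('#')}

# Hypothetical file names for the three existing lists
merged = set()
for name in ('stopwords_en_tiny.txt', 'stopwords_en_short.txt', 'stopwords_en_long.txt'):
    merged |= load_stopwords(name)

with open('stopwords_en_merged.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sorted(merged)) + '\n')
```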

Also, the point of the language modules under ::Language is to enclose all the language-dependent code in one place. is_ideographic() and other exceptions scattered through the code would be a very unclean approach (the one I got rid of while creating those language modules; I recall seeing is_russian(), is_cyrillic() and similar helpers). Lucene manages to do this for all of the exotic languages it supports, so there's no reason for us not to keep doing it too.

@rahulbot the-upcoming-CTO, what do you think?

rahulbot commented 7 years ago

I say we need to design for the most common case here, which is non-ideographic languages. Having 1- and 2-letter words in English results makes them far less useful. As @hroberts says, the priority is to make sure the English word count results don't get worse.

The current behavior, while not ideal, has not been a huge problem. Sure, we miss "UN" in word count results when searching for something like the current Secretary-General ("António Guterres"), but you get lots of other words suggesting that role ("minister", "spokesman", "ambassador", "humanitarian").

The quick fix to make sure the current "naive" behavior is maintained is to simply filter out words of 2 characters or fewer if the language is non-ideographic. Queue up a bug to solve it the "right" way you propose and we'll come back to it.

pypt commented 7 years ago

I have already solved it the right way (except for merging the three stopword text files into one); you just flatly refuse to look at the new word cloud before making your judgment.

hroberts commented 7 years ago

Hi Linas,

My experience is that the vast majority of the 1- and 2-letter results we get are noise -- either something we should stopword, or, commonly, just some artifact of the dirty data coming in or the way we process that data. I just scanned through the first 20 one- or two-letter results in the obama count above, and 18 of the 20 are noise of some sort. The exceptions are 'va' and 'al', which I presume are state abbreviations.

The 'obama' search is a best case for this problem. Our results have a higher chance of being noisy the more specific (in sources and in keywords) they get, so some keywords / sources will be much worse than this. It is important that our results are reasonable for the largest range of possible queries, even at the cost of losing some signal. This is the philosophy we apply in using the big stopword list for English -- there are lots of cases in which we would like to see some of those stopwords, but in general the giant stopword list gives us reasonable results for a very wide range of possible queries.

The nature of Media Cloud, in that it collects open, unstructured data and then uses imperfect NLP to make sense of that data, is that we constantly have to balance signal vs. noise. Often that involves losing some signal to make sure we don't get drowned in the noise. Unless a given signal is really critical for some point, any time we can get rid of a set of data that is 90% noise, it's a win.

I'm not sure I understand your reluctance to treat ideographic languages differently from alphabetic ones. In a general sense, it seems cleanest to me to deal with the structural differences in the languages rather than pretending they are the same. Just from a code management point of view, it seems vastly cleaner to me to add a few extra if statements (in this case just one!) instead of thousands of lines to various stopword files.

-hal


pypt commented 7 years ago

I, for my part, don't understand why you are so attached to the number 3 that you magic-numbered into ::Solr::WordCounts in multiple places some time in 2014.

I'm aware that NLP is an imperfect art, but there's no need to make it even less perfect by blindly assuming that in all languages (Latin and non-Latin) words are split by \s and that the words you want are longer than ARBITRARY_NUMBER of characters, or by breaking the modular language support with exceptions upon exceptions straight in the calling code.

I did what I (and most other software packages in the world, including but not limited to Solr) consider a sensible thing (and what I was suggesting from the start):

  1. Merged "tiny", "short" and "long" stopword lists (we had three for whatever historical reason) into a single list. We have been using only the "long" stopword list, letting terms from "tiny" and "short" stopword lists go through into the word cloud.
  2. Added around 5% more stopwords from external sources after 2 mins of googling.

Here's the word list for wc/list?q=sentence:obama now:

https://gist.github.com/pypt/c9aa037a2d38b1d9bb2d9d3f561b650d

Terms of three characters or fewer (none of them would show up in the word cloud if terms with <= 3 characters continued to be removed):

  • va (18) - Virginia
  • gop (11) - Republicans
  • xl (10) - Keystone
  • tax (10) - tax cuts?
  • fix (8) - Trump's favourite word
  • fox (6) - Fox News
  • nsa (6) - world's best email backup
  • cnn (5) - CNN
  • kay (5) - Michael Kay?
  • bin (4)
  • nbc (4) - NBC
  • ii (4)
  • ban (3)
  • san (3)
  • bob (3)
  • los (3)
  • tim (3)
  • tea (3)
  • 1st (2)

So, more signal than noise, no?

To summarize, if we continued to skip terms of <= 3 character length:

  1. We would continue missing out on Virginia, GOP, Keystone XL, NSA, Fox News, CNN, Trump's tax cuts (both tax and cut are only three characters, so there's no good way to infer the subject matter from other words) and other short terms.
  2. We would break separation of concerns and language code encapsulation by introducing (and encouraging further) helpers like is_ideographic() (from which we have recovered years ago by moving all the language code into ::Language).

On the other hand, if we went for having a stopword list:

  1. We would do things exactly how the big boys around us (Solr) do it.
  2. Language code remains under ::Language, split into separate classes (to avoid having trees of duplicate conditionals scattered around code, a big code smell).
  3. No need to make exceptions for Far East languages.
  4. Japanese works fine.
  5. English continues to work fine, some more previously skipped yet important terms show up in the word cloud.
  6. If we see a term that is a stopword (independently from its length), we add it to the stopword list (just like we do now).

hroberts commented 7 years ago

I agree that we should keep the three-letter words; that's what we do now. I remain worried about the two-letter ones, though.

I don't think Solr's indexing behavior is relevant for our case. We aren't worried about what to search for; we are worried about how to display word clouds from the search results. There is no universal consensus on how to generate automated word clouds from large-scale open web content, because there are only a handful of folks doing it.

There's not much of a good reason for the short list, but I think the tiny list is sometimes useful for NLP purposes. Most similarity algorithms do best with no stopwords, but some do better after taking out a tiny list of stopwords. We don't do much work with similarity algorithms right now, though, and the difference between no stopwords and tiny stopwords is generally quite small, so I'm fine with simplifying the code.

Is there any case for returning 1 letter words, or are you just proposing adding every lone letter as a stopword?

I'm happy to revisit my early experiences that led me to chop out the two letter words. Our data has certainly changed a lot over the past few years, so it may be clean enough now that most of our two letter results will be relevant (or it may be the case that I just made bad decisions way back when).

But I would like to do some validation to make sure we're not inserting noise words into a bunch of our word clouds (and that we won't have to gradually add hundreds of two-letter noise stopwords to our stopword lists to avoid that noise). Those word clouds are among our most visible results, so I don't think we should make big changes to them without testing to make sure we're not introducing embarrassing problems.

I think the query you have chosen is a best-case example for what you are proposing -- 'obama' as a lone query returns a huge number of sentences, which are much less likely to suffer from source- or topic-specific artifacts. The harder cases that I tuned the system for are the more specific queries that are more likely to be dominated by ugly artifacts.

I think something like this would be a good validation:

-hal


rahulbot commented 7 years ago

Great example, Linas. We're all agreed that 3-letter words are worth including (as we do now). Is your argument that we just need a better stopword list for them?

Regarding 2-letter words: I checked with Anushka here, and she agrees that the way to move forward is to do some testing on existing topics to compare the existing solution vs. the proposed solution. In particular, she suggested making sure to include a non-US-focused topic (like "India rape" or something), because many Indian papers have fewer standards around abbreviations and use 2-letter abbreviations a lot more than US papers do (i.e. HC meaning "high court").

pypt commented 7 years ago

I disagree; I really see no good reason to go through an extensive validation of two-character words, especially as part of this task. What will it achieve? Will Mika get her Japanese support sooner? Will either of us get a published article out of it? It is way out of scope of the task at hand (add Japanese support), would take at least another week, block everyone involved, and prolong this flamewar we're having even further.

Currently the English stopword list contains all 26 one-character words (from this source: https://github.com/arc12/Text-Mining-Weak-Signals/wiki/Standard-set-of-english-stopwords) and 58 two-character words taken from various sources; all of them (as, at, be, ...) seem to be sensible choices by the original stopword list authors, which makes our list quite extensive. The rest of the two-character combinations (26 * 26 - 58 = 618) are not there, but they will either be 1) so rare that they won't even show up in the word cloud (unless we do a bad job at, say, HTML processing and let stuff like <li> through), or 2) ones that we actually want (e.g. xl for Keystone XL).

So I'd say we deploy this Japanese branch with the updated tokenization and stopword removal and see what happens. It has Japanese text-to-sentence and sentence-to-word tokenization, the language module structure remains tight, word clouds will look mostly the same, and sentence tokenization and stopword removal have been improved (e.g. right now even the word "wasn't" isn't being removed, and "non-discriminatory" gets split into "non" and "discriminatory" due to (\w+)).
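To illustrate the (\w+) problem:

```python
import re

# What the old whitespace/\w+ splitting effectively does:
print(re.findall(r'\w+', "wasn't non-discriminatory"))
# ['wasn', 't', 'non', 'discriminatory'] -- so the "wasn't" stopword entry
# never matches, and "non-discriminatory" is counted as two separate words
```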

If it comes to that, we can add the rest of the two-character words to the stopword list in about 15 minutes and get exactly the behavior we currently have (if you insist, we can do it even now). Those words would take up a whopping extra 1 KB of RAM (including NUL bytes and hashref overhead), and we would still have a discrete stopword list to point to.
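(For the record, generating the remaining combinations really is a few lines; a hypothetical sketch:)

```python
from itertools import product
from string import ascii_lowercase

# Every possible two-letter combination, should we decide to stopword them all
two_letter = {''.join(pair) for pair in product(ascii_lowercase, repeat=2)}
print(len(two_letter))  # 676, of which 58 are already in the list
```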

hroberts commented 7 years ago

I think we should have a general rule that if there are reasonable concerns about the impact of some change on production research results, we need to validate the change before putting it into production. In this case in particular, I think there is a significant risk that the changes would degrade the results for some types of queries without us knowing about it until reports filtered up from users, including for the U.S. and international cases raised by Anushka.

There has to be some balance here, of course. There may be theoretical changes that are necessary for architectural reasons and that are prohibitively expensive to test well. But I don't think that's the case here. I think the validation I've proposed is about half a day of work (couple hours to script the data generation, couple hours to look at the results).

-hal


pypt commented 7 years ago

"I think something like this would be a good validation"

Here's the validation:

https://docs.google.com/spreadsheets/d/14J8qM2WzYJvkORyVaG0t6euzoVC9XoO0mf40zQ5LAls/edit?usp=sharing

The stopword count in the "after" (with my fixes) test is slightly smaller, though it's close to the margin of error.

Please also note that the "after" word list includes terms such as president-elect, anti-immigration, cancer-causing, etc., which are not present in the current word cloud (due to sentences being split into words using (\w+)).

hroberts commented 7 years ago

Thanks Linas. This looks good to me. Please merge and deploy.

I'm trying to build up more content in the doc/validate/ directory so that we can refer back to these exercises over time. It would be great if you could spend 15 minutes adding a directory with this validation info and data there, but we don't have to block on that.

-hal


pypt commented 7 years ago

Done and deployed. Some notes:

pypt commented 7 years ago

Reopening, as Japanese stopwords don't seem to be removed: http://bit.ly/2pSWNU4

pypt commented 7 years ago

Done, fixed a typo.