koheiw / newsmap

Semi-supervised algorithm for geographical document classification

Check false matches by * in Japanese #28

Open koheiw opened 5 years ago

koheiw commented 5 years ago

"タイ*" for Thailand produces a lot of false matches. For example, "タイヤ" (tire), "タイム" (time), "タイミング" (timing), "タイプ" (type), "タイトル" (title), "タイガー" (tiger).

This is a good reminder that we have to be careful about wildcards. We need to check the entries for other countries too.
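A minimal sketch of the false matching, assuming only that quanteda is attached: each of these words becomes a single token, and the glob pattern "タイ*" matches them all.

require(quanteda)
# each word is a single token; "タイ*" matches all of them as a glob
toks <- tokens(c("タイ", "タイヤ", "タイム", "タイプ")) # Thailand, tire, time, type
ntoken(tokens_lookup(toks, dictionary(list(TH = "タイ*")), valuetype = "glob"))
# all four documents count as matches for TH, but only the first is Thailand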

koheiw commented 5 years ago

The chance of a false match increases when we use *, but the need for a wildcard depends on how Japanese words are segmented during tokenization. Below is code to test whether country names are isolated from the elements that follow them. For example, we need * for Japan because tokens() does not separate "日本人" into "日本" and "人", while "アメリカ人" becomes "アメリカ" and "人".

require(quanteda)
#> Loading required package: quanteda
#> Package version: 1.4.4
#> Parallel computing: 2 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View
require(newsmap)
#> Loading required package: newsmap
require(stringi)
#> Loading required package: stringi

# flatten the dictionary to its level-3 (country) keys, keep the first
# entry per country, and strip the trailing wildcard
lis <- as.list(data_dictionary_newsmap_ja, TRUE, 3) %>% 
       lapply(function(x) stri_replace_last_fixed(x[1], "*", ""))

# followed by kanji (country names as part of demonym)
people_fixed <- unlist(lis) %>% 
    paste0("人") %>% 
    tokens() %>% 
    tokens_lookup(dictionary(lis)) %>% 
    ntoken()

people_glob <- unlist(lis) %>% 
    paste0("人") %>% 
    tokens() %>% 
    tokens_lookup(data_dictionary_newsmap_ja) %>% 
    ntoken()

(missed_people <- names(lis)[people_glob > 0 & people_fixed == 0])
#> [1] "CD" "CG" "ST" "PM" "KG" "JP" "MP"

# followed by katakana (country names as adjectives)
team_fixed <- unlist(lis) %>% 
    paste0("チーム") %>% 
    tokens() %>% 
    tokens_lookup(dictionary(lis)) %>% 
    ntoken()

team_glob <- unlist(lis) %>% 
    paste0("チーム") %>% 
    tokens() %>% 
    tokens_lookup(data_dictionary_newsmap_ja) %>% 
    ntoken()

(missed_team <- names(lis)[team_glob > 0 & team_fixed == 0])
#>  [1] "MG" "YT" "CD" "CG" "ST" "AI" "BQ" "PM" "KG" "GG" "MP" "NU" "TK"

union(missed_people, missed_team) # countries that need wildcard
#>  [1] "CD" "CG" "ST" "PM" "KG" "JP" "MP" "MG" "YT" "AI" "BQ" "GG" "NU" "TK"

Interestingly, it is not only tokens(): MeCab also behaves in a similar manner.

日本人
日本人  名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン

アメリカ人
アメリカ        名詞,固有名詞,地域,国,*,*,アメリカ,アメリカ,アメリカ
人      名詞,接尾,一般,*,*,*,人,ジン,ジン
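The output above can be reproduced from R, assuming the mecab command-line tool and a Japanese dictionary such as IPAdic are installed on the system:

system('echo "日本人" | mecab')     # analyzed as a single noun
system('echo "アメリカ人" | mecab') # split into "アメリカ" + "人"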
koheiw commented 5 years ago

@ClaudeGrasland here is the comparison between the new and old dictionaries.

[plot comparing the new and old dictionaries]

There is a large increase for small insular countries in the new version because I treated their names as phrases. The increases for Madagascar and Germany are due to wrong translations in the old version. Removing the wildcard matters little when tokens are not compounded, but the impact is large when they are compounded. For Thailand, for example, the change is -0.02% with non-compounded tokens but -24% with compounded tokens. This is because "タイ" matches only "タイ" "軍" (Thai military), not "タイ軍". This is a tricky issue (see the sketch below).

> diff["kh"]
          kh 
-0.000264131 
> diff2["kh"]
        kh 
-0.2463005 
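A minimal sketch of the compounding problem, with segmentation forced via what = "fastestword" (a hypothetical example, not taken from the comparison script):

require(quanteda)
toks <- tokens("タイ 軍", what = "fastestword")                     # "タイ" "軍"
comp <- tokens_compound(toks, phrase("タイ 軍"), concatenator = "") # "タイ軍"
ntoken(tokens_lookup(toks, dictionary(list(TH = "タイ"))))  # 1: fixed entry matches
ntoken(tokens_lookup(comp, dictionary(list(TH = "タイ"))))  # 0: compound is missed
ntoken(tokens_lookup(comp, dictionary(list(TH = "タイ*")))) # 1: glob still matches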

I produced this plot with the script at https://github.com/koheiw/newsmap/blob/issue-28/tests/misc/comapre-dictionaries.R