Support request: tokenization

nick-magnini commented 8 years ago

It would be great to add org.apache.lucene.analysis for smarter tokenization across languages. That would make processing languages such as Chinese with your library more sensible.

dav009 commented 8 years ago

It would be good to make the pipeline more independent. Are you working with Japanese/Korean/Chinese?

nick-magnini commented 8 years ago

Yes, I do. I usually use org.apache.lucene.analysis for various languages.

dav009 commented 8 years ago

Great to know. Does the current pipeline actually produce anything other than garbage for those languages? (I've only played with a few of the major European languages.)

nick-magnini commented 8 years ago

I've got to check it, but for text that mixes Asian languages and English (e.g., Wikipedia), the smart Chinese tokenizer from Lucene usually works well, and it's pretty fast and scalable.

dav009 commented 8 years ago

@nick-magnini any recommendation on which tokenizer to use for this particular task:

Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.

ChineseAnalyzer (in the analyzers/cn package): Index unigrams (individual Chinese characters) as a token.
CJKAnalyzer (in this package): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
SmartChineseAnalyzer (in the analyzers/smartcn package): Index words (attempt to segment Chinese text into words) as tokens.
Example phrase： "我是中国人"
ChineseAnalyzer: 我－是－中－国－人
CJKAnalyzer: 我是－是中－中国－国人
SmartChineseAnalyzer: 我－是－中国－人
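
To make the first two strategies concrete, here is a minimal Python sketch that reproduces the unigram and bigram outputs for the example phrase. It is only an illustration of the idea, not Lucene's implementation: the function names are made up, and the single-character fallback in `bigrams` is an assumption. Word segmentation as done by SmartChineseAnalyzer needs a dictionary or statistical model, so it is not reproduced here.

```python
def unigrams(text):
    """One token per character (the ChineseAnalyzer behaviour described above)."""
    return list(text)


def bigrams(text):
    """Overlapping pairs of adjacent characters (the CJKAnalyzer behaviour described above)."""
    if len(text) < 2:
        # Assumption for this sketch: a lone character is kept as its own token.
        return list(text)
    return [text[i:i + 2] for i in range(len(text) - 1)]


phrase = "我是中国人"
print("－".join(unigrams(phrase)))
print("－".join(bigrams(phrase)))
```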

nick-magnini commented 8 years ago

In my experience, SmartChineseAnalyzer works better for wiki pages, since they mix English and Chinese. In addition, Jieba is one of the best Chinese segmenters (tokenizers). For converting zh between Traditional and Simplified, OpenCC is recommended.

dav009 commented 8 years ago

So then the best options are:

Jieba: https://github.com/huaban/jieba-analysis (tokenizing)
OpenCC: https://github.com/BYVoid/OpenCC (simplifying)

I haven't done much with Asian languages, but are you suggesting the pipeline should be:

tokenize + convert from Traditional to Simplified?

dav009 commented 8 years ago

Also, I assume we are working on the Chinese Wikipedia here, i.e. zh. There are others:

zh-yue (Cantonese?)
wuu
gan

Does zh have a preference for which kind of Chinese to use (Traditional/Simplified)?

nick-magnini commented 8 years ago

The conversion should be optional (a future plan). Simplified <-> Traditional conversion should happen before tokenization. For now I don't think you need to worry about that step; I don't bother with it.
Lucene's analyzer is faster. Jieba is slower but more precise.

So the choice is between Lucene and Jieba. In terms of scalability and efficiency, I'll vote for Lucene since it has CJK support.
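
A rough sketch of that ordering in Python, with the conversion step optional and applied before tokenization. Everything here is hypothetical: `build_pipeline`, `toy_convert` and `toy_tokenize` are stand-ins made up for illustration, and a real setup would plug in OpenCC and a Lucene or Jieba tokenizer instead.

```python
def build_pipeline(tokenize, convert=None):
    """Return a function that optionally converts the script, then tokenizes.

    Both arguments are plain callables taking a string; `convert` may be
    omitted, which matches "the conversion should be optional".
    """
    def run(text):
        if convert is not None:
            text = convert(text)  # conversion runs before tokenization
        return tokenize(text)
    return run


# Toy stand-ins, for illustration only: a one-entry Traditional -> Simplified
# table and a whitespace tokenizer.
TOY_TABLE = str.maketrans({"維": "维"})


def toy_convert(text):
    return text.translate(TOY_TABLE)


def toy_tokenize(text):
    return text.split()


plain = build_pipeline(toy_tokenize)
with_conversion = build_pipeline(toy_tokenize, convert=toy_convert)

print(plain("維 基 百科"))            # conversion skipped
print(with_conversion("維 基 百科"))  # conversion applied first
```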

dav009 commented 8 years ago

Alright, let's start with tokenization. I'm going to run the tool on zh.

dav009 commented 8 years ago

@nick-magnini generating the model now.

Here is a sample of the tokenization using SmartChineseAnalyzer. It would be good to know whether it looks alright:

DBPEDIA_ID/倫敦珍寶 倫 敦 珍 寶 DBPEDIA_ID/5月1日 5 月 1 日 葡萄牙语 维 基 百科 达到 40 000 条目 DBPEDIA_ID/4月30日 4 月 30 日 加 利 西亚 语 维 基 百科 达到 5 000 条目 爪哇 语 维 基 百科 达到 500 条目 DBPEDIA_ID/4月29日 4 月 29 日 威 尔 斯 语 维 基 百科 达到 3 000 条目 DBPEDIA_ID/4月28日 4 月 28 日 法语 维 基 语录 达到 2 000 条目 法语 维 基 字典 达到 5 000 条目 DBPEDIA_ID/4月27日 4 月 27 日 芬兰 语 维 基 百科 达到 20 000 条目 保加利亚 语 维 基 字典 达到 20 000 条目 维 基 百科 中文版 达到 26 000 条目 第 26 000 条目 是 user peterpan 创建 的 DBPEDIA_ID/骰寶 骰 寶 DBPEDIA_ID/4月22日 4 月 22 日 车臣 语 维 基 百科 誕 生 DBPEDIA_ID/4月18日 4 月 18 日 维 基 百科 中文版 达到 25 000 条目 第 25 000 条目 是 user hamham 创建 的 DBPEDIA_ID/托马斯·吉尔丁 托 马 斯 吉 尔 丁 DBPEDIA_ID/4月9日 4 月 9 日 维 基 百科 中文版 达到 24 000 条目 第 24 000 条目 是 user sl 创建 的 DBPEDIA_ID/太平山_(香港) 香港 山 頂 DBPEDIA_ID/4月7日 4 月 7 日 DBPEDIA_ID/4月6日 4 月 6 日 维 基 百科 中文版 条目 数 超过 丹麦 语 版 按照 条目 数 排名 位居 所有 语言 的 第 11 名 DBPEDIA_ID/4月4日 4 月 4 日 已经 有 50 本 课本 DBPEDIA_ID/3月31日 3 月 31 日 达到 500 词条 DBPEDIA_ID/3月26日 3 月 26 日 维 基 百科 中文版 达到 23 000 条目 第 23 000 条目 是 wangyunfeng 创建 的 DBPEDIA_ID/刀币 刀币 DBPEDIA_ID/3月25日 3 月 25 日 所有 语言 维 基 语录 达到 10 000 条目 DBPEDIA_ID/3月24日 3 月 24 日 塞尔维亚 语 维 基 百科 达到 10 000 条目 DBPEDIA_ID/3月22日 3 月 22 日 荷兰语 维 基 百科 达到 60 000 条目 波兰 语 维 基 百科 达到 60 000 条目 DBPEDIA_ID/3月21日 3 月 21 日 挪威 语 维 基 百科 达到 20 000 条目 DBPEDIA_ID/3月17日 3 月 17 日 英文版 维 基 百科 达到 500 000 条目 DBPEDIA_ID/3月12日 3 月 12 日 维 基 资源 对 是否 要 分设 语言 子 域名 准备 重新 开始 投票 DBPEDIA_ID/3月10日 3 月 10 日 維 基 百科 现在 排名 alexa 参考 网站 50 强 的 第 4 名 維 基 百科 中文版 達 到 22000 條 目 第 22000 条目 是 创建 的 DBPEDIA_ID/锆 锆 DBPEDIA_ID/3月9日 3 月 9 日 目前 按照 内部 链 接 数 排列 中文版 进入 前 10 名 位于 葡萄牙语 之前 意大利 语 之后 DBPEDIA_ID/3月5日 3 月 5 日 台湾 维 基 人 在 台北 聚会 DBPEDIA_ID/2月20日 2 月 20 日 維 基 百科 中文版 達 到 21000 條 目 DBPEDIA_ID/2月16日 2 月 16 日 維 基 百科 中文版 條 目 數 超 過 世界 語 版 DBPEDIA_ID/2月6日 2 月 6 日 維 基 百科 中文版 達 到 20000 條 目 DBPEDIA_ID/2月4日 2 月 4 日 达到 10000 个 页面 第 10000 个 页面 是 日 文 的 DBPEDIA_ID/1月26日 1 月 26 日 維 基 百科 中文版 達 到 19000 條 目 DBPEDIA_ID/1月10日 1 月 10 日 维 基 百科 中文版 达到 18000 条目 第 18000 条目 是 创建 的 DBPEDIA_ID/纳米医学 纳米 医学

dav009 commented 8 years ago

Training: dimensions: 300, min threshold: 10, window: 10
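
For readers unfamiliar with these settings, here is a small Python sketch of what two of them control, assuming "min threshold" means a minimum token frequency and "window" means the number of neighbours considered on each side of a word. The helper names and the toy sentences are made up; real trainers add refinements (such as subsampling) that are left out here.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Keep only tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_count}


def context_pairs(sentence, vocab, window):
    """Yield (center, context) pairs within `window` positions of each other."""
    kept = [tok for tok in sentence if tok in vocab]  # rare tokens are dropped first
    pairs = []
    for i, center in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        pairs.extend((center, kept[j]) for j in range(lo, hi) if j != i)
    return pairs


# Toy corpus; the thread's run used min threshold 10 and window 10.
sentences = [["维", "基", "百科", "达到", "条目"], ["维", "基", "百科", "中文版"]]
vocab = build_vocab(sentences, min_count=2)
print(context_pairs(sentences[0], vocab, window=1))
```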

dav009 commented 8 years ago

@nick-magnini the model is trained, and some basic entity-similarity examples give what seem to be good results:

positive=[u'DBPEDIA_ID/贝拉克·奥巴马', u'DBPEDIA_ID/俄罗斯'], negative=[u'DBPEDIA_ID/美国']
俄罗斯 -- 0.577268242836
DBPEDIA_ID/吉尔吉斯斯坦总统 -- 0.559932947159
DBPEDIA_ID/俄罗斯国家杜马 -- 0.534712553024
DBPEDIA_ID/乌克兰总统 -- 0.523086071014
哈萨克斯坦 -- 0.523066163063
DBPEDIA_ID/2008年俄罗斯总统选举 -- 0.519150972366
DBPEDIA_ID/纳扎尔巴耶夫 -- 0.518714308739
DBPEDIA_ID/哈萨克斯坦 -- 0.513309001923
DBPEDIA_ID/蒙古国总统 -- 0.513016939163
DBPEDIA_ID/普京 -- 0.512183487415
吉尔吉斯斯坦 -- 0.509196817875
DBPEDIA_ID/哈萨克斯坦总统 -- 0.507659435272
DBPEDIA_ID/梅德韦杰夫 -- 0.506721496582
DBPEDIA_ID/库奇马 -- 0.502971172333
DBPEDIA_ID/2012年俄罗斯总统选举 -- 0.502592504025
DBPEDIA_ID/俄罗斯总统 -- 0.501340091228
DBPEDIA_ID/尼古拉·萨科齐 -- 0.501158356667
DBPEDIA_ID/俄罗斯联邦总统 -- 0.501093864441
DBPEDIA_ID/巴基斯坦总理 -- 0.500897169113
DBPEDIA_ID/烏克蘭總統 -- 0.499838590622
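
The query above has the shape of a word2vec-style analogy: add the positive vectors, subtract the negative one, and rank everything else by cosine similarity. A minimal pure-Python sketch of that idea follows. The three-dimensional vectors are invented toy values, not taken from the trained model, and whether the library computes the ranking exactly this way is an assumption.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(vectors, positive, negative, topn=3):
    """Rank keys by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for key in positive:
        query = [q + x for q, x in zip(query, unit(vectors[key]))]
    for key in negative:
        query = [q - x for q, x in zip(query, unit(vectors[key]))]
    query = unit(query)
    skip = set(positive) | set(negative)  # the query terms themselves are excluded
    scored = [
        (key, sum(q * x for q, x in zip(query, unit(vec))))
        for key, vec in vectors.items()
        if key not in skip
    ]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:topn]


# Invented toy vectors (axes: "US", "Russia", "head of state"), illustration only.
toy = {
    "DBPEDIA_ID/美国": [1.0, 0.0, 0.0],
    "DBPEDIA_ID/俄罗斯": [0.0, 1.0, 0.0],
    "DBPEDIA_ID/贝拉克·奥巴马": [1.0, 0.0, 1.0],
    "DBPEDIA_ID/普京": [0.0, 1.0, 1.0],
    "DBPEDIA_ID/锆": [0.1, 0.1, -1.0],
}
ranked = most_similar(
    toy,
    positive=["DBPEDIA_ID/贝拉克·奥巴马", "DBPEDIA_ID/俄罗斯"],
    negative=["DBPEDIA_ID/美国"],
)
for key, score in ranked:
    print(key, "--", score)
```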

Since it looks like you are trying to build models with several tools, I will share the corpus and the model.

dav009 commented 8 years ago

@nick-magnini

Cleaned Chinese (zh) wiki2vec corpus : https://github.com/idio/wiki2vec/blob/feature/DP-zh-tokenizer-support/torrents/zh_chinese_wiki2vec_cleaned_corpus.torrent
Chinese (zh) Wiki2vec model : https://github.com/idio/wiki2vec/blob/feature/DP-zh-tokenizer-support/torrents/zh_chinese_wiki2vec_model.torrent

tgalery commented 8 years ago

Sorry for jumping in so late, but might it be a good call to implement something more generic? The good thing about Lucene analyzers is that you could just use the analyzer for the corresponding locale and the job would be done. This would work for Chinese, but also for Czech and other languages. ICU would be another possibility: its Chinese tokenizer seems to produce results as good as smartcn's, and again it would be fairly universal. Another possibility would be to specify a tokenizer/analyzer interface (if we think that simplification, lemmatization, stemming, or morphological analysis would also be desirable operations), so the community can write whichever classes they want.
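
As a rough illustration of the interface idea, here is a Python sketch with a per-locale registry. All names (`Tokenizer`, `REGISTRY`, `tokenizer_for`) are hypothetical and not part of wiki2vec, and the two implementations are naive stand-ins for real Lucene or ICU analyzers.

```python
from abc import ABC, abstractmethod


class Tokenizer(ABC):
    """Hypothetical interface: anything that turns text into a list of tokens."""

    @abstractmethod
    def tokenize(self, text):
        ...


class WhitespaceTokenizer(Tokenizer):
    """Splits on whitespace; a placeholder for languages written with spaces."""

    def tokenize(self, text):
        return text.split()


class CharacterTokenizer(Tokenizer):
    """Naive placeholder for a CJK-aware tokenizer: one token per non-space character."""

    def tokenize(self, text):
        return [ch for ch in text if not ch.isspace()]


# Contributors would register their own classes per locale here.
REGISTRY = {"en": WhitespaceTokenizer, "zh": CharacterTokenizer}


def tokenizer_for(locale, default=WhitespaceTokenizer):
    """Look up the tokenizer class for a locale, falling back to a default."""
    return REGISTRY.get(locale, default)()


print(tokenizer_for("zh").tokenize("我是 中国人"))
print(tokenizer_for("cs").tokenize("Dobrý den"))  # unregistered locale uses the default
```

The pipeline would then depend only on `tokenize`, so swapping Lucene, ICU, or Jieba behind it would not require changes elsewhere.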

dav009 commented 8 years ago

@nick-magnini any chance you can evaluate the generated model before I jump into a refactor ?

dav009 commented 8 years ago

@nick-magnini any news on reviewing the given branch? Otherwise I will close this issue.

nick-magnini commented 8 years ago

Thanks. Let me discover and explore. Thanks again.

dav009 commented 8 years ago

If you are a Chinese speaker and could generate a dataset similar to https://github.com/arfon/word2vec/blob/master/questions-words.txt, that would be great.
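
For anyone preparing such a file: the format is lines beginning with ":" that name a section, followed by rows of four whitespace-separated words (a is to b as c is to d). Below is a minimal Python sketch of a parser for it. The function name is made up, and the Chinese section and rows in the sample are invented purely to show the layout; a real dataset would need words that match the tokens in the model's vocabulary.

```python
def parse_analogies(lines):
    """Group four-word analogy rows under their ': section' headers."""
    sections = {}
    current = "uncategorized"  # used only if rows appear before any header
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections.setdefault(current, [])
            continue
        words = line.split()
        if len(words) != 4:
            raise ValueError(f"expected 4 words, got {len(words)}: {line!r}")
        sections.setdefault(current, []).append(tuple(words))
    return sections


# The first section mirrors the English file; the second is an invented example.
sample = """\
: capital-common-countries
Athens Greece Baghdad Iraq
: 首都-国家
北京 中国 东京 日本
巴黎 法国 柏林 德国
"""
parsed = parse_analogies(sample.splitlines())
for name, rows in parsed.items():
    print(name, len(rows))
```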

idio / wiki2vec

Support request: tokenization #13