tf-idf method - Githubissues

shunsuke227ono / pelican

News Curation Application with AI Recommendation Engine

https://the-pelican.herokuapp.com/

Closed shunsuke227ono closed 9 years ago

shunsuke227ono commented 9 years ago

Clarify the tf-idf method and the preprocessing of the article body that it requires.

shunsuke227ono commented 9 years ago

I suspect it would be better to extract only nouns and verbs, but I'm not sure. That is a tuning matter; let's try several variations.
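A minimal sketch of the noun/verb filter mentioned above. It assumes the morphological analyzer has already produced tokens carrying a surface form and a part-of-speech tag; the `Token` struct, the `KEPT_POS` list, and the `content_words` helper are hypothetical names for illustration and not part of any gem. The tags `名詞` (noun) and `動詞` (verb) are assumed to follow the convention of MeCab's IPA dictionary.

```ruby
# Hypothetical token shape: surface form plus part-of-speech tag,
# as a morphological analyzer such as MeCab might report them.
Token = Struct.new(:surface, :pos)

# Parts of speech to keep (noun, verb). This list is the tuning knob.
KEPT_POS = %w[名詞 動詞].freeze

# Return the surface forms of tokens whose POS is in the allow list.
def content_words(tokens, kept_pos = KEPT_POS)
  tokens.select { |t| kept_pos.include?(t.pos) }.map(&:surface)
end

tokens = [
  Token.new("ニュース", "名詞"),
  Token.new("を", "助詞"),
  Token.new("読む", "動詞")
]
words = content_words(tokens)
```

Passing a different `kept_pos` array makes it easy to compare, for example, nouns only against nouns plus verbs.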

shunsuke227ono commented 9 years ago

Computing it myself doesn't look particularly hard either: http://takuti.me/note/tf-idf/

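For now, the most popular gem that does the calculation appears to be https://github.com/jpmckinney/tf-idf-similarity. It can apparently even compute the similarity between documents, so it looks like exactly what we need. => It provides both similarity and tf-idf.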

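Wait, can this handle Japanese, though? Doubtful... it even does the morphological analysis itself. I'd like to hand just the morphological analysis part off to MeCab, but is that possible? => As expected, it cannot do Japanese morphological analysis. => Well, it's not that complicated, so let's write the formula ourselves in lib.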
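A rough sketch of what a hand-written version in lib could look like. It assumes each document has already been tokenized into an array of strings (for Japanese, by MeCab or similar), and it uses the plain definitions tf = count / document length and idf = log(N / df); smoothed variants exist and may suit the data better. The name `tf_idf_vectors` is made up for this example.

```ruby
# docs: array of documents, each an array of term strings.
# Returns one { term => tf-idf weight } hash per document.
def tf_idf_vectors(docs)
  n = docs.size.to_f

  # Document frequency: in how many documents each term appears.
  df = Hash.new(0)
  docs.each { |doc| doc.uniq.each { |term| df[term] += 1 } }

  docs.map do |doc|
    # Raw term counts within this document.
    counts = doc.each_with_object(Hash.new(0)) { |term, h| h[term] += 1 }

    counts.each_with_object({}) do |(term, count), vec|
      tf  = count.to_f / doc.size
      idf = Math.log(n / df[term])
      vec[term] = tf * idf
    end
  end
end

docs = [
  %w[cat sat mat],
  %w[cat ate fish],
  %w[dog ate bone]
]
vectors = tf_idf_vectors(docs)
```

With this unsmoothed idf, a term that occurs in every document gets weight 0, which is usually what we want for very common words.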

shunsuke227ono commented 9 years ago

Found a gem that is about as simple as writing it myself: https://github.com/reddavis/TF-IDF. It only computes tf-idf.

As for deriving similarity from the distance between vectors, I suppose I'll compute that myself.
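One possible way to do that is cosine similarity, sketched below. This is an assumption on my part, since the measure has not been decided here; Euclidean distance would be another option. Vectors are assumed to be sparse hashes of `{ term => weight }`, and `cosine_similarity` is a hypothetical helper name.

```ruby
# Cosine similarity between two sparse vectors given as { term => weight }.
# Returns 0.0 when either vector has zero length.
def cosine_similarity(a, b)
  # Terms missing from b contribute nothing to the dot product.
  dot    = a.sum { |term, weight| weight * b.fetch(term, 0.0) }
  norm_a = Math.sqrt(a.values.sum { |w| w * w })
  norm_b = Math.sqrt(b.values.sum { |w| w * w })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

article_a = { "cat" => 0.4, "fish" => 0.2 }
article_b = { "cat" => 0.1, "dog" => 0.5 }
similarity = cosine_similarity(article_a, article_b)
```

Because tf-idf weights are non-negative, the result falls between 0 (no shared terms) and 1 (same direction), independent of document length.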