shangjingbo1226 / AutoPhrase

AutoPhrase: Automated Phrase Mining from Massive Text Corpora
Apache License 2.0
1.18k stars, 276 forks

Unclear about the purpose of wiki_all.txt #72

Closed crystal0913 closed 4 years ago

crystal0913 commented 4 years ago

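Based on my reading of the paper, wiki_all.txt should be a very large set of n-gram candidates. Why does the wiki_all.txt provided here contain only a small amount of noise, with most of it identical to wiki_quality.txt?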

remenberl commented 4 years ago

wiki_all.txt is not used directly to provide negative training labels. Instead, we rely on it to filter out potential false negatives. In other words:

wiki_quality -> confident positives
frequent ngrams - wiki_all.txt -> confident negatives

Here, (A - B) is a set subtraction operation.
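As a rough illustration of the set arithmetic described above, here is a minimal Python sketch. It is not AutoPhrase's actual code: the function name `build_label_pools` and the toy phrase sets are hypothetical, and restricting positives to phrases seen in the corpus follows the later reply in this thread.

```python
def build_label_pools(frequent_ngrams, wiki_quality, wiki_all):
    """Split frequent n-grams into confident positive and negative pools.

    Hypothetical helper for illustration only; all inputs are Python sets.
    """
    # Confident positives: frequent n-grams that are listed in wiki_quality.
    positives = frequent_ngrams & wiki_quality
    # Confident negatives: frequent n-grams that appear nowhere in wiki_all.
    # Subtracting wiki_all removes potential false negatives.
    negatives = frequent_ngrams - wiki_all
    return positives, negatives


# Toy data (assumed, not taken from the real files).
frequent_ngrams = {"machine learning", "support vector machine", "of the", "is a"}
wiki_quality = {"machine learning", "deep learning"}
wiki_all = {"machine learning", "deep learning", "support vector machine"}

positives, negatives = build_label_pools(frequent_ngrams, wiki_quality, wiki_all)
print(sorted(positives))  # ['machine learning']
print(sorted(negatives))  # ['is a', 'of the']
```

In this toy example, "support vector machine" lands in neither pool: it is in wiki_all but not in wiki_quality, so it is neither a confident positive nor a confident negative.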

crystal0913 commented 4 years ago

δ in the paper = wiki_all.txt - wiki_quality.txt, right?

crystal0913 commented 4 years ago

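When mining phrases for a specific domain, how are the phrase quality estimator features (such as PMI and IDF) obtained for phrases in wiki_quality.txt that do not appear in the given corpus?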

remenberl commented 4 years ago

"δ in the paper = wiki_all.txt - wiki_quality.txt, right?"

Yes.

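"When mining phrases for a specific domain, how are the phrase quality estimator features (such as PMI and IDF) obtained for phrases in wiki_quality.txt that do not appear in the given corpus?"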

We only extract features for phrases that appear in both wiki_quality and the corpus. Phrases that do not exist in the corpus are simply ignored.
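A small Python sketch of that filtering step, under stated assumptions: the helper names `count_phrases` and `usable_positives` are hypothetical, the corpus is a toy list of strings, and naive substring counting stands in for the real tokenized n-gram counting.

```python
from collections import Counter


def count_phrases(corpus_lines, phrases):
    """Count occurrences of each phrase in the corpus (naive substring match)."""
    counts = Counter()
    for line in corpus_lines:
        text = line.lower()
        for phrase in phrases:
            counts[phrase] += text.count(phrase)
    return counts


def usable_positives(wiki_quality, corpus_lines):
    """Keep only wiki_quality phrases that occur in the corpus, with their counts."""
    counts = count_phrases(corpus_lines, wiki_quality)
    # Phrases with zero occurrences have no corpus statistics, so they are dropped.
    return {phrase: n for phrase, n in counts.items() if n > 0}


# Toy data (assumed for illustration).
corpus_lines = [
    "Machine learning models need data.",
    "We compare machine learning baselines.",
]
wiki_quality = {"machine learning", "quantum chromodynamics"}

usable = usable_positives(wiki_quality, corpus_lines)
print(usable)  # {'machine learning': 2}
```

Only the surviving phrases have corpus counts from which statistics such as PMI or IDF could be computed, which is why the absent ones contribute nothing to training.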