What is the purpose of wiki_all.txt?

shangjingbo1226 / AutoPhrase

AutoPhrase: Automated Phrase Mining from Massive Text Corpora

Apache License 2.0


What is the purpose of wiki_all.txt? #72

Closed crystal0913 closed 4 years ago

crystal0913 commented 4 years ago

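As I understand it from the paper, wiki_all.txt should be a very large set of n-gram candidates. Why does the wiki_all.txt provided here contain only a small amount of noise, with most of it identical to wiki_quality.txt?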

remenberl commented 4 years ago

wiki_all.txt is not used directly to provide negative training examples. Instead, we rely on it to filter out potential false negatives. In other words:

wiki_quality -> confident positives
frequent ngrams - wiki_all.txt -> confident negatives

Here, (A - B) is a set subtraction operation.
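To make the two set operations above concrete, here is a minimal sketch in Python. It is not AutoPhrase's actual code: the function name `build_label_pools` and the toy phrase lists are assumptions made for illustration, and the positives are restricted to phrases that also occur among the corpus's frequent n-grams.

```python
def build_label_pools(frequent_ngrams, wiki_quality, wiki_all):
    """Split frequent n-grams into label pools (hypothetical helper).

    positives: frequent n-grams that also appear in wiki_quality
    negatives: frequent n-grams that appear nowhere in wiki_all
    unlabeled: frequent n-grams in neither pool
    """
    frequent = set(frequent_ngrams)
    positives = frequent & set(wiki_quality)
    # Set subtraction: anything listed in wiki_all might still be a good
    # phrase, so it is kept out of the negative pool.
    negatives = frequent - set(wiki_all)
    unlabeled = frequent - positives - negatives
    return positives, negatives, unlabeled


# Toy data, invented for this example.
frequent_ngrams = ["data mining", "of the", "machine learning", "new york times"]
wiki_quality = ["data mining", "machine learning", "deep learning"]
wiki_all = wiki_quality + ["new york times"]

positives, negatives, unlabeled = build_label_pools(
    frequent_ngrams, wiki_quality, wiki_all
)
```

In this toy run, "new york times" is in wiki_all but not in wiki_quality, so it lands in neither the positive nor the negative pool. This is how wiki_all filters out potential false negatives.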

crystal0913 commented 4 years ago

δ in the paper = wiki_all.txt - wiki_quality.txt, right?

crystal0913 commented 4 years ago

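When mining phrases for a specific domain, how are the phrase quality estimator's features (such as PMI and IDF) obtained for phrases in wiki_quality.txt that do not appear in the given corpus?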

remenberl commented 4 years ago

> δ in the paper = wiki_all.txt - wiki_quality.txt, right?

Yes.

> When mining phrases for a specific domain, how are the phrase quality estimator's features (such as PMI and IDF) obtained for phrases in wiki_quality.txt that do not appear in the given corpus?

We only extract features for phrases that appear in both wiki_quality and the corpus. Phrases that do not exist in the corpus are simply ignored.
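The filtering described above might look roughly like the following sketch. The helper names (`count_ngrams`, `features_for_overlap`) are hypothetical, and raw and relative frequency stand in for the real features (PMI, IDF, and so on), which this thread does not spell out.

```python
from collections import Counter


def count_ngrams(tokens, max_len=3):
    """Count every n-gram of length 1..max_len in a token list."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def features_for_overlap(wiki_quality, corpus_tokens, max_len=3):
    """Compute placeholder features only for quality phrases found in the corpus."""
    counts = count_ngrams(corpus_tokens, max_len)
    total = len(corpus_tokens)
    features = {}
    for phrase in wiki_quality:
        freq = counts.get(phrase, 0)
        if freq == 0:
            continue  # phrase absent from the corpus: ignored, no features
        features[phrase] = {
            "frequency": freq,
            "relative_frequency": freq / total,
        }
    return features


# Toy corpus and phrase list, invented for this example.
corpus_tokens = "support vector machine beats support vector regression".split()
wiki_quality = ["support vector machine", "support vector", "deep learning"]

features = features_for_overlap(wiki_quality, corpus_tokens)
```

Here "deep learning" never occurs in the toy corpus, so it gets no feature entry and would contribute no positive training example.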