wiki_all.txt is not directly used to provide negative training examples. Instead, we rely on it to filter out potential false negatives. In other words,
wiki_quality.txt -> confident positives
frequent n-grams - wiki_all.txt -> confident negatives
Here, (A - B) is a set subtraction operation.
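As a minimal sketch of the set operations described above (toy data, not the actual AutoPhrase pipeline):

```python
# Toy illustration of how the positive/negative pools are derived.
frequent_ngrams = {"machine learning", "learning the", "data mining", "of the"}
wiki_quality = {"machine learning", "data mining", "support vector machine"}
wiki_all = {"machine learning", "data mining", "support vector machine",
            "learning the"}   # broader list; may contain borderline phrases

positives = frequent_ngrams & wiki_quality   # confident positives
negatives = frequent_ngrams - wiki_all       # confident negatives: subtracting
                                             # wiki_all filters potential false negatives
print(positives)  # {'machine learning', 'data mining'}
print(negatives)  # {'of the'}  ('learning the' is spared from the negative pool)
```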
δ in the paper = wiki_all.txt - wiki_quality.txt, right?
When mining phrases for a specific domain, how are the phrase quality estimator features (e.g., PMI, IDF) obtained for phrases in wiki_quality.txt that do not appear in the given corpus?
δ in the paper = wiki_all.txt - wiki_quality.txt, right?
Yes.
When mining phrases for a specific domain, how are the phrase quality estimator features (e.g., PMI, IDF) obtained for phrases in wiki_quality.txt that do not appear in the given corpus?
We only extract features for the phrases that overlap between wiki_quality.txt and the corpus. Those that do not exist in the corpus are simply ignored.
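A self-contained sketch of that filtering step, assuming a toy corpus and IDF as the example feature (this is not the actual AutoPhrase code):

```python
import math

# Features such as IDF are computed from the target corpus, so a
# wiki_quality phrase that never occurs in the corpus has no
# occurrences to derive features from and is simply skipped.
docs = [
    "support vector machine for text classification",
    "support vector machine and kernel methods",
    "deep learning for image classification",
]
wiki_quality = {"support vector machine", "hidden markov model"}

def idf(phrase, docs):
    """Inverse document frequency of a phrase over the corpus."""
    df = sum(phrase in d for d in docs)
    return math.log(len(docs) / df) if df else None

features = {}
for p in wiki_quality:
    score = idf(p, docs)
    if score is not None:       # "hidden markov model" never occurs: ignored
        features[p] = {"idf": score}

print(features)  # {'support vector machine': {'idf': ~0.405}}
```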
Based on my understanding of the paper, wiki_all.txt should be a very large set of candidate n-grams. Why does the wiki_all.txt provided here contain only a small amount of noise, with most of it identical to wiki_quality.txt?