swisscom / ai-research-keyphrase-extraction

EmbedRank: Unsupervised Keyphrase Extraction using Sentence Embeddings (official implementation)
Apache License 2.0

This method tends to extract more detailed phrases, rather than the core ones #29

Closed dongrixinyu closed 2 years ago

dongrixinyu commented 4 years ago

I tested this method on Chinese keyphrase extraction.

I used the BERT CLS embedding, and also the sum of all token embeddings, as the sentence embedding, and applied your method.
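
A minimal sketch of these two document representations, assuming the Hugging Face `transformers` library and the public `bert-base-chinese` checkpoint (neither is part of this repo):

```python
# Sketch only: the two BERT-based document embeddings described above.
# Assumes the Hugging Face `transformers` library and the public
# `bert-base-chinese` checkpoint; neither is part of this repository.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def embed(text):
    """Return (CLS embedding, summed token embeddings) for one text."""
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    cls_vec = hidden[0, 0]            # strategy 1: [CLS] token
    sum_vec = hidden[0].sum(dim=0)    # strategy 2: sum of all tokens
    return cls_vec, sum_vec
```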

The results show that many detail phrases, rather than key phrases, are extracted. I'll give you an example (the Chinese news text below is given in English translation):

According to the latest COVID-19 statistics released by Johns Hopkins University, as of about 11:50 Beijing time on April 27, confirmed COVID-19 cases worldwide had reached 2,971,669, with 206,544 deaths. The data show that the United States is the hardest-hit country, with 965,783 confirmed cases and 54,883 deaths. It is followed by Spain, Italy, and France, each with more than 20,000 deaths. Notably, the US case count exceeds Spain's by nearly 740,000; Spain currently has 226,629 cumulative confirmed cases. An old friend called out to Trump: Americans' lives matter more than your election. "If you keep centering on yourself and keep playing stupid politics... if you cannot recognize your own mistakes, you cannot get it right." Trump has since "unfollowed" this old friend.
Video source: CGTN
According to Haiwainet, last week US President Trump made a startling remark at a press conference, hinting that disinfectant might be injected to kill the novel coronavirus, which immediately caused a public outcry. On the 26th local time, Maryland Governor Larry Hogan, a fellow Republican, said he was worried about the White House sending such confusing messages while the US faces a COVID-19 crisis. He also stressed that it is "critical" for the US president to stick to the facts when trying to inform the public about the epidemic.
Bill Gates: China did the right things when the epidemic broke out; sadly, the US could have done well
On April 26 local time, Bill Gates, asked in a CNN interview about accusations that China covered up the epidemic, praised China for "doing a lot of things right at the very beginning" of the outbreak, and said that, sadly, the US could have done well but did especially badly. He called blaming China incorrect and unfair. California official: the COVID-19 death rate in low-income communities is three times that of wealthier ones
According to CCTV News, on April 26 local time, Barbara, director of the Los Angeles County public health department, said at that day's COVID-19 press briefing that the county had recorded 440 new confirmed cases and 18 new deaths. She also revealed that residents of low-income communities in Los Angeles County are three times as likely to die of COVID-19 as those in wealthier communities: in communities where more than 30% of households are low-income, 16.5 per 100,000 people have died of COVID-19, versus 5.3 per 100,000 in communities where fewer than 10% of households are low-income.
According to Haiwainet, US media reported on the 26th that the last COVID-19 patient aboard the US Navy hospital ship Comfort, stationed in New York City, was discharged that day, and the ship's mission is drawing to a close. Since arriving in New York at the end of March, the 1,000-bed Comfort has treated only 182 patients.
White House economic adviser says US unemployment will hit 16% in April; the economy faces a historic shock
According to The Paper, on April 26 local time, White House economic adviser Kevin Hassett said the US unemployment rate could reach 16% or higher in April, and that more stimulus measures are needed to ensure a strong economic rebound.
In an interview with ABC News, Hassett said "the situation is very grave" and "I think this is the biggest negative shock in our economic history." "We are going to see an unemployment rate that approaches the levels of the Great Depression of the 1930s."

You could use the Google Translate API to get the English version of this news. This EmbedRank setup predicts (English glosses in parentheses):

低收入社区新冠肺炎致死率 (COVID-19 death rate in low-income communities)
白宫经济顾问凯文·哈塞特 (White House economic adviser Kevin Hassett)
世纪30年代 (the 1930s)
语出惊人 (made a startling remark)
多刺激措施 (more stimulus measures)

However, the combined TF-IDF and LDA feature method predicts:

美国的新冠肺炎确诊病例 (confirmed COVID-19 cases in the US)
新冠病毒 (the novel coronavirus)
低收入社区新冠肺炎致死率 (COVID-19 death rate in low-income communities)
美国海军 (the US Navy)
特朗普老友皮尔斯·摩根 (Trump's old friend Piers Morgan)

Several other texts show a similar result: EmbedRank is prone to predicting details rather than core phrases. I find that the longer a candidate phrase is, the more similar it is to the whole text. I assume this also happens with sent2vec.
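
That claim can be tested directly by rank-correlating candidate length with document similarity; a minimal sketch, with `embed_fn` and `candidates` as placeholders for whatever embedder and POS-based candidate extractor are being evaluated:

```python
# Sketch: measure whether candidate length correlates with document
# similarity. `embed_fn` and `candidates` are placeholders, not repo code.
import numpy as np
from scipy.stats import spearmanr

def length_similarity_correlation(document, candidates, embed_fn,
                                  length_fn=lambda p: len(p.split())):
    # For Chinese, length_fn=len (character count) may be more appropriate.
    doc_vec = embed_fn(document)
    sims, lengths = [], []
    for phrase in candidates:
        vec = embed_fn(phrase)
        sims.append(np.dot(doc_vec, vec)
                    / (np.linalg.norm(doc_vec) * np.linalg.norm(vec)))
        lengths.append(length_fn(phrase))
    # A strongly positive coefficient supports the "longer is closer" claim.
    return spearmanr(lengths, sims)
```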

kamilbs commented 4 years ago

Hi @dongrixinyu, we never ran an experiment with BERT using the CLS token as the representation of the whole document.

I think the strategy you use to embed the document and each candidate phrase with BERT will have an important impact.

In any case, we never tried the model with BERT, so some experiments are needed to see how it compares to sent2vec. Is the CLS token of a pre-trained BERT a good representation of the whole document?
You can have a look at this thread: https://github.com/google-research/bert/issues/261

As a first step, I suggest translating your document into English, trying the provided sent2vec models, and checking whether you get satisfactory results; a sketch of that follows below.
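
If I read this repository's README correctly, that experiment looks roughly like this (a sketch: it assumes `config.ini` already points at a downloaded sent2vec model and a CoreNLP tagger):

```python
# Sketch of the pipeline usage as shown in this repository's README
# (config.ini paths for the sent2vec model and CoreNLP tagger must be set;
# `raw_text` here is the translated news document).
import launch

embedding_distributor = launch.load_local_embedding_distributor()
pos_tagger = launch.load_local_corenlp_pos_tagger()

raw_text = "According to the latest COVID-19 statistics ..."  # translated news
top_10 = launch.extract_keyphrases(embedding_distributor, pos_tagger,
                                   raw_text, 10, 'en')
print(top_10)
```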

Finally, I see that you tried TF-IDF, which means you have a corpus of documents. TF-IDF is a very strong baseline, since it can exploit corpus statistics, whereas EmbedRank tries to solve the problem without any corpus information (except what was used to train the sent2vec model, i.e. Wikipedia).
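
To make that comparison concrete, here is a minimal TF-IDF candidate ranker (a sketch: scikit-learn, whitespace tokenization, and all variable names are illustrative, not part of EmbedRank):

```python
# Sketch: score candidate phrases of one document by the mean TF-IDF of
# their words, with IDF statistics learned from a background corpus.
# `background_docs` and `candidates` are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_rank(document, candidates, background_docs):
    vectorizer = TfidfVectorizer()
    vectorizer.fit(background_docs + [document])
    doc_scores = dict(zip(vectorizer.get_feature_names_out(),
                          vectorizer.transform([document]).toarray()[0]))

    def score(phrase):
        words = phrase.lower().split()
        return sum(doc_scores.get(w, 0.0) for w in words) / max(len(words), 1)

    return sorted(candidates, key=score, reverse=True)
```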

dongrixinyu commented 4 years ago

Hi, thanks for answering me! There are still some problems with this method.

The sent2vec model is too large for me to download, so I have not tried your code. I tried BERT in two ways: first, using the CLS representation as the sentence embedding; second, summing all token embeddings in the text as the representation of the whole text. Why I chose these two ways: the CLS token is trained in a way similar to Sent2Vec (it is composed purely of the representations of all the other tokens and is not itself added to the loss function), and summing all token embeddings (possibly weighted) is an obvious way to obtain a sentence embedding. Both of these embeddings can be used for text classification.

The fact I encounter: all cosine similarity values are restricted to a very small range, (0.59, 0.98). I never see any candidate phrase pair produce a similarity below 0.5, and this has a big influence on the result.

Candidate phrases are extracted via POS patterns and rules; I only keep NPs. I don't think this is the crucial aspect.

TF-IDF: I used 1 million news texts to train the IDF parameters. I think the information contained in IDF is not as abundant as in BERT.

The key problem: with both BERT and sent2vec, the similarity between a text and itself is 1. Delete one word from the text and compute the similarity between the text and its deleted version: the value is maybe 0.98 or 0.99, which is extremely close to 1. Keep deleting words until only a candidate phrase of, say, 5 words is left, and its similarity with the text may be around 0.6. That value is likely to be larger than that of a candidate phrase with fewer than 5 words, so longer candidate phrases are extracted and ranked at the top. This happens more often in Chinese because, statistically, Chinese NPs consist of more words than English ones. Much longer candidate phrases describe details such as an unimportant person, location, or object, which are probably not the key information of the text.

This is my point of view. If there are any mistakes or bugs in my reasoning, please leave a message. Thank you!
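
A sketch of that deletion experiment, with `embed_fn` standing in for any of the embedders discussed (not code from this repo):

```python
# Sketch of the deletion experiment described above: similarity between a
# text and progressively truncated versions of itself. `embed_fn` is a
# placeholder for any sentence embedder (BERT pooling, sent2vec, ...).
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def truncation_curve(text, embed_fn):
    words = text.split()
    doc_vec = embed_fn(text)
    curve = []
    for keep in range(len(words), 0, -1):      # drop one word at a time
        truncated = " ".join(words[:keep])
        curve.append((keep, cosine(doc_vec, embed_fn(truncated))))
    return curve  # expected to decay slowly from 1.0 as words are removed
```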

kamilbs commented 4 years ago
  1. You might give sent2vec unigrams a try; that model is much lighter and requires less memory. Also note that if you just sum token embeddings and don't average by the number of tokens, you will be biased towards longer phrases: sent2vec averages the unigram/bigram embeddings of your input text (see the sketch after this list).

  2. If you have a specialized corpus, IDF information might be extremely valuable (for example, to remove corpus-specific stopwords).

  3. It's true that the similarity between a text and itself is 1, and yes, if you remove one word it will stay very high. But I don't totally agree with the rest (with sent2vec as the representation).

There is certainly a bias towards longer phrases being closer to the document, but only as long as they remain relevant to it. If a candidate phrase is long but not relevant to the overall document, its similarity will be lower than that of a short phrase that is very relevant to the document.
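
To illustrate the averaging point in item 1, a sketch with random vectors standing in for real token embeddings:

```python
# Sketch of the point in item 1: a summed token representation has a norm
# that grows with phrase length, so any unnormalized (dot-product) score
# will favor longer phrases; averaging, as sent2vec does, removes that
# scale effect. `token_vecs` stands for per-token vectors from any model.
import numpy as np

def sum_embedding(token_vecs):
    return np.sum(token_vecs, axis=0)    # norm scales with token count

def mean_embedding(token_vecs):
    return np.mean(token_vecs, axis=0)   # length-normalized, as in sent2vec

vecs = np.random.randn(12, 768)          # e.g. 12 tokens of a long phrase
ratio = (np.linalg.norm(sum_embedding(vecs))
         / np.linalg.norm(mean_embedding(vecs)))
print(ratio)  # exactly 12: the summed vector's norm grows with length
```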

I've translated your text into English (with DeepL), embedded the document using sent2vec, and computed some similarities.

"epidemic" appears 4 times in the translated document. similarity("epidemic", document) = 0.36

"local time" appears 3 times in the translated document similarity("local time", document) = 0.18

Now let's try with "disease" which doesn't appear at all in the translated document similarity("disease", document) = 0.27

Yes, the method may tend to extract long phrases, but a long phrase that is not relevant to the document at all will still have a lower similarity with the document than a short phrase that is more relevant to the topic of the document.
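
For anyone who wants to reproduce these checks, a sketch using the epfml/sent2vec Python bindings (the model file name and input file are illustrative):

```python
# Sketch: recompute phrase/document similarities with the epfml/sent2vec
# Python bindings. The model file name and input file are illustrative.
import numpy as np
import sent2vec

model = sent2vec.Sent2vecModel()
model.load_model("wiki_bigrams.bin")  # one of the published pre-trained models

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

translated_document = open("translated_news.txt").read()  # DeepL output
doc_vec = np.ravel(model.embed_sentence(translated_document))

for phrase in ["epidemic", "local time", "disease"]:
    print(phrase, cosine(doc_vec, np.ravel(model.embed_sentence(phrase))))
```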

If you are interested in the results of the keyphrase extraction, here are the top 10 phrases with beta = 1 (no diversification):

[image: top-10 extracted keyphrases, beta = 1]

And the top 10 with beta = 0.5:

[image: top-10 extracted keyphrases, beta = 0.5]

Note: the score you see is not exactly the cosine similarity with the document; it has been renormalized.

dongrixinyu commented 4 years ago

Thank you for your answer!

  1. I do notice that sent2vec yields cosine similarities between 0 and 1, and the result with MMR is acceptable.

  2. A statistical test of the similarity of all candidate phrases.

I ran a test: given a text, extract all candidate phrases and compute their cosine similarity with the sentence embedding. Shorter phrases mostly ranked at the bottom, while longer phrases ranked at the top. Certainly, phrases related to the text get higher values, but statistically, given two phrases both related to the text, e.g. "head of the construction party enterprise" and "head of the construction party", the longer one gets a higher value.

However, I did not test this with sent2vec, because even the 5 GB unigram model is too large for me to download. So I want to add a penalty factor on the length of candidate phrases.
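
One possible form for that penalty, as a sketch (the functional form and `alpha` are free choices to tune, not something from the paper):

```python
# Sketch of the proposed length penalty: discount a candidate's cosine
# similarity by its length. `alpha` is a free parameter, not from the paper.
import math

def penalized_score(cos_sim, phrase_len, alpha=0.1):
    return cos_sim / (1.0 + alpha * math.log(1 + phrase_len))

# A longer phrase now needs a clearly higher raw similarity to win:
print(penalized_score(0.62, 7))  # ~0.51
print(penalized_score(0.58, 2))  # ~0.52, so the shorter phrase ranks first
```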

This leaves a question: are longer phrases truly related to the text, or do they carry detailed, trivial information, for example an unknown "fire investigation informant"?

Without any sentence embedding at all, longest-common-substring matching will easily extract trivial long phrases; such a method carries no semantic information. Sentence embeddings aim to capture the semantic representation of the text regardless of long unimportant phrases; that is the ideal situation. In fact, sentence-to-vector models are not yet good enough to express the semantics of a text without a string of redundant information. We do not yet have a mature sent2vec technique, and that is the key problem in all of these methods.

  3. Averaging all tokens vs. a more intuitive sentence embedding. I read the Sent2Vec paper. The method is an extension of CBOW word2vec, and the final sentence embedding is an average of all unigram and n-gram embeddings. This is similar to averaging all BERT tokens.

Another, more intuitive way adds a context vector to CBOW (I think you have read this paper before) and uses it as the sentence embedding. This is similar to the CLS token embedding.

As I observe in my tests, averaging all tokens gives a more detailed sentence embedding, containing a lot of information about the phrases and words, while a context embedding like CLS (or a sent2vec-style model based on CBOW/skip-gram) provides a more general sentence embedding.

I think the latter can focus more on the main point of a text rather than on details.

Combining them, I think, would yield a better result.
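
One simple way to combine them, as a sketch: mix the similarity to a global (CLS-style or context-vector) document embedding with the similarity to a mean-pooled one. The 0.5 weight is arbitrary and would need tuning.

```python
# Sketch of the proposed combination: mix similarities against a "global"
# document embedding (CLS-style / context vector) and a mean-pooled one.
# The default 0.5 weight is arbitrary and would need tuning.
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def combined_score(phrase_vec, cls_doc_vec, mean_doc_vec, weight=0.5):
    return (weight * cosine(phrase_vec, cls_doc_vec)
            + (1 - weight) * cosine(phrase_vec, mean_doc_vec))
```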