Hi @dongrixinyu, we never ran an experiment with BERT using the CLS token as the representation of the whole document.
I think the strategy you use to embed the document and each candidate phrase with BERT will have an important impact.
In any case, we never tried the model with BERT, so some experiments should be made to see how it compares to sent2vec.
Is the CLS token of a pre-trained BERT a good representation of the whole document?
You can have a look at this thread:
https://github.com/google-research/bert/issues/261
As a first step, I'd suggest translating your document into English, trying the provided sent2vec models, and checking whether you get satisfactory results.
Finally, I see that you tried TF-IDF, which means that you have a corpus of documents. TF-IDF is a very strong baseline, since it can make use of corpus information, whereas EmbedRank tries to solve the problem without using any corpus information (except what was used to train the sent2vec model, i.e. Wikipedia).
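For reference, a minimal sketch of that kind of TF-IDF baseline, assuming the candidate phrases have already been extracted (the corpus, candidate list, and scoring function below are illustrative, not the exact setup discussed here):

```python
# Hypothetical TF-IDF baseline: rank a document's candidate phrases by the
# TF-IDF weights of their tokens, where IDF comes from a background corpus
# (this is the "corpus information" mentioned above).
from sklearn.feature_extraction.text import TfidfVectorizer

background_corpus = ["first news article ...", "second news article ..."]   # placeholder corpus
document = "text of the document we want keyphrases for ..."
candidates = ["candidate phrase one", "candidate phrase two"]                # e.g. NPs from POS patterns

vectorizer = TfidfVectorizer(lowercase=True)
vectorizer.fit(background_corpus)                 # IDF is learned from the corpus
doc_weights = vectorizer.transform([document])    # TF-IDF weights within this document
vocab = vectorizer.vocabulary_

def tfidf_score(phrase):
    # average TF-IDF weight of the phrase tokens inside this document
    idxs = [vocab[tok] for tok in phrase.lower().split() if tok in vocab]
    return sum(doc_weights[0, i] for i in idxs) / len(idxs) if idxs else 0.0

print(sorted(candidates, key=tfidf_score, reverse=True)[:10])
```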
Hi, thanks for answering me! There are still some problems concerning this method.

The sent2vec model is too large and I could not download it, so I did not try your code. I tried BERT in two ways: first, using the CLS representation as the sentence embedding; second, adding up all token embeddings in the text as the representation of the whole text. Why I chose these ways: the CLS token representation is trained in a way similar to Sent2Vec, in that the CLS token is purely composed of the representations of all the other tokens and is not itself added to the loss function. And adding all the token representations together (weighted) is an obvious way to obtain a sentence embedding. Both of these embeddings can be used for text classification.

The fact I encounter: all cosine similarity values are restricted to a very small range, (0.59, 0.98). I never see any candidate phrase pair produce a similarity below 0.5, which has a big influence on the result.

Candidate phrases are extracted via POS patterns and rules; I only choose NPs. I think this is not a crucial aspect.

TF-IDF: I used 1 million news texts to train the IDF parameters. I think the information contained in IDF is not as abundant as in BERT.

The key problem: no matter whether BERT or sent2vec is used, the distance between a text and itself is 1. Let's delete one word from the text and compute the distance between the text and its shortened version; the value is maybe 0.98 or 0.99, which is extremely close to 1. Let's delete more words until only a candidate phrase containing 5 words is left; its distance to the text is maybe 0.6. That value is very likely larger than the one for a candidate phrase containing fewer than 5 words, so the longer candidate phrases get extracted and ranked at the top. This phenomenon happens more frequently in Chinese, because a Chinese NP statistically consists of more words than an English one. Much longer candidate phrases describe details such as an unimportant person, locations, and objects, which are probably not the key information of the text.

This is my point of view. If there are any mistakes or bugs in my reasoning, please leave a message. Thank you!
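For concreteness, here is a minimal sketch of the two BERT pooling strategies described above (CLS token vs. pooling all token embeddings), using the Hugging Face transformers library; the checkpoint name and the use of mean pooling are assumptions, not the exact code used in these experiments:

```python
# Two sentence-embedding strategies with BERT: the [CLS] vector, or a pooling
# over all token vectors, compared to the document with cosine similarity.
import torch
from transformers import AutoTokenizer, AutoModel

name = "bert-base-chinese"                  # assumed checkpoint; any BERT model works
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(text, pooling="cls"):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden_size)
    if pooling == "cls":
        return hidden[0, 0]                          # [CLS] token vector
    return hidden[0].mean(dim=0)                     # average over all tokens

def cos(a, b):
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

doc_vec = embed("full text of the document", pooling="cls")
phrase_vec = embed("a candidate phrase", pooling="cls")
print(cos(doc_vec, phrase_vec))
```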
If you just add and don't average by the number of tokens, you'll be biased towards longer phrases. Sent2Vec averages the unigram/bigram embeddings of your input text.
If you have a specialized corpus, IDF information might be extremely valuable (for example, to remove corpus-specific stopwords).
The similarity between a text and itself is 1, true. If you remove one word, yes, it will still be very high. But I don't totally agree with the rest (with sent2vec as the representation).
There is certainly a bias towards longer phrases being closer to the document, but only as long as they remain relevant to it. If a candidate phrase is long but not relevant to the overall document, its similarity will be lower than that of a short phrase that is very relevant to the document.
I've translated your text into English (with DeepL), embedded your document using sent2vec, and computed some similarities:
"epidemic" appears 4 times in the translated document. similarity("epidemic", document) = 0.36
"local time" appears 3 times in the translated document similarity("local time", document) = 0.18
Now let's try with "disease" which doesn't appear at all in the translated document similarity("disease", document) = 0.27
Yes, the method might tend to extract long phrases, but a long phrase that is not relevant to the document at all will have a lower similarity with the document than a short phrase that is more relevant to the topic of the document.
If you are interested in the results of the keyphrase extraction: beta = 1 (no diversification), top 10 phrases:
beta = 0.5, top 10:
Note: the score that you see is not exactly the cosine similarity with the document; it has been renormalized.
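For readers unfamiliar with the beta parameter: it trades off relevance to the document against redundancy with the already-selected phrases, MMR-style, as in EmbedRank++. A minimal sketch of that selection step (EmbedRank++ additionally renormalizes the similarities, which is omitted here):

```python
# MMR-style selection: beta = 1 ranks purely by similarity to the document
# (no diversification); lower beta penalizes candidates that are close to
# phrases that were already selected.
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(cand_vecs, doc_vec, k=10, beta=0.5):
    relevance = [cos(v, doc_vec) for v in cand_vecs]
    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        def mmr(i):
            redundancy = max((cos(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0)
            return beta * relevance[i] - (1 - beta) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected   # indices of the chosen candidates, in selection order
```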
Thank you for your answer!
I do notice that sent2vec provides cosine similarities between 0 and 1, and the result with MMR is acceptable.
A statistical test of the similarities of all candidate phrases
I made a test. Given a text, I extract all candidate phrases and then compute the cosine similarity of each with the sentence embedding. Shorter phrases mostly rank at the bottom, while longer phrases rank at the top. Certainly, phrases related to the text get a higher value, but statistically, given two phrases that are both related to the text, e.g. "Head of the construction party enterprise" and "Head of the construction party", the longer one gets the higher value.
However, I did not test it on sent2vec, because the 5 GB unigram model is still too large to download. So I want to add a penalty factor on the length of candidate phrases.
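For illustration, a penalty of roughly this form is what is meant (the exact function and the alpha value are only assumptions):

```python
def penalized_score(cos_sim, num_tokens, alpha=0.05):
    # discount the phrase-document cosine similarity by the phrase length;
    # alpha controls how strongly long candidate phrases are penalized
    return cos_sim / (1.0 + alpha * num_tokens)
```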
This leaves a question: are longer phrases truly related to the text, or is there some detailed, trivial information among them, for example an unknown fire investigation informant?
If no sentence embedding is used, longest-common-substring matching will easily extract trivial long phrases; that means there is no semantic information in that method. Sentence embeddings aim to capture the semantic representation of a text, regardless of long unimportant phrases. That is the ideal situation. The fact is that sentence-to-vector models are not yet good enough to express the semantic information of a text without a lot of redundant information; we do not have a mature sentence-embedding technique yet. That is the key problem in all these methods.
Another, more intuitive way is to add a context vector to CBOW (I think you have read this paper before) and use it as the sentence embedding. This method is similar to the CLS token embedding.
As I observe in my test, averaging all tokens gives a more detailed sentence embedding, which contains a lot of information about the phrases and words. However, context embeddings like the CLS token and sent2vec (based on CBOW and skip-gram) provide a more general sentence embedding.
I think the latter can focus more on the main point of a text, rather than on details.
Combining them, I think, will yield a better result.
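For illustration, the combination could be as simple as concatenating or averaging the two vectors (a sketch under that assumption, not a tested recipe):

```python
# Two simple ways to combine the CLS embedding and the mean-pooled token
# embedding of the same BERT model (both assumed to be 1-D numpy arrays).
import numpy as np

def combine(cls_vec, mean_vec, w=0.5):
    concat = np.concatenate([cls_vec, mean_vec])   # keeps both views, doubles the dimension
    mixed = w * cls_vec + (1 - w) * mean_vec       # weighted average keeps the dimension
    return concat, mixed
```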
I tested this method on Chinese keyphrase extraction.
I used the BERT CLS embedding and the summed token embeddings as the sentence embedding, and applied your method.
The results show that a lot of details rather than key phrases are extracted. I'll give you an example: you could use the Google Translate API to get the English version of this news. This EmbedRank predicts:
However, the method combining TF-IDF and LDA features predicts:
Several other texts show a similar conclusion: EmbedRank is prone to predicting details rather than core phrases. I find that the longer the candidate phrases are, the more similar they are to the whole text. I assume this also happens with sent2vec.