google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

How does Google calculate document embeddings using BERT in its new search? #957

Open ghost opened 4 years ago

ghost commented 4 years ago

Google has started using BERT in its search engine. I imagine it creates an embedding for the search query, then computes some similarity measure against the potential candidate websites/pages, and finally ranks them in the search results.

I am curious how they create embeddings for the documents (the potential candidate websites/pages), if they do at all. Or am I interpreting it wrong?
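
To make my mental model concrete, here is a rough sketch of the kind of pipeline I am imagining, written with the Hugging Face `transformers` port of BERT for brevity. It mean-pools the last-layer token embeddings into one vector per text and ranks documents by cosine similarity. This is just one common approach; I have no idea whether Google actually does anything like this.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text):
    """Mean-pool the last-layer token embeddings into a single vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # [1, seq_len, 768]
    mask = inputs["attention_mask"].unsqueeze(-1)         # ignore padding positions
    return (hidden * mask).sum(1) / mask.sum(1)

query = "how do I open a checking account"
documents = [
    "A step-by-step guide to opening a checking account online.",
    "The best fishing spots along the river bank.",
]

# Rank candidate documents by cosine similarity to the query embedding.
q = embed(query)
scores = [torch.cosine_similarity(q, embed(d)).item() for d in documents]
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")
```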

siavoshkaviani commented 4 years ago

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a neural network-based technique for natural language processing pre-training. In plain English, it can be used to help Google better discern the context of words in search queries.

For example, in the phrases “nine to five” and “a quarter to five,” the word “to” has two different meanings, which may be obvious to humans but less so to search engines. BERT is designed to distinguish between such nuances to facilitate more relevant results.

Natural language processing (NLP) is a branch of artificial intelligence that deals with linguistics and enables computers to understand the way humans naturally communicate.

The breakthrough of BERT is in its ability to train language models based on the entire set of words in a sentence or query (bidirectional training) rather than the traditional way of training on the ordered sequence of words (left-to-right or combined left-to-right and right-to-left). BERT allows the language model to learn word context based on surrounding words rather than just the word that immediately precedes or follows it.
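
As a concrete (unofficial) illustration of that masked-language-model training objective, the snippet below, which assumes the Hugging Face `transformers` port of BERT, hides one word and lets the model fill it in using both its left and right context:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# BERT sees the words on BOTH sides of the mask at once.
inputs = tokenizer("I accessed the [MASK] account.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and list the model's top guesses for it.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
top_ids = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
# Words like "bank" should rank highly, because the model reads both the
# left context ("I accessed the") and the right context ("account").
```

A purely left-to-right model would have to guess the masked word before ever seeing "account".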

Google calls BERT “deeply bidirectional” because the contextual representations of words start “from the very bottom of a deep neural network.”

“For example, the word ‘bank’ would have the same context-free representation in ‘bank account’ and ‘bank of the river.’ Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence ‘I accessed the bank account,’ a unidirectional contextual model would represent ‘bank’ based on ‘I accessed the’ but not ‘account.’ However, BERT represents ‘bank’ using both its previous and next context — ‘I accessed the … account.’”
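
A rough sketch of that “bank” example, again assuming the Hugging Face `transformers` port rather than Google's production system: the input-embedding table assigns “bank” the same context-free vector everywhere, while the Transformer layers produce different contextual vectors in the two sentences.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    """Last-layer hidden state of the token 'bank' in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # [seq_len, 768]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_account = bank_vector("I accessed the bank account.")
v_river = bank_vector("I sat on the bank of the river.")

# The context-free representation (the embedding-table row for "bank") is
# identical in both sentences; only the contextual layers can separate the
# two senses, which typically shows up as a similarity well below 1.0.
sim = torch.cosine_similarity(v_account, v_river, dim=0).item()
print(f"contextual similarity of 'bank' across the two sentences: {sim:.3f}")
```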

BERT will enhance Google’s understanding of about one in 10 searches in English in the U.S. “Particularly for longer, more conversational queries, or searches where prepositions like ‘for’ and ‘to’ matter a lot to the meaning, Search will be able to understand the context of the words in your query,” Google wrote in its blog post.

However, not all queries are conversational or include prepositions. Branded searches and shorter phrases are just two examples of types of queries that may not require BERT’s natural language processing.

Ivan-Flecha-Ribeiro commented 4 years ago

The following curation might be useful to the OP and like-minded practitioners: