opensearch-project / search-processor

Search Request Processor: pipeline for transformation of queries and results inline with a search request.
Apache License 2.0

[RFC] Search Relevancy - from A Schema Perspective #8

Closed YANG-DB closed 2 years ago

YANG-DB commented 2 years ago

Search Relevancy - A Schema Perspective

This document will present the key concepts of the following concerns:

Introduction

A review of Google Search, the world's most popular search engine and a part of our everyday activities, makes it apparent that a search is conducted with many notions beyond the basic 'phrase' one is querying for.

Let us review these notions to get a better understanding of how Google enables its search to be both relevant and accurate:

Entities

Google describes an entity, or named entity, as a single, well-defined thing or concept. Google Search works in three stages, and not all pages make it through each stage.

When a user enters a query, Google searches the index for matching pages and returns the results that are the highest quality and most relevant to the user.

Relevancy is determined by hundreds of factors, which could include information such as the user's location, language, and device (desktop or phone).

For example:

_searching for "bicycle repair shops" would show different results to a user in Paris than it would to a user in Hong Kong._

Entities compose our everyday world and also our spoken language. We talk about entities and think in terms of things and entities. To reflect this, Google (and many other companies) have turned to the knowledge base domain for assistance.

Common Entities

One of the core mechanisms they use is entity recognition. If Google understands that a query contains the same entities as another query it has seen before, with few qualifiers, that is an indication that the result sets may be identical or highly similar.

Standardization of the domain knowledge

- https://schema.org/docs/schemas.html

Schema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.

Schema.org vocabulary can be used with many different encodings. These vocabularies cover entities, relationships between entities, and actions. They can easily be extended through a well-documented extension model. Over 10 million sites use Schema.org to mark up their web pages and email messages.

Many applications from Google, Microsoft, Pinterest, Yandex and others already use these vocabularies to power rich, extensible experiences.

Schema.org was founded by Google, Microsoft, Yahoo and Yandex. Schema.org vocabularies are developed by an open community process.

Schema.org is defined as two hierarchies:

The main schema.org hierarchy is a collection of types (or "classes"), each of which has one or more parent types.
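To make the "one or more parent types" point concrete, here is a minimal sketch of such a hierarchy. The `PARENTS` table is a small hand-copied slice of schema.org, and the helper names `ancestors` and `is_a` are hypothetical, not part of any schema.org tooling.

```python
# Hand-copied slice of the schema.org type hierarchy (illustrative only).
# Each type maps to its direct parent types; a type may have more than one.
PARENTS = {
    "Thing": [],
    "Organization": ["Thing"],
    "Place": ["Thing"],
    "LocalBusiness": ["Organization", "Place"],
}


def ancestors(type_name):
    """Return every type reachable by following parent links upward."""
    seen = set()
    stack = list(PARENTS[type_name])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def is_a(type_name, candidate):
    """True if type_name is candidate itself or inherits from it."""
    return type_name == candidate or candidate in ancestors(type_name)
```

With this table, `is_a("LocalBusiness", "Place")` and `is_a("LocalBusiness", "Organization")` both hold, which is what lets one indexed document match queries about either kind of entity.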

- https://www.wikidata.org/wiki/Wikidata:Database_reports/EntitySchema_directory

Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.

The focus of Wikidata is structured data. Structured data refers to data that has been organized and is stored in a defined way, often with the intention to encode meaning and preserve the relationships between different data points within a dataset.

The Wikidata repository consists mainly of items, each one having a label, a description and any number of aliases. Statements describe detailed characteristics of an Item and consist of a property and a value.
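The item/statement model described above could be sketched roughly as follows. The class and field names are hypothetical and much simpler than Wikidata's real data model; the description and alias strings are illustrative rather than an export from Wikidata.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One characteristic of an item: a property and a value."""
    property_id: str
    value: str


@dataclass
class Item:
    """An item with a label, a description, aliases and statements."""
    item_id: str
    label: str
    description: str
    aliases: list = field(default_factory=list)
    statements: list = field(default_factory=list)

    def values_of(self, property_id):
        """Return the values of all statements that use property_id."""
        return [s.value for s in self.statements if s.property_id == property_id]


douglas_adams = Item(
    item_id="Q42",
    label="Douglas Adams",
    description="English writer and humorist",  # illustrative text
    aliases=["Douglas Noel Adams"],              # illustrative alias
    statements=[Statement("P31", "Q5")],         # P31 = "instance of", Q5 = "human"
)
```

Calling `douglas_adams.values_of("P31")` returns the item's "instance of" values, which is the kind of lookup a schema-aware search layer would use to decide what type of thing a document is about.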

RankBrain

RankBrain is a system developed by Google that helps it better understand the likely user intent of a search query. It was rolled out in the spring of 2015.

RankBrain, at its core, can be thought of as a pre-screening system. When a query is entered into Google, the search algorithm matches the query against your intent in an effort to surface the best content, in the best format(s).

Depending on the keyword, RankBrain will increase or decrease the importance of backlinks, content freshness, content length, domain authority etc.

Then, it looks at how Google searchers interact with the new search results. If users like the new algorithm better, it stays. If not, RankBrain rolls back to the old algorithm.

Before RankBrain, Google would scan pages to see if they contained the exact keyword someone searched for. Because these keywords were sometimes brand new, Google had no clue what the searcher actually wanted. So they guessed.

For example, let’s say you searched for “the grey console developed by Sony”. Google would look for pages that contained the terms “grey”, “console”, “developed” and “Sony”, which often did not return what the searcher intended.

RankBrain works by matching never-before-seen keywords to keywords that Google has seen before. For example, Google RankBrain may have noticed that lots of people search for “grey console developed by Nintendo”. And it has learned that people who search for “grey console developed by Nintendo” want to see a set of results about gaming consoles.

So when someone searches for “the grey console developed by Sony”, RankBrain brings up similar results to the keyword it already knows (“grey console developed by Nintendo”).

So it shows results about consoles. In this case, the PlayStation.
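The idea of mapping an unseen query onto the closest query already seen can be illustrated with a toy sketch. RankBrain's actual matching is not public; token-overlap (Jaccard) similarity is swapped in here purely as a stand-in, and the function names and the `KNOWN` list are hypothetical.

```python
def tokens(query):
    """Lowercase a query and split it into a set of words."""
    return set(query.lower().split())


def jaccard(a, b):
    """Token-overlap similarity between two queries, between 0.0 and 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def closest_known_query(new_query, known_queries):
    """Return the previously seen query most similar to new_query."""
    return max(known_queries, key=lambda known: jaccard(new_query, known))


# Hypothetical log of queries the engine has already seen.
KNOWN = [
    "grey console developed by Nintendo",
    "bicycle repair shops",
    "presidents of the united states",
]

match = closest_known_query("the grey console developed by Sony", KNOWN)
```

Here the Sony query shares four of its tokens with the Nintendo query, so `match` is the Nintendo query, and its known intent (gaming consoles) can be reused for the new one.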

Ranking Search Results Based on Entity Metrics

Ranking Search Results Based On Entity Metrics is the title of a patent granted to Google in 2015. According to the patent, ranking entities for search involves considering a few factors:

This same process connects other entities with the term when we pluralize it: 'presidents of the united states'. Each of these people is an entity. These entities are associated with the entity “President” and thus, when the query is plural, we see all of them at once.

Google uses the metrics as follows: the more valuable an entity is (determined by things including links, reviews, mentions, and relevance) and the lower the value of the category or topic it’s competing in, the higher its notability. This is similar to the TF/IDF concept.

For example, let's assume a search for [best actresses].
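One simple way to express "higher entity value and lower category value give higher notability" is a ratio. This is an assumption made for illustration; the text above does not give a formula, and the function name and the input numbers are hypothetical.

```python
def notability(entity_value, category_value):
    """Ratio sketch: grows with the entity's value, shrinks as the
    category it competes in becomes more valuable (more crowded)."""
    if category_value <= 0:
        raise ValueError("category_value must be positive")
    return entity_value / category_value
```

An entity worth 50 in a category worth 10 gets a notability of 5.0, while the same entity in a category worth 100 gets 0.5. The analogy with TF/IDF is that a strong signal counts for more when it comes from a less common context.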

Google will run the query through these processes in this order:

1) Determine the relatedness of other entities and assign values.
2) Determine the notability of those entities and assign a value to each.
3) Determine the contribution metrics of these entities and assign a value.
4) Determine any prizes awarded to the entities and assign a value.
5) Determine the applicable weights each should have based on the query type.
6) Determine a final score for each possible entity.

This chain of evaluations allows a composite scoring system to give accurate results for a large variety of use cases.
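The six steps can be sketched as a weighted combination of per-entity metrics. A weighted sum is an assumption made here for illustration; the text does not say how the values are combined. All names, weights and metric values below are hypothetical.

```python
# The four metrics named in steps 1-4.
METRICS = ("relatedness", "notability", "contribution", "prize")


def final_score(metrics, weights):
    """Step 6: combine the metric values using query-type weights (step 5)."""
    return sum(weights[name] * metrics[name] for name in METRICS)


def rank_entities(candidates, weights):
    """Order candidate entity names from highest to lowest final score."""
    return sorted(
        candidates,
        key=lambda name: final_score(candidates[name], weights),
        reverse=True,
    )


# Hypothetical weights for a "best of" style query.
WEIGHTS = {"relatedness": 0.4, "notability": 0.3, "contribution": 0.2, "prize": 0.1}

# Hypothetical metric values produced by steps 1-4.
CANDIDATES = {
    "actress_a": {"relatedness": 0.9, "notability": 0.5, "contribution": 0.6, "prize": 1.0},
    "actress_b": {"relatedness": 0.7, "notability": 0.9, "contribution": 0.4, "prize": 0.0},
}

ranking = rank_entities(CANDIDATES, WEIGHTS)
```

Keeping the weights separate from the metrics means a different query type can reuse the same metric values with a different weight table.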

Question Answering Using Entity References in Unstructured Data

Since it is equally important to be able to run a relevant and accurate search over unstructured data, Google uses the following techniques to address that:

1) In a document containing unstructured content, an entity extraction process predicts the structures assumed to be present in that data.
2) Each extracted entity is assigned a unique identifier. The entity most likely being requested by a searcher can be determined by establishing which entity appears the most times in the top K results.
3) An entity database is consulted, which saves processing time for the top results each time a query is run. That database stores entities and their relations.
4) Entities are ranked by a quality score that may include freshness, previous selections by users, incoming links, and possibly outgoing links.

With these techniques, Google’s capabilities around learning about entities and their relationships become significantly stronger.
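Step 2 above, picking the entity that appears most often in the top K results, can be sketched with a counter. The input shape (one list of entity identifiers per result) and the choice to count an entity once per result are assumptions made for this example; the function name is hypothetical.

```python
from collections import Counter


def most_likely_entity(ranked_results, k):
    """Return the entity id found in the most of the top-k results.

    ranked_results is a list, best result first, where each element is
    the list of entity ids extracted from that result.
    """
    counts = Counter()
    for entity_ids in ranked_results[:k]:
        counts.update(set(entity_ids))  # count each entity once per result
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical extraction output for four ranked results.
RESULTS = [["E1", "E2"], ["E1"], ["E3", "E1"], ["E2"]]
winner = most_likely_entity(RESULTS, k=3)
```

In this data `E1` occurs in all three of the top results, so it is chosen. Ties are not handled specially in this sketch.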

Related Entities

The 'entities database' also stores relationships for each entity. These relationships are weighted by a formula based on former search requests and on how common they are in the data itself.

This entities/relations concept allows for:

Google's 'entities database', AKA the Knowledge Graph, is a massive database of public information. It collects information considered public domain and the properties of each entity (people with birthdays, siblings, parents, occupations, etc.).

Using the Entity Category association

Determining the categories a query belongs to may include generating a score based on:

When calculating the search categories that best describe the search intention, a scoring metric must be selected and revised according to search result feedback.

Once categories are selected - the centroid of that category can be used as reference for inferring entities and links that may be relevant for the results.
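A minimal sketch of using a category centroid as a reference point, assuming each entity is represented by a numeric vector (an embedding). The vector representation, the use of cosine similarity, and all names and values are assumptions, not something the text above specifies.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_centroid(category_vectors, candidates):
    """Order candidate names by similarity to the category centroid."""
    center = centroid(category_vectors)
    return sorted(
        candidates,
        key=lambda name: cosine(candidates[name], center),
        reverse=True,
    )
```

For example, `rank_by_centroid([[1, 0], [1, 0.2]], {"near": [1, 0.1], "far": [0, 1]})` places `"near"` first, because its vector points in almost the same direction as the centroid.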

Semantic Search

BERT

In 2019 Google introduced BERT (Bidirectional Encoder Representations from Transformers) into Search. This AI engine focuses on further understanding intent and conversational search context.

BERT allows users to more easily find valuable and accurate information. Semantic search has also evolved in large part due to the rise of voice search.

--- TODO --- add relevant references for usage of BERT


Basic Building Block for Better search results

In OpenSearch we are in a constant effort to improve the relevancy and accuracy of search results. This is especially important because a vast number of the engine's users store unstructured data for a variety of domains and use cases.

Our goal is for each and every search to come as close as possible to the customer's intent, and doing so requires addressing the concepts mentioned in this paper.

Support and Maintain a high Level Schema Structure

A modern search engine must be capable of maintaining the customer's (domain-related) schema structure. It makes no difference if the data was unstructured to begin with - everyone expects the search results to be the most relevant.

Steps for Adding Schema related search relevancy capabilities:

1) Adding the Simple Schema to OpenSearch will allow explicitly preserving the customer's domain knowledge. It will also be very valuable if the engine can already perform some structure-related tasks during the data ingestion phase.

6) Explainability - In recent years, AI has increasingly found its way from research labs into applications: from the recommendation systems used by online retailers to image recognition on social networks and, most relevant here, search engines.

As we work with AI and rely on AI for more and more decision-making processes that influence our daily actions, issues around user understanding of such processes have garnered attention.

One of our main goals at OpenSearch is for search engineers and customers to understand and trust the search algorithm (which has AI incorporated inside), increasing user satisfaction and enabling transparency of AI-related decision-making.

We believe that search explainability is a first-class citizen that deserves our full attention. Every search should have a clear and concise way of explaining itself to the user (whether it's a search engineer or a customer).

This is why our goal will be the integration of the explainability notion into every part and section of the search engine's decision-making steps.

It is equally important to understand how and why search results were returned for a given query, in addition to evaluating the actual results.

YANG-DB commented 2 years ago

Closing this issue - this is a proposal that will be realized in the simple schema repository https://github.com/opensearch-project/simple-schema