Search Relevancy - A Schema Perspective

This document presents the key concepts behind the following goals:

Gain an understanding of industry-leading search relevancy patterns
Review how Google's modern search engine uses schema to optimize search
Discuss the building blocks required for a search relevancy framework

Introduction

A review of Google Search, the world's most popular search engine and a part of our everyday activities, makes it apparent that a search is conducted using many notions beyond the basic 'phrase' one is querying for.

Let us review these notions to get a better understanding of how Google makes its search both relevant and accurate:

Entities

Google describes an entity, or named entity, as a single, well-defined thing or concept. Google Search works in three stages, and not all pages make it through each stage.

Crawling: Google downloads text, images, and videos from pages it found on the internet with automated programs called crawlers.
Indexing: Google analyzes the text, images, and video files on the page, and stores the information in the Google index, which is a large database.
Serving search results: When a user searches on Google, Google returns information that's relevant to the user's query.

When a user enters a query, Google searches the index for matching pages and returns the results that are the highest quality and most relevant to the user.

Relevancy is determined by hundreds of factors, which could include information such as the user's location, language, and device (desktop or phone).

For example:

_searching for "bicycle repair shops" would show different results to a user in Paris than it would to a user in Hong Kong._

Entities make up our everyday world and our spoken language. We talk and think in terms of things and entities. To reflect this, Google (and many other companies) has turned to the knowledge base domain for assistance.

Common Entities

One of the core mechanisms they use is entity recognition. If Google understands that a query contains the same entities as another query it has seen before, with few qualifiers, that is an indication that the result sets may be identical or highly similar.
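The comparison described above can be sketched as follows. This is a minimal illustration, not Google's method: the `KNOWN_ENTITIES` lookup table, the entity IDs, and both function names are hypothetical, and a real system would use a trained entity recognizer rather than substring matching.

```python
from __future__ import annotations

# Toy lookup table mapping surface forms to entity IDs (hypothetical data).
KNOWN_ENTITIES = {
    "eiffel tower": "ent:eiffel_tower",
    "paris": "ent:paris",
    "hong kong": "ent:hong_kong",
}


def extract_entities(query: str) -> set[str]:
    """Return the IDs of all known entities whose surface form occurs in the query."""
    text = query.lower()
    return {eid for surface, eid in KNOWN_ENTITIES.items() if surface in text}


def same_entities(query_a: str, query_b: str) -> bool:
    """True when both queries mention exactly the same non-empty set of entities."""
    a, b = extract_entities(query_a), extract_entities(query_b)
    return bool(a) and a == b
```

Two queries for which `same_entities` returns true are candidates for sharing a result set; queries with no recognized entities are never treated as equivalent.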

Standardization of the domain knowledge

- https://schema.org/docs/schemas.html

Schema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.

The Schema.org vocabulary can be used with many different encodings. It covers entities, relationships between entities, and actions, and it can easily be extended through a well-documented extension model. Over 10 million sites use Schema.org to mark up their web pages and email messages.

Many applications from Google, Microsoft, Pinterest, Yandex and others already use these vocabularies to power rich, extensible experiences.

Schema.org was founded by Google, Microsoft, Yahoo and Yandex. Its vocabularies are developed through an open community process.

Schema.org is defined as two hierarchies:

One for textual property values
One for the things that they describe.

The main schema.org hierarchy is a collection of types (or "classes"), each of which has one or more parent types.
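As an illustration of a type with more than one parent, the sketch below encodes a small excerpt of the hierarchy, in which `LocalBusiness` descends from both `Organization` and `Place`. The `PARENTS` table and the `ancestors` helper are illustrative names, not part of any Schema.org tooling.

```python
from __future__ import annotations

# Small excerpt of the schema.org type hierarchy: type -> direct parent types.
PARENTS = {
    "Thing": [],
    "Organization": ["Thing"],
    "Place": ["Thing"],
    "LocalBusiness": ["Organization", "Place"],
}


def ancestors(type_name: str) -> set[str]:
    """Collect every direct and indirect parent of a type."""
    found: set[str] = set()
    stack = list(PARENTS[type_name])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS[parent])
    return found
```

A property defined on any ancestor (for example, `name` on `Thing`) is available on the descendant type, which is what makes the hierarchy useful for search over mixed data.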

- https://www.wikidata.org/wiki/Wikidata:Database_reports/EntitySchema_directory

Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.

The focus of Wikidata is structured data. Structured data refers to data that has been organized and is stored in a defined way, often with the intention to encode meaning and preserve the relationships between different data points within a dataset.

The Wikidata repository consists mainly of items, each one having a label, a description and any number of aliases. Statements describe detailed characteristics of an Item and consist of a property and a value.
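The item structure described above might be modeled as in the sketch below. The class and field names are assumptions made for illustration and do not mirror Wikidata's JSON format; the identifiers `Q42` (Douglas Adams), `P31` (instance of) and `Q5` (human) are real Wikidata IDs, while the description and alias text are paraphrased.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Statement:
    property_id: str  # e.g. "P31" (instance of)
    value: str        # e.g. "Q5" (human)


@dataclass
class Item:
    item_id: str
    label: str
    description: str
    aliases: list[str] = field(default_factory=list)
    statements: list[Statement] = field(default_factory=list)

    def values_of(self, property_id: str) -> list[str]:
        """Return the values of every statement that uses the given property."""
        return [s.value for s in self.statements if s.property_id == property_id]


douglas_adams = Item(
    item_id="Q42",
    label="Douglas Adams",
    description="English writer and humorist",
    aliases=["Douglas Noel Adams"],
    statements=[Statement("P31", "Q5")],
)
```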

RankBrain

RankBrain is a system developed by Google that helps it better understand the likely user intent of a search query. It was rolled out in the spring of 2015.

RankBrain, at its core, can be thought of as a pre-screening system. When a query is entered into Google, the search algorithm matches the query against your intent in an effort to surface the best content, in the best format(s).

Depending on the keyword, RankBrain will increase or decrease the importance of backlinks, content freshness, content length, domain authority etc.

Then, it looks at how Google searchers interact with the new search results. If users like the new algorithm better, it stays. If not, RankBrain rolls back to the old algorithm.

Before RankBrain, Google would scan pages to see if they contained the exact keyword someone searched for. Because these keywords were sometimes brand new, Google had no clue what the searcher actually wanted. So they guessed.

For example, let’s say you searched for “the grey console developed by Sony”. Google would look for pages that contained the terms “grey”, “console”, “developed” and “Sony”, which often did not produce the results the searcher intended.

RankBrain works by matching never-before-seen keywords to keywords that Google has seen before. For example, RankBrain may have noticed that lots of people search for “grey console developed by Nintendo”, and learned that people who search for that phrase want to see a set of results about gaming consoles.

So when someone searches for “the grey console developed by Sony”, RankBrain brings up similar results to the keyword it already knows (“grey console developed by Nintendo”).

So it shows results about consoles. In this case, the PlayStation.
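The matching step can be illustrated with a deliberately simple stand-in. The sketch below uses word overlap (Jaccard similarity) to find the closest previously seen query. This is an assumption chosen for brevity: RankBrain itself is a learned system, and nothing here reflects its actual implementation. All function names are hypothetical.

```python
from __future__ import annotations


def tokens(query: str) -> set[str]:
    """Lower-case the query and split it into a set of words."""
    return set(query.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of distinct words that two token sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_known_query(new_query: str, known_queries: list[str]) -> str:
    """Return the previously seen query with the highest word overlap."""
    new_tokens = tokens(new_query)
    return max(known_queries, key=lambda q: jaccard(new_tokens, tokens(q)))


known = ["grey console developed by Nintendo", "bicycle repair shops"]
match = closest_known_query("the grey console developed by Sony", known)
```

Here `match` is the Nintendo query, so the results already known to satisfy it (pages about gaming consoles) can seed the results for the new Sony query.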

Ranking Search Results Based on Entity Metrics

“Ranking Search Results Based on Entity Metrics” is the title of a patent granted to Google in 2015. According to the patent, ranking entities for search involves considering a few factors:

Relatedness: Relatedness is determined based on the co-occurrence of entities. In practice, if two entities are frequently referenced together on the web (for example, “Donald Trump” and “President”), a query such as 'president of the united states' returns a single result. This is because they appear together frequently enough, and on authoritative enough properties.

This same process connects other entities with the term when we pluralize it: 'presidents of the united states'. Each of these people is an entity. These entities are associated with the entity “President”, so when the query is plural we see all of them at once.

Notability: Google uses this metric as follows: the more valuable an entity is (determined by things including links, reviews, mentions, and relevance) and the lower the value of the category or topic it’s competing in, the higher its notability. This is similar to the TF/IDF concept.

Contribution: Contribution is determined by external signals (e.g., links, reviews) and is basically a measure of an entity’s contribution to a topic. A review from a well-established and respected food critic would add to this metric more than Dave’s rant on Yelp about the price, because the critic's entity contribution in the space is higher.
Prizes: The prize metric is exactly what it sounds like: a measure of the various relevant prizes an entity (a person, for that matter) has received. These could be a Nobel Prize, an Oscar, or a U.S. Search Award. The type of prize determines its weight, and the larger the prize, the higher the value attached to the entity in question.

For example, let's assume a search for [best actresses].

Google will run the query through these processes in this order:

1) Determine the relatedness of other entities and assign values.
2) Determine the notability of those entities and assign a value to each.
3) Determine the contribution metrics of these entities and assign a value.
4) Determine any prizes awarded to the entities and assign a value.
5) Determine the applicable weights each should have based on the query type.
6) Determine a final score for each possible entity.

This chain of evaluations allows a composite scoring system to give accurate results for a large variety of use cases.
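The last two steps, choosing weights by query type and producing a final score, can be sketched as a weighted sum. The linear combination is an assumption made for illustration, since the text above does not say how the patent combines the metrics. The `EntityMetrics` class, both functions, and all numeric values are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EntityMetrics:
    relatedness: float
    notability: float
    contribution: float
    prizes: float


def composite_score(m: EntityMetrics, weights: dict[str, float]) -> float:
    """Weighted sum of the four metrics; the weights depend on the query type."""
    return (
        weights["relatedness"] * m.relatedness
        + weights["notability"] * m.notability
        + weights["contribution"] * m.contribution
        + weights["prizes"] * m.prizes
    )


def rank(candidates: dict[str, EntityMetrics], weights: dict[str, float]) -> list[str]:
    """Order entity names from highest to lowest composite score."""
    return sorted(
        candidates,
        key=lambda name: composite_score(candidates[name], weights),
        reverse=True,
    )
```

For a query such as [best actresses], the weight on `prizes` would plausibly be raised, so that an entity with many awards outranks one that is merely mentioned more often alongside the query terms.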

Question Answering Using Entity References in Unstructured Data

Since it is equally important to be able to search unstructured data relevantly and accurately, Google uses the following techniques:

1) In a document containing unstructured content, an entity extraction process predicts the assumed structures in that data.
2) Each extracted entity is assigned a unique identifier. The entity a searcher is most likely requesting can be determined by establishing which entity appears the most times in the top K results.
3) An entity database, which stores entities and their relations, is consulted to save processing time for the top results each time a query is run.
4) Entities are ranked by a quality score that may include freshness, previous selections by users, incoming links, and possibly outgoing links. When a query for an entity is conducted, the relevance of other entities is determined for the result.
5) Context inference is applied to multiple entities with the same name. For example, there is Philadelphia the city, the cream cheese, and the movie. A “where” question refers to the city, “who acted in” refers to the movie, and “what goes good with” refers to the food.

With these techniques, Google’s ability to learn about entities and their relationships becomes significantly stronger.
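Two of the techniques listed above lend themselves to a short sketch: picking the entity that occurs most often in the top K results, and using question cues to separate entities that share a name. The entity IDs, the `CONTEXT_CUES` table, and the function names are hypothetical, and simple phrase matching stands in for whatever inference Google actually performs.

```python
from __future__ import annotations

from collections import Counter


def most_likely_entity(top_k_results: list[list[str]]) -> str | None:
    """Return the entity ID that occurs most often across the top K results."""
    counts = Counter(eid for result in top_k_results for eid in result)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical cue phrases that disambiguate entities named "Philadelphia".
CONTEXT_CUES = {
    "where": "philadelphia_city",
    "who acted in": "philadelphia_movie",
    "goes good with": "philadelphia_cream_cheese",
}


def infer_by_context(question: str) -> str | None:
    """Pick the entity whose cue phrase appears in the question, if any."""
    text = question.lower()
    for cue, entity_id in CONTEXT_CUES.items():
        if cue in text:
            return entity_id
    return None
```

Each inner list passed to `most_likely_entity` holds the entity IDs extracted from one result document, so an entity mentioned in many of the top results wins.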

Related Entities

The 'entities database' also stores relationships for each entity. These relationships are weighted by a formula based on former search requests and their commonality in the data itself.

This entities/relations concept allows for:

The ability to calculate the probability of meeting the user’s likely intent with far greater accuracy.
The ability to predict and evolve an entity over time using past knowledge and well-defined schematic structures.

Google's 'entities database', also known as the Knowledge Graph, is a massive database of public information. It collects information considered public domain and the properties of each entity (people with birthdays, siblings, parents, occupations, etc.).
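A weighted relationship store of this kind might look like the sketch below. The `RELATIONS` table, its weights, and the `related_entities` helper are all hypothetical; in practice the weights would be derived from past search requests and co-occurrence in the data, as described above.

```python
from __future__ import annotations

# entity -> {related entity: relationship weight}; all values are hypothetical.
RELATIONS: dict[str, dict[str, float]] = {
    "philadelphia_city": {
        "pennsylvania": 0.9,
        "liberty_bell": 0.7,
        "cream_cheese": 0.1,
    },
}


def related_entities(entity: str, min_weight: float = 0.5) -> list[tuple[str, float]]:
    """Return related entities at or above min_weight, strongest first."""
    neighbours = RELATIONS.get(entity, {})
    strong = [(name, w) for name, w in neighbours.items() if w >= min_weight]
    return sorted(strong, key=lambda pair: pair[1], reverse=True)
```

The threshold keeps weak, incidental relationships out of the result, so only strongly related entities influence what is shown for a query.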

Using the Entity Category association

Determining the categories a query belongs to may include generating a score based on:

Whether the query includes terms associated with the category
How many of the entities in the query are associated with the category

When calculating the search categories that best describe the search intention, a scoring metric must be selected and revised according to search results feedback.

Once categories are selected, the centroid of that category can be used as a reference for inferring entities and links that may be relevant for the results.
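The two scoring signals and the centroid can be sketched as follows. The equal blend of term share and entity share, the `term_weight` parameter, and the representation of category members as numeric vectors are all assumptions made for illustration.

```python
from __future__ import annotations


def category_score(
    query_terms: set[str],
    query_entities: set[str],
    category_terms: set[str],
    category_entities: set[str],
    term_weight: float = 0.5,
) -> float:
    """Blend the share of query terms and query entities tied to a category."""
    term_share = (
        len(query_terms & category_terms) / len(query_terms) if query_terms else 0.0
    )
    entity_share = (
        len(query_entities & category_entities) / len(query_entities)
        if query_entities
        else 0.0
    )
    return term_weight * term_share + (1 - term_weight) * entity_share


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of equally sized vectors (the category's centre)."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]
```

Entities or links whose vectors lie close to a selected category's centroid are natural candidates for inclusion in the results, and `term_weight` is the kind of parameter that would be revised based on search feedback.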

Semantic Search

BERT

In 2019, Google began using BERT (Bidirectional Encoder Representations from Transformers) in Search. This AI model focuses on further understanding intent and conversational search context.

BERT allows users to more easily find valuable and accurate information. Semantic search has also evolved due in large part to the rise of voice search.

--- TODO --- add relevant references for usage of BERT

Basic Building Block for Better search results

At OpenSearch, we are in a constant effort to improve the relevancy and accuracy of search results. This is especially important because a vast number of the engine's users store unstructured data for a variety of domains and use cases.

Our goal is to allow each and every search to come as close as possible to the customer's intent, and doing so requires addressing the concepts mentioned in this paper.

Support and Maintain a high Level Schema Structure

A modern search engine must be capable of maintaining the customer's (domain-related) schema structure. It makes no difference if the data was unstructured to begin with: everyone expects the search results to be the most relevant.

Steps for Adding Schema related search relevancy capabilities:

1) Adding the Simple Schema to OpenSearch will allow explicitly preserving the customer's domain knowledge. It will also be very valuable if the engine can already perform some structure-related tasks during the data ingestion phase.

Enabling code / template / index generation from the domain-specific schema will give customers an additional explicit capability to write domain-related code that helps them develop their applications seamlessly.

2) Using the industry-standard GraphQL API and SDL will let our users easily integrate with many open-source GraphQL-compliant tools and simplify the development process.

3) Using a schema allows developers to build the ingestion process using domain-related vocabulary and easily define business rules in their language of choice.

4) The PPL and SQL languages used today in OpenSearch will make significant use of the schematic knowledge of the data to simplify the construction and validation of queries.

5) Creating a Domain Knowledge Graph containing the relationships between the domain entities will allow calculating a score based on the schema relationships that appear in the search.
- represent the schema, with its entities and relationships, inside a unified logical layer that can be queried and can evolve as the data/business evolves
  - allow construction of ready-made reports and dashboards that are described purely in business vocabulary.
  - simplify graph-based machine learning techniques to organize the unstructured data and give better search predictions based on relations and entities.

6) Explainability: In recent years, AI has increasingly found its way from research labs into applications, from the recommendation systems used by online retailers to image recognition on social networks and, as discussed here, search engines.

As we work with AI and rely on AI for more and more decision-making processes that influence our daily actions, issues around user understanding of such processes have garnered attention.

One of our main goals at OpenSearch is for search engineers and customers to understand and trust the search algorithm (which has AI incorporated inside), increasing user satisfaction and enabling transparency of AI-related decision-making.

We believe that search explainability is a first-class citizen that deserves our full attention. Every search should have a clear and concise way of explaining itself to the user (whether it's a search engineer or a customer).

This is why our goal is to integrate the notion of explainability into every part of the search engine's decision-making steps.

Understanding how and why search results were returned for a given query is as important as evaluating the actual results.
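One existing building block is the `explain` option of the OpenSearch search API, which attaches a nested `_explanation` object (with `value`, `description` and `details` fields) to each hit. The sketch below turns such a tree into indented, readable lines. The `flatten_explanation` helper is hypothetical, and the sample tree is made up and uses simplified descriptions, so it is not real engine output.

```python
from __future__ import annotations


def flatten_explanation(node: dict, depth: int = 0) -> list[str]:
    """Turn a nested explanation tree into indented, human-readable lines."""
    lines = [f"{'  ' * depth}{node['value']:.2f}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(flatten_explanation(child, depth + 1))
    return lines


# Illustrative tree in the shape of an `_explanation` object; numbers are made up.
sample = {
    "value": 1.5,
    "description": "sum of:",
    "details": [
        {"value": 1.0, "description": "weight(title:bicycle)", "details": []},
        {"value": 0.5, "description": "weight(body:repair)", "details": []},
    ],
}

report = "\n".join(flatten_explanation(sample))
```

A schema-aware engine could extend the same idea by adding entries for entity and relationship contributions, so that each decision step described in this document reports how it affected the final score.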

opensearch-project / search-processor

[RFC] Search Relevancy - from A Schema Perspective #8