thevermeer / pg_ts_semantic_headline

Improved highlighting of text search content for multi-word phrases and eliminating partial matches; Includes a methodology for delivering highlighted search result content 5x-10x faster that the PostgreSQL built-in ts_headline function
GNU General Public License v3.0
5 stars 0 forks source link

PostgreSQL Full-text Search and Semantic ts_headline functionality

Abstract

We explore a series of methods for improving the way that postgreSQL's ts_headline function resepects the semantics of phrase matching in TSQueries.

In the process, we uncover a method for replacing ts_headline that, when implemented using pre-computed columns, performs 5-10 times faster than the built-in function.

The end result is a PostgreSQL extension that can either directly replace ts_headline, providing full phrase highlighting and improved TSQuery semantics, or, when pairing with a pre-realized lookup table, can return highlighted content up to 10x faster than ts_headline.

Prerequisites

Installation

This project is in development February/April of 2024. Before we have a stable release version, if you wish to run the extension:

1) Clone this repository.

2) cd into the project directory /postgresql_semantic_tsheadline

3) Run make && make install :

4) In your target Postgres database, run CREATE EXTENSION TS_SEMANTIC_HEADLINE;

How to use TS_SEMANTIC_HEADLINE

How to use this extension will largely depend on your desired outcome, as this extension provides both:

That is, do you want to just improve the phrase highlighting of ts_headline, or do you ALSO want to dramatically spped up the recall of those highlighted search results?

How to highlight multi-word phrases and NOT highlight partial matches (ie. Replace TS_HEADLINE)

In order to simply replace the existing ts_headline function with a highlighter that matches entire, multi-word phrases and does NOT highlight partial terms, we can simply substitute TS_SEMANTIC_HEADLINE in for ts_headline - TS_SEMANTIC_HEADLINE comes in a overloaded to have the same method signature as ts_headline:

TS_SEMANTIC_HEADLINE ([config REGCONFIG,] content TEXT, user_search TSQUERY [, options TEXT])

The internals of the options behave a little differently than with ts_headline, in part due to our returning complete phrases highlighted instead of single words, in part due to making sacrifices in the depth of search to create a better perfoming feature.

How to implement content retrieval that is ~5x-10x faster than ts_headline

TS_FAST_HEADLINE is much faster than the built-in ts_headline because it can make use of pre-computed columns. Internally, the system uses a TSVector as a lookup index and a TEXT[] array as a recall vector. Where the TSVector is a word-stemmed index for fuzzy search, the TEXT[] array is used for fast recall of the original, exact text. The schema pattern that allows us to greatly improve the performance of both the search and recall, by trading off a significant reducion in computationally expensive operations, for an increased requirement for disk space usage.

The fastest approach required the realization of 2 (two) separate columns to any content lookup table; one of type TSPVECTOR (a tsvector with rules) and a TEXT[] array, like so:

CREATE TABLE my_content_index (
    id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    ... other columns ...
    content_tsv TSPVECTOR,
    content_arr text[]
);

CREATE UNIQUE INDEX files_pkey ON files(id int4_ops);
CREATE INDEX my_content_index_content_tsv_gin_idx ON files USING GIN (content_tsv tsvector_ops);

With those 2 columns in place, we can now query the table and deliver headlines with highlights much faster, by using:

TS_FAST_HEADLINE ([config REGCONFIG,] content_arr TEXT[], content_tspv TSPVECTOR, user_search TSPQUERY [, options TEXT])

As an example, we can now produce fast headlines like so:

SELECT TS_FAST_HEADLINE('english', content_arr, content_tsv, TO_TSPQUERY('user<->phrase & search | term')) FROM my_content_index; 

The options argument can be used to fine-tune tradeoffs in performance v. accuracy.

Note: TS_FAST_HEADLINE uses the pre-processed TSPVector and TSPQuery types

See below, in the usage section for more information on how to form pre-preprocessed queries and vectors for using TS_FAST_HEADLINE and TSP_QUERY_MATCHES

Purpose

The original purpose of this repository is to document some issues encountered with using PostgreSQL full-text search to display highlighted search results to the user, and propose a solution that is expressed firstly as PGSQL user-defined functions (UDFs). The goal of creating this functionality is to demonstrate the value to the PostgreSQL community of correcting the ts_headline function to better reflect the actual semantics of the full-text search operators used for index lookup.

The emergence of performance wins came as an end in this investigation, but that was not clear at the outset. The actual the goal is to introduce better ts_headline semantics into the postgresql source code, but that is a different project althoether. To get there, let's first outline the current issues and create a few UDFs to address the issue and illuminate the value of the improvements.

Preamble

I am a full-stack developer, building a web application that has stored a large volume of texts, as rows in a postgreSQL data table. The UI we are building performs full-text search by accepting a user-inputted string, searching, and displaying results. When we display results, we want to display passages from the text with "highlights", emphasizing the portions of the passage that match our search.

The user has at their disposal the ability to search a single word or a multi-word phrase; they can join phrases together with AND/OR operators, group logical expressions with brackets and negate the presence of certain words with NOT. [pgsql Full-text search - 12.1.2 matching]

In the database, we have created a table to store the text contents of a large volume of files. For each tuple in the "files" table, we have a file ID, a text field, and a TS Vector column that is a pre-realized, because adhoc realization of TSVector of a large table of text is very memory-(and time-)intensive, and likewise such that we can create a GIN index of the TSVector column for fast lookup. This is a recommended pattern for large-scale full-text search in PostgreSQL. [pgsql Full-text Search - 12.2.2 Creating Indexes]

Problems with ts_headline

As a developer, the built-in ts_headline offers a number of quirks and gotchas, and at the highest level, it is fair to say that the internals to ts_headline do not abide by the intended meaning of full-text operators. Specifically when the user has inputted a multi-word phrase, we find that ts_headline will return only partial matches, and only highlight single words from the phrase.

1. ts_headline returns passages that do NOT contain the searched phrase

Approach: [Guarantee that a TS Headline conforms to a ts query phrase]

2. How can I convert a fuzzy full-text search into the exact phrase matches in a document

Approach: [Produce the exact content that pgsql matches on a fuzzy search]

3. ts_headline is very slow. Our solution to Problem 2 is slower.

Approach: [An approach to content hihglighting that is up to 10 times faster than built-in ts_headline]

4. ts_headline only highlights single words for multi-word phrase queries

Approach: [Headline function that highlights phrases without partial matches]

Solution

The overall solution to all 4 of the above problems, derived in the various Approach sections above comes together, both from the implementation of a table with 2 pre-computed columns for more efficiecnt search AND content retrieval. Therefore, our solution consists of:

Core Functions

TSQuery, TSPQuery and Query Formation

TSVector, TSPVector and Lookup Index Formation

PostgreSQL Text Search Type Coersion

Fast-Lookup Table (Optional)

The required columns are of types TSPVector and TEXT[]:

CREATE TABLE my_content_index (
    id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    ... other columns ...
    content_tspv TSPVECTOR,
    content_arr  text[]
);

CREATE UNIQUE INDEX files_pkey ON files(id int4_ops);
CREATE INDEX my_content_index_content_tspv_gin_idx ON files USING GIN (content_tspv tsvector_ops);

Text Treatment for Positional Indexing, etc

Outcomes

Highlighting Entire Phrases in a TSQuery

The TS_SEMANTIC_HEADLINE function will highlight matching, multi-word phrase patterns in our source text. Compare that, side-by-side to the built-in ts_headline:

SELECT 
TS_SEMANTIC_HEADLINE('I can highlight search results as phrases, and not just single terms', TO_TSPQUERY('search<3>phrases')),
ts_headline('I cannot highlight search results as phrases, and only single terms', TO_TSPQUERY('search<3>phrases'));
ts_semantic_headline ts_headline
I can highlight \<b>search results as phrases,\</b> and not just single terms I cannot highlight \<b>search\</b> results as \<b>phrases\</b>, and only single terms

Partial Matches are NOT highlighted

The built-in ts_headline function will not respect the phrase operators within a TSQuery, and will highlight single words and partially matching terms. TS_SEMANTIC_HEADLINE enforces the notion that all highlighted matches' content will abide the semantics of the TSQuery:

SELECT 
TS_SEMANTIC_HEADLINE('phrase matches are highlighted, partial matches are not', TO_TSPQUERY('phrase<->match')),
ts_headline('phrase matches are highlighted, partial matches are as well', TO_TSPQUERY('phrase<->match'));
ts_semantic_headline ts_headline
\<b>phrase matches\</b> are highlighted, partial matches are not \<b>phrase\</b> \<b>matches\</b> are highlighted, partial \<b>matches\</b> are as well

Performs 5x-10x Faster than ts_headline (with pre-computed TSVector and TEXT[] columns)

We identified that full-text content recall is slow to do adhoc, because the retrieval requires manipulating and slicing large strings. At the same time, the built-in ts_headline function performs this reduction of text on every pass. If we have a table with a pre-computed TSVector of the source text AND a pre-computed array of words as they appear in the TSVector (delimited by spaces in the source), we can radically improve the performance to content recall in postgreSQL full-text search

Though not required, if we pre-compute both the lookup index (TSVector) and the recall array (TEXT[]), we can perform semantically-correct content highlighting and recall roughly 10 times faster than using the built-in ts_headline function.

Performing our search-and-recall across 100 documents, each with nearly the maximum number of words permitted in a TSVector, we see:

EXPLAIN ANALYZE SELECT TS_FAST_HEADLINE(content_arr, content_tsv, TO_TSPQUERY('best<2>time')) FROM files;
QUERY PLAN
Seq Scan on files (cost=0.00..53.00 rows=100 width=32) (actual time=8.646..793.117 rows=100 loops=1)
Planning Time: 0.117 ms
Execution Time: 793.227 ms

Compared to the built-in ts_headline function performing the same task:

EXPLAIN ANALYZE SELECT ts_headline(content, TO_TSPQUERY('best<2>time')) FROM files;
QUERY PLAN
Seq Scan on files (cost=0.00..53.00 rows=100 width=32) (actual time=60.714..5724.190 rows=100 loops=1)
Planning Time: 0.115 ms
Execution Time: 5724.348 ms

Improves ts_headline semantics without additional indices or pre-computed columns

If you do not want to pre-compute (or re-compute) a TSVector for search, or realizing a TEXT[] array of a long string seems too expensive, we also have a flavour of TS_SEMANTIC_HEADLINE that has the same method signature as the built-in ts_headline function, and uses function that under-the-hood to perform semantically-correct content highlighting of phrases, highlighting multi-word phrases and eliminating partial matches.

Usage

Parsing Documents

In order to perform fast content retrieval, we are going to treat our source text such that the positions within our language-stemmed TSVector will align with the word positions in our TEXT[] array of words. In order to do that, we will use the TSP_INDEXABLE_TEXT function, which accpets TEXT, cleans the string of special-character delimited words, and returns text:

TSP_INDEXABLE_TEXT(TEXT) RETURNS TEXT
As an example, SELECT TSP_INDEXABLE_TEXT('hyphen-delimited and other.such~terms are treated'); will return: ts_prepare_text_for_tsvector
hyphen- delimited and other. such~ terms are treated

Note that the  (Bell Character + SPACE) have been inserted to break apart special character-delimited terms, in order to maintain 'word' positionality between the TSVector and content arrays.

With that, in order to render a conformant TSVector, we will run:

SELECT TO_TSVECTOR(TSP_INDEXABLE_TEXT('our content to index'));

In order to generate our precomputed TEXT[] array, we will do much the same:

SELECT regexp_split_to_array(TSP_INDEXABLE_TEXT('our content to index'), '[\s]+');

Finally, if we are realizing the TSVector and TEXT[] array to a lookup table, we can do both, in one step, in our trigger function:

CREATE OR REPLACE FUNCTION trg_update_content_tsv_and_arr()
RETURNS TRIGGER AS $$
BEGIN
    NEW.content_tsv := TO_TSPVECTOR(NEW.content);   
    NEW.content_arr := TO_TSP_TEXT_ARRAY (NEW.content, '[\s]+');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

Parsing Queries

The preparations that we make on the haystack TSVector also need to be made to the TSQuery.

Given the above section on Parsing Documents, for the purposes of our TSQuery, we are going to have to replace all special characters, as we do above, however, we will have to replace the characters with the TSQuery distance=1 operator, <1>. This will match the pattern of the Bell Character+SPACE inserted into the haystack, where the Bell Character is ignored when the TSVector is parsed and term is treated as 2 words. In terms of worldwide languages, this is probably a terrible idea and needs to be rethought out. For now, the goal is fast content retrieval, but do keep this in mind: Not all languages have space-delimited words, splitting on special; characters can either change or nullify meaning, amongst many other factors.

With all of that said, we will treat the needle with the same preparaions as we do the haystack, but are required to preserve the logical special characters inside of the query.

To render a TSQuery with compatible string preparations:

TO_TSPQUERY ([ config regconfig, ] query_string TEXT) RETURNS TSQUERY

Consider

SELECT 
TO_TSPQUERY( 'seek-ing<2>find.ing<1>the<1>needle<3>in-fix'),
to_tsquery('seek-ing<2>find.ing<1>the<1>needle<3>in-fix');
ts_prepare_query to_tsquery
'seek' \<-> 'ing' \<2> 'find' \<-> 'ing' \<2> 'needl' \<4> 'fix' 'seek-' \<-> 'seek' \<-> 'ing' \<2> 'find.ing' \<2> 'needl' \<3> ( 'in-fix' \<2> 'fix' )

For rendering phrases to a TSQuery, we also have a replacement for the built-in phraseto_tsvector function.

PHRASETO_TSPQUERY ([ config regconfig, ] query_string TEXT) RETURNS TSQUERY

Consider again:

SELECT 
PHRASETO_TSPQUERY('seek-ing find.ing the needle in-fix'),
PHRASETO_TSQUERY('seek-ing find.ing the needle in-fix');
phraseto_tspquery phraseto_tsquery
'seek' \<-> 'ing' \<-> 'find' \<-> 'ing' \<2> 'needl' \<2> 'fix' 'seek-' \<-> 'seek' \<-> 'ing' \<-> 'find.ing' \<2> 'needl' \<-> 'in-fix' \<2> 'fix'

TODO : Implement websearch_to_tsquery

Highlighting Search Results

To present search results, it is ideal to show a part of each document and how it is related to the query. Usually, search engines show fragments of the document with marked search terms. PostgreSQL provides a function ts_headline that implements this functionality, however the built-in ts_headline function does not respect the semantics of phrase operators in TSQueries, and it is very computationally expensive.

In order to correct these two pressing issues, this extension offers two (2) separate approaches to replacing ts_headline with TS_SEMANTIC_HEADLINE OR ts_fast_headline:

TS_SEMANTIC_HEADLINE([ config REGCONFIG, ] document TEXT, query TSQUERY [, options TEXT ]) returns TEXT

ts_semantic_headline accepts a document along with a query, and returns an excerpt from the document in which terms from the query are highlighted. Specifically, the function will use the query to select relevant text fragments, and then highlight the phrases that match the subconditions of the query. Where ts_headline will highlight terms even if those word positions do not match the query's restrictions, TS_SEMANTIC_HEADLINE will highlight the multi-word phrase patterns, in the form that they are matching in search. The configuration to be used to parse the document can be specified by config; if config is omitted, the default_text_search_config configuration is used.

For example

SELECT TS_SEMANTIC_HEADLINE
       ('english',

        'The most common type of search is to find all documents containing given query-terms 
         and return them in order of their similarity to the query.',

        TO_TSPQUERY('english', 'query-terms & similarity<3>query'));
tsp_semantic_headline
\<b>query-terms\</b>and return them in order of their \<b>similarity to the query\</b>
TS_FAST_HEADLINE([ config REGCONFIG, ] content_arr TEXT[], content_tspv TSPVECTOR, query TSPQUERY [, options TEXT ]) returns TEXT

ts_fast_headline accepts a pre-computed, positionally parsable content_arr TEXT[] array and a valid TSPVECTOR. Given a content TEXT string of the original document content, the assumptions are:

The options, as implemented, do not match 1:1 with the same options for ts_headline, and the differences should be noted below. If an options string is specified it must consist of a comma-separated list of one or more option=value pairs. The available options are:

These option names are recognized case-insensitively. You must double-quote string values if they contain spaces or commas.

In non-fragment-based headline generation, ts_headline (and thus ts_semanticheadline) locates matches for the given query and chooses a single one to display, preferring matches that have more query words within the allowed headline length. In fragment-based headline generation, ts_headline locates the query matches and splits each match into “fragments” of no more than MaxWords words each, preferring fragments with more query words, and when possible “stretching” fragments to include surrounding words. The fragment-based mode is thus more useful when the query matches span large sections of the document, or when it's desirable to display multiple matches.

TODO:

1) Impelemnt HighlightAll 2) Currently: In either mode, if no query matches can be identified, we return NULL, however, this will do weird things with JOINs and as a result, we should really return text as per the ts_headline spec:

In either mode, if no query matches can be identified, then a single fragment of the first MinWords words in the document will be displayed.