improve indexing for text searching on our views pages

adf-ncgr commented 8 years ago

As noted by Nathan in LEGUME-318 Done , some of our queries are not well served by btree indexing. In particular, the "Contains" searches on text fields like gene.description end up doing a query involving ILIKE %

% that currently table scans if combined with no other indexed search field, resulting in fairly poor performance.

There are a variety of indexing strategies available for full-text-search with Postgres. Alex has already explored GIN indexing in the lis_gis application, and we should probably start evaluating how that would work in the context of the tripal mviews that back most if not all of our drupal-view-based query pages. Another alternative I found is "trigram" indexing as described here:
https://www.postgresql.org/docs/9.1/static/pgtrgm.html
this seems to be a little more directed towards "phoneme" oriented matching, but that might have some benefits to consider in our application (though doubtless some disadvantages as well).

[LEGUME-607] created by adf_ncgr

adf-ncgr commented 8 years ago

I had not heard of the trigram indexing before.

At first, it sounds like a good fit for this particular search case of wildcards with '%GO:xxxxx% and not supporting full-text searching.

But many of the chado.gene.descriptions are pretty lengthy so maybe the trigrams generate would cause resource strain of it's own:

"Trigram search gets much slower as search string gets longer."

http://dba.stackexchange.com/questions/111523/trigram-search-gets-much-slower-as-search-string-gets-longer?rq=1

by agr

adf-ncgr commented 8 years ago

Just as an aside about full-text searching in Postgresql, (as mentioned in GH-318 Done posted by Nathan):

The ilike query is not utilizing the btree index, as shown by Nathan's EXPLAIN in GH-318 Done .

I am using Postgresql FTS (full text search) in the germplasm/map app and also in another project here at NCGR. Basically- it works great!

Although my understanding is it does introduce some admin overhead of maintaining a tsvector column in sync with the text to be searched. Otherwise the tsvector is calculated on the fly, Not sure of the tradeoffs of calculating the tsvector on the fly, versus the sequential scan being done now.

Here are some steps I wrote up for a co-worker here recently on another project:

step 1: add tsvector column to table

alter table plateparty.classification add taxon_fts tsvector;

step 2: add a GIN index to the tsvector
create index taxon_fts_idx on plateparty.classification using gin (taxon_fts);

step 3: update the column with to_tsvector (note: this step has to be manually done after every data load or update! or this could be setup with a trigger, possibly?)

update plateparty.classification set taxon_fts = to_tsvector('english',
concase_ws(' ', tax_..., tax_..., columns, ...)) ;

Note: in chado.gene this would just be something like

update chado.gene set description_fts = to_tsvector('english', description);

step 4: write query using either to_tsquery() or plainto_tsquery()

select ... from plateparty.classification
where taxon_fts @@ plainto_tsquery('english', 'foo bar bla');

by agr

legumeinfo / jira-issues

improve indexing for text searching on our views pages #573