textacular / textacular

Textacular exposes full text search capabilities from PostgreSQL, and allows you to declare full text indexes. Textacular will extend ActiveRecord with named_scope methods making searching easy and fun!
https://github.com/textacular
942 stars 87 forks source link

Indexing on web_search? #140

Open tomcardoso opened 3 years ago

tomcardoso commented 3 years ago

I'm a bit confused as to how I can set up indexing on web search, especially if I want to use the approach we talked about in https://github.com/textacular/textacular/issues/137. To recap, what I'm doing is:

order, conditions = cases.web_search_similarities_and_conditions({
  case_number: keywords,
  short_title: keywords,
  case_events: {
    party_name: keywords,
    short_title: keywords
  }}, false)

results = Case
  .joins(:case_events)
  .where(conditions.join(' OR ')) # depends on exclusive, could be AND as well
  .group(:id) # group by ID to deduplicate
  .order(Arel.sql("max(#{order.join(' + ')})")) # use aggregation function to pick highest rank for duplicates

Which searches a number of fields for a certain keyword and returns all results that match in any of those fields.

But I'm finding the search to be very slow for my database (~1.8m case event entries). I went in and added simple keyword indexes of this sort:

class AddKeywordIndexToCaseEvents < ActiveRecord::Migration[6.0]
  disable_ddl_transaction!

  def change
    add_index :case_events, :party_name, unique: false, algorithm: :concurrently
    add_index :case_events, :short_title, unique: false, algorithm: :concurrently
  end
end

But I have a feeling I may need to do something more. Do you have any advice on what I should do? I know Textacular recommends using the :sql schema file format, but I'd prefer to keep it as a schema.rb for compatibility reasons with my production database.

tomcardoso commented 3 years ago

That didn't work, so I've tried doing something like this instead:

add_index :cases, "to_tsvector('english', case_number || ' ' || short_title)", using: :gin, name: 'cases_keywords_idx', algorithm: :concurrently

But the search is still very, very slow (taking 10 seconds or longer). I've done some profiling and it's all down to the bottleneck created by the special two-table, four-column query I mentioned in my previous message. Here's an example query that lasts ~10 seconds:

SELECT "cases".* FROM "cases" INNER JOIN "case_events" ON "case_events"."case_id" = "cases"."id" WHERE (to_tsvector('english', "cases"."case_number"::text) @@ websearch_to_tsquery('english', 'police'::text) OR to_tsvector('english', "cases"."short_title"::text) @@ websearch_to_tsquery('english', 'police'::text) OR to_tsvector('english', "case_events"."party_name"::text) @@ websearch_to_tsquery('english', 'police'::text) OR to_tsvector('english', "case_events"."short_title"::text) @@ websearch_to_tsquery('english', 'police'::text)) GROUP BY "cases"."id" ORDER BY max(COALESCE(ts_rank(to_tsvector('english', "cases"."case_number"::text), websearch_to_tsquery('english', 'police'::text)), 0) + COALESCE(ts_rank(to_tsvector('english', "cases"."short_title"::text), websearch_to_tsquery('english', 'police'::text)), 0) + COALESCE(ts_rank(to_tsvector('english', "case_events"."party_name"::text), websearch_to_tsquery('english', 'police'::text)), 0) + COALESCE(ts_rank(to_tsvector('english', "case_events"."short_title"::text), websearch_to_tsquery('english', 'police'::text)), 0)) LIMIT $1 OFFSET $2;

I've tested both with and without that indexing, and it doesn't seem to really make a difference, which makes me wonder whether I need to index things some other way, or if it's not the ts_rank operations that are really slowing me down. Some testing suggested the search is fairly snappy until I get to .order(Arel.sql("max(#{order.join(' + ')})")), at which point the query grinds to a halt, so that must be an expensive operation. Any suggestions here would be much appreciated.

simi commented 2 years ago

@tomcardoso for some reason if you join and lookup into two tables GIN indexes are not used. I have managed it is possible to check for matching JOINed table in CTE and just pass ids to final query. You need to use ANY(ARRAY(... method to keep indexes being used. Here is example fast query.

WITH related_labels AS (
  SELECT id FROM labels WHERE account_id = 80 AND to_tsvector('english_nostop', "labels"."name"::TEXT) @@ websearch_to_tsquery('english_nostop', 'Hello'::TEXT)
)
SELECT
    "products" .*,
    COALESCE(ts_rank(to_tsvector('english_nostop', "products"."title"::TEXT), websearch_to_tsquery('english_nostop', 'Hello'::TEXT)), 0) + COALESCE(ts_rank(to_tsvector('english_nostop', "products"."catalog_number"::TEXT), websearch_to_tsquery('english_nostop', 'Hello'::TEXT)), 0) + COALESCE(ts_rank(to_tsvector('english_nostop', "products"."version"::TEXT), websearch_to_tsquery('english_nostop', 'Hello'::TEXT)), 0) + COALESCE(ts_rank(to_tsvector('english_nostop', "products"."artist"::TEXT), websearch_to_tsquery('english_nostop', 'Hello'::TEXT)), 0) + COALESCE(ts_rank(to_tsvector('english_nostop', "products"."barcode"::TEXT), websearch_to_tsquery('english_nostop', 'Hello'::TEXT)), 0) + COALESCE(ts_rank(to_tsvector('english_nostop', "products"."company_id"::TEXT), websearch_to_tsquery('english_nostop', 'Hello'::TEXT)), 0) + COALESCE(ts_rank(to_tsvector('english_nostop', "labels"."name"::TEXT), websearch_to_tsquery('english_nostop', 'Hello'::TEXT)), 0) AS "rank20534463617833501"
FROM
    "products"
LEFT OUTER JOIN "labels" ON
    "labels"."id" = "products"."label_id"
WHERE
    "products"."account_id" = 80
    AND (to_tsvector('english_nostop', "products"."title"::TEXT) @@ websearch_to_tsquery('english_nostop', 'Hello'::TEXT)
    OR to_tsvector('english_nostop', "products"."version"::TEXT) @@ websearch_to_tsquery('english_nostop', 'Hello'::TEXT)
    OR to_tsvector('english_nostop', "products"."artist"::TEXT) @@ websearch_to_tsquery('english_nostop', 'Hello'::TEXT)
    OR to_tsvector('english_nostop', "products"."barcode"::TEXT) @@ websearch_to_tsquery('english_nostop', 'Hello'::TEXT)
    OR to_tsvector('english_nostop', "products"."company_id"::TEXT) @@ websearch_to_tsquery('english_nostop', 'Hello'::TEXT)
        OR label_id = ANY(ARRAY(SELECT * FROM related_labels)))
ORDER BY
    "rank20534463617833501" DESC