tomjaguarpaw / haskell-opaleye

Other
599 stars 115 forks source link

[FTS] full text search implementation #563

Closed CGenie closed 2 months ago

CGenie commented 1 year ago

Continuation of https://github.com/tomjaguarpaw/haskell-opaleye/pull/561 as I don't know how to change the base branch in an existing merge request.

tomjaguarpaw commented 1 year ago

Sorry about this taking a long time. I have a lot of other things in my queue. Feel free to ping me if you think it's taking too long.

CGenie commented 1 year ago

So far we're using our own branch for this. No worries, take your time, it's how open source works, we all have life and work to do :)

tomjaguarpaw commented 8 months ago

Hi @CGenie, sorry again that this seems to have escaped my attention. Is this something you're still using and want to get into Opaleye?

CGenie commented 8 months ago

Yes, we are still using it and yes, I think a library that advertises itself as "targeting PostgreSQL" should have postgres arrays, json and FTS, in particular.

tomjaguarpaw commented 8 months ago

OK, I'll get this merged. Can you please clarify whether you actually want functions of type

Field SqlText -> Field SqlTSVector
Field SqlText -> Field SqlTSQuery

or if it's sufficient to have

String -> F.Field P.SqlTSQuery
String -> F.Field P.SqlTSVector

Then I'll decide where best to put them. If the former, I think they belong in Opaleye.Operators.

(I don't use this functionality so I don't know how it's normally used)

CGenie commented 8 months ago

We use it like this:

restrict -< (_ns_search row)    @@ (sqlToTSQuery (unpack q))

I guess it's easier to have that wrapped into a field as we compare fields directly and if you want to provide a raw string, just wrap it in sqlString. I think such wrapping is simpler than unwrapping (as one should handle the null case, no?)

Basically, full text search is similar to the "LIKE" operator, it is more powerful though as it can tokenize text (hence the TSVector column) which is language-specific. See https://www.postgresql.org/docs/current/textsearch-controls.html. Having the tsvector column gives us pre-computed vectors of text tokens and then we can search large amounts of texts quite fast. So tsvector is like a special index column for text.

I guess some info can be found here: https://stackoverflow.com/a/224726

One can also present highlighted results (https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-HEADLINE) but we don't use it at this moment.

Hopefully I gave some insight into this feature :)

tomjaguarpaw commented 8 months ago

Thanks! I don't see where sqlToTSQuery is defined though. It's not in this PR.

tomjaguarpaw commented 8 months ago

Oh, I see, it's here: https://github.com/adinapoli/haskell-opaleye/commit/e9a29582ac66198dd2c2fdc3f8c8a4b1e6fbe004#diff-fec920cc10c90940a3a7076270ca09361bbefd1b3788d80c0ce9558e1541c92f

tomjaguarpaw commented 8 months ago

Thanks. After playing around with it I think to expose the full set of functionality you probably want

(@@) :: Field T.SqlTSVector -> Field T.SqlTSQuery -> F.Field T.SqlBool
tsvector :: Field T.SqlString -> Field T.SqlVector
to_tsvector :: Field T.SqlString -> Field T.SqlVector
tsquery :: Field T.SqlString -> Field T.SqlVector
to_tsquery :: Field T.SqlString -> Field T.SqlVector

Would that work?

tomjaguarpaw commented 8 months ago

And @CGenie, when I add this, can I list you as the "owner" of this functionality, to be summoned if anyone has a question about it?

CGenie commented 8 months ago

Yes, you can set me as the "owner" of this functionality.

About the functions you listed: what's the difference between tsvector and to_tsvector? Same for tsquery and to_tsquery? Types seem to be the same.

tomjaguarpaw commented 8 months ago

It looks like the to_ functions do normalization, as explained at https://www.postgresql.org/docs/current/datatype-textsearch.html. For example

postgres=# SELECT tsvector('Cat Fat Fat Fat');
  tsvector   
-------------
 'Cat' 'Fat'
(1 row)

postgres=# SELECT to_tsvector('Cat Fat Fat Fat');
     to_tsvector     
---------------------
 'cat':1 'fat':2,3,4
(1 row)
CGenie commented 8 months ago

Actually there are a few more: plainto_tsquery, phraseto_tsquery and websearch_to_tsquery (https://www.postgresql.org/docs/16/textsearch-controls.html) and we make use of them as well. I guess tsvector, to_tsvector, tsquery, to_tsquery, plainto_tsquery, phraseto_tsquery, websearto_to_tsquery as well as (@@) would make for a more or less complete functionality.

tomjaguarpaw commented 8 months ago

I see. This gets more and more complicated and I'm doubting my ability to review or maintain this functionality. I think my recommendation would be that interested parties make an opaleye-textsearch package that exposes the functionality they want (using the OpOther constructor of BinOp rather than adding (:@@)) and then incorporating it into Opaleye after it's proven itself out in practice, say after a year or so.

How does that sound?

CGenie commented 8 months ago

OK, I'll try to follow this recommendation then. Thanks!

CGenie commented 2 months ago

If anyone is interested, the abovementioned package is here https://hackage.haskell.org/package/opaleye-textsearch-0.1.0.0