pietermartin / sqlg

TinkerPop graph over sql
MIT License
244 stars 51 forks source link

Support PostgreSQL full text search #163

Closed JPMoresmau closed 7 years ago

JPMoresmau commented 7 years ago

There is no support in Gremlin for full-text search predicates, but we could add our own predicates that would use the full text PostgreSQL operators. This would be very useful for us (mix normal Gremlin queries with full text search).

See https://www.postgresql.org/docs/current/static/textsearch-tables.html#TEXTSEARCH-TABLES-SEARCH. I suppose we should just offer predicates and hope that the full text indices have been created outside of sqlg.

pietermartin commented 7 years ago

There are already org.umlg.sqlg.predicate.Text That might cover what you are looking for already.

JPMoresmau commented 7 years ago

No, these predicates become things like LIKE in SQL, not the full text operators @@ and such. But these Text predicates are a good guide!

pietermartin commented 7 years ago

Ok, just looked at the link you sent, not really familiar with that syntax.

pietermartin commented 7 years ago

Perhaps on AbstractLabel.ensureIndexExists we can add a IndexType.GIN? Then the standard Topology creation api can support GIN indexes.

JPMoresmau commented 7 years ago

Mmmmhhh, but the column clause of a Gin index is not a list of columns, but an expression. So we can't model that via a list of PropertyColumn.

JPMoresmau commented 7 years ago

Also, the index expression could be a concatenation of columns, say, and you can also search on that same expression for the index to be used. But currently I just use a predicate, so to use with has() we need to provide one column name. Ideally we should be able to provide the full search expression. I'm not aware of any mechanism in Gremlin to express a vertex-level predicate?

pietermartin commented 7 years ago

Here is my reading of it.

GIN index column must be of type tsvector. We can assume that to_tsvector will always be on a column or a list of columns. AbstractLabel.ensureIndexExists takes a list of columns.

CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', title || ' ' || body));

can become

pgwebVertexLabel.ensureIndexExists(IndexType.GIN, Arrays.toList(titlePropertyColumn, bodyPropertyColumn))

The information that is not specified is that 'english' bit. Not sure how to get around that. Maybe some kind of special PropertyColumn so we can write, either with an actual real config_name column or a jippo one

//there is an actual config_name column
pgwebVertexLabel.ensureIndexExists(IndexType.GIN, Arrays.toList(configNamePropertyColumn, titlePropertyColumn, bodyPropertyColumn))

or

//a fake PropertyColumn
pgwebVertexLabel.ensureIndexExists(IndexType.GIN, Arrays.toList(PropertyColumn.ENGLISH_GIN, titlePropertyColumn, bodyPropertyColumn))

Does it make sense or am I missing something?

JPMoresmau commented 7 years ago

This is what I'm doing:

I of course need to check I haven't broken anything.

One question: when there's a query with or(__.has('name','val1'),__.has('name','val2')), this doesn't get generated in SQL, right? That's another big improvement that could be added.

pietermartin commented 7 years ago

Yes or does not get optimized. Nor where nor match. or is complicated. In fact I recall on the forums someone saying its not possible with a fluent api to have the same semantics as a declarative language like sql. I have not thought about it for a while but I what recall being difficult is to map sql's open close bracket semantics together with and and ors to gremlin. I think in gremlin they simulate it instead with the matchclause.

But yeah for-sure it will be a big improvement to have some kind of or support.

pietermartin commented 7 years ago

Done