Postgresql tsvector field and query support

gdmoore commented 4 years ago

Have you considered support for Postgresql's tsvector type? ( https://www.postgresql.org/docs/12/textsearch.html )

I realize this would be entirely Postgresql specific so perhaps it would be more appropriate as a contrib module than a core feature.

I have been looking at implementing it and had a few different ideas; something like this perhaps:

custom Field which creates a tsvector column type
meta-field listing in a Model to describe which columns to full-text index
Add a CREATE TRIGGER to the schema generation for each tsvector field. For each insert to the table, invoke tsvector_update_trigger ( https://www.postgresql.org/docs/12/functions-textsearch.html ) for the new row to create the vector on the fields listed by your meta-field. This is the thing I'm least sure about; should this be implemented in Tortoise, or just use the database trigger to manage it?
custom query Function which allows for the '@@ tsquery' syntax

One thing that Tortoise is missing is that it does not appear to be possible to tell Tortoise what index type you'd like when specifying to index a column; you would want to index the tsvector columns with either GIST or GIN depending on user requirements.

grigi commented 4 years ago

Yes, I have been planning to specify index type, especially on text fields where there are many different types of full-text indexes one can do.

The @@ syntax is not something I was aware of beforehand.

Could you propose a way to define custom index types? I would like to see if I could roll this in with MySQL's FULLTEXT index so one could have a single way of specifying it, but specifying specifically would be perfectly fine too.

Do you ever see a scenario where one would appy different types of indexes on the same column set?

I could think that instead of just index=True one could do index=indextype.PG_GIN_TSVECTOR or index=indextype.FULLTEXT (FULLTEXT could just point to PG_GIN_TSVECTOR or MS_FULLTEXT depending on db?)

Then the question comes on how to extend the index_together= in Meta as there is no place for named kwargs? Possibly if the last item in the tuple is an indextype then it uses that? It's not very intuitive but the entire list-of-list thing isn't very intuitive?

:thinking:

gdmoore commented 4 years ago

At this time I don't have any good ideas for how to define custom types. I think the best usability for developers would be if you have implemented all of the choices for all of the databases you support. Being able to just reference tortoise.indices for generic cross-DB ones like BTREE and tortoise.indices.<database> to reference DB-specific ones . That might be awful for you to support though.

I can definitely see scenarios where you have different types of indexes on the same column set, and possibly multiple overlapping indexes, depending entirely on the queries your app needs to run.

I think for single columns your proposal is exactly what I'd expect. Or perhaps index=True means to index using whatever the database uses by default, while index=indices.GIN chooses a specific one, as you suggested. I would definitely not make the FULLTEXT choice for the user though -- in Postgres' case there are two (GIST and GIN) and they have very different performance for insert/query and which to use will depend on your app's needs.

I was previously using PonyORM and it has a different approach for multi-column indices. Instead of a meta field as you have, it's an additional member of the model class: https://docs.ponyorm.org/api_reference.html#composite-indexes

Do you have any thoughts on the responsibility of parsing the model fields to update the tsvector column? It sort of needs to be a "special" column, in that you would not want to have to specify how to build the tsvector document on each model instance created; you'd want something more like a field which auto-tsvectorizes a list of columns defined in the model. Sort of the same idea as a datetime field being set to automatically have the value NOW(), you would have a tsvector column for a Book model which automatically uses the content of the Title and Body fields.

edit: forgot to conclude with, you could create a trigger and use that function I previously mentioned to do this, OR generate the SQL to do it yourself during the INSERT. If you generate the SQL yourself you need to send your data twice, where you have the original column data and then repeat it again inside a to_tsvector(...) for the tsvector column.

gdmoore commented 4 years ago

Also is it currently possible to execute a database function when returning a value in a field's to_db_value ? If I take the approach of doing everything in Python and not relying on DB triggers then I can do something like:

    message = fields.TextField()
    text = TsVectorField(search_fields=("message",))

and have this (bad, proof of concept) TsVectorField class:

class TsVectorField(fields.Field):
    """
    """

    # non-Postgres: unsupported
    SQL_TYPE = None

    class _db_postgres:
        SQL_TYPE = "TSVECTOR"

    def __init__(self, search_fields: Sequence[str] = (), **kwargs: Any) -> None:
        self.search_fields = search_fields
        super().__init__(**kwargs)

    def to_db_value(
        self, value: Any, instance: "Union[Type[Model], Model]"
    ) -> Optional[Any]:
        if hasattr(instance, "_saved_in_db"):
            content = " ".join(
                [getattr(instance, field, "") for field in self.search_fields]
            )
            # TODO need to to_tsvector(content)
            return content
        return value

but returning the assembled content won't have Postgres run the tokenization of the data; an explicit to_tsvector("data") is needed. I don't see any instances of this behaviour for any of the other fields so I'm not clear if it's doable.

There are explicit casts with the function_cast functionality used by DecimalField for sqlite, but that's a cast, not a function call. I'm trying to abuse that functionality now to get it to use to_tsvector instead of generating a cast. edit: I'm not really clear when this is used, looks like it's used in an F() object?

grigi commented 4 years ago

Compound indexes only work on generic index types, so that would be BTREE or HASH? (and HASH is discouraged use) so I'm thinking that it currently is fine.

index=True would be the default index type for that field-db combination. Right now it is always a BTREE, but I feel an BTREE on a TextField doesn't make much sense, for example.

I like your idea of a tsvector field. A new class of search fields? I was also thinking of function_cast but we need two separate parts here, the query and the document? Also Looking at PyPika it doesn't currently support these extra syntax.

I will have to think as to a best way of doing this. I would like to have something that also works with MySQL's FULLTEXT index (it seems a lot less powerful though).

gdmoore commented 4 years ago

Trying to have something that supports both Postgres and Mysql is a good idea.

My initial proof of concept here isn't ideal because the search field is a special case. You don't particularly want users putting data directly into the search field, nor do you want them reading it in query results; it's really metadata for the DB, not user data. I was just going to know to ignore it in my application.

For now since there is no nice way to do that to_tsvector cast, I'm going to use a trigger to create the field's data on insert or update. I'll change the class so that at schema generation time it generates the CREATE TRIGGERs for all of the search_fields provided.

Then, I'll start looking at how to get a query generated with the right syntax for search. I'll share my findings here so you have a few examples of use.

gdmoore commented 4 years ago

Using pypika it should be fairly simple to generate the tsvector @@ tsquery syntax; an example is here: https://github.com/kayak/pypika/issues/349#issuecomment-599770845

Perhaps an additional filter field called __search? Book.filter(searchfield__search="dark stormy night")

would generate SELECT * FROM Book WHERE searchfield @@ plainto_tsquery('dark stormy night')

note that there are multiple to_tsquery functions; 'plain' accepts a mostly freeform string. Perhaps multiple filter fields for the various tsquery functions?

grigi commented 4 years ago

A special filter makes sense. I'm quite sick at the moment (and thinking is difficult), so I won't get to anything soon. Possibly @abondar can help in the mean time if you want to attempt it?

gdmoore commented 4 years ago

Get well soon.

For now I've come up with the following workaround which ticks most boxes for me. This is all very proof-of-concept quality code, but I'm sharing it to give some usage examples.

After initializing my database with await Tortoise.generate_schemas(safe=True), I invoke the following:

find all models using Tortoise.describe_modules
check each model for TSVECTOR columns (the TsVectorField I listed previously)
for each of those columns, find out what columns are referenced for searching (search_fields kwarg)
generate DDL to create a GIN index, drop trigger, and create trigger

Now whenever data is inserted into the table, the tsvector is automatically updated.

async def create_vectors():

    # Get all models and check for any with TSVECTOR columns
    described_models = Tortoise.describe_models(serializable=False)

    for model_name, model_dict in described_models.items():
        tsvector_fields: List[Mapping[str, str]] = []

        # Find tsvector fields
        for field_dict in model_dict["data_fields"]:
            if field_dict["db_field_types"].get("postgres", None) == "TSVECTOR":
                tsvector_fields.append(field_dict)

        # Manipulate database for each
        for tsvector_field in tsvector_fields:
            db_column = tsvector_field["db_column"]
            field_name = tsvector_field["name"]
            table_name = model_dict["table"]

            # ugh TODO get database.[fully qualified model name] in a nicer way
            model_class = getattr(models, model_name.split(".")[-1])
            search_fields = ", ".join(
                model_class._meta.fields_map[field_name].search_fields
            )

            trigger_name = f"tsvector_update_{table_name}_{db_column}"

            # Create the index ONLY if it did not already exist.
            # Always drop and recreate the trigger; there is no CREATE TRIGGER IF NOT EXISTS.
            ddl_stmt = f"""
            CREATE INDEX IF NOT EXISTS tsvector_{table_name}_{db_column}_gin
            ON {table_name}
            USING gin
            ({db_column});

            DROP TRIGGER IF EXISTS {trigger_name} ON {table_name};

            CREATE TRIGGER {trigger_name} BEFORE INSERT OR UPDATE
            ON {table_name} 
            FOR EACH ROW EXECUTE PROCEDURE
            tsvector_update_trigger({db_column}, 'pg_catalog.english', {search_fields});
            """

            async with in_transaction() as connection:
                await connection.execute_script(ddl_stmt)

abondar commented 4 years ago

I'll need some more time to investigate, but I think, tomorrow, I will come up with design and hopefully implementation

I'll post design here when it will be ready

abondar commented 4 years ago

So, I've been thinking about once again "borrowing" some design from Django. For now, it seems a lot of work to make, but I think it should be like this:

SearchVector should be inherit from Function. To allow such inheritance we should refactor Function to allow multiple expressions passed to __init__ which will be controlled by arity attribute on function classes.

SearchVectorField should be done as unselectable field, because there is no real reason for it to be selected alongside with other fields (correct me if I am wrong). So, because it is not selectable I don't think there is any reason to implement to_db_value and to_python_value for it. SearchVectorField will rise error on index=True saying that default index is not applicable for this type (or may be we should set GIN as default?)

Also new field indexes for Meta class is introduced. Alongside with it we introduce new base class Index and it's descendants for each db driver. All old means of creating indexes will be in fact just a shortcuts to directly creating indexes like that:

class Meta:
    indexes = [
         BTreeIndex("field_1"),
         GINIndex("vector_field"),
    ]

SearchVectorField should have create_trigger param to generate trigger on schema create

SearchQuery is introduced. Should be descendant of Function. Will have real SQL function grabbed on the go depending on provided search_type with following mapping:

{
        'plain': 'plainto_tsquery',
        'phrase': 'phraseto_tsquery',
        'raw': 'to_tsquery',
        'websearch': 'websearch_to_tsquery',
}

Also __search suffix for filter is introduced for dummy search with vectorizing on the go, that will implicitly call to_tsvector and plainto_tsquery

Also there are some functionality regarding ranking of results, but I think we should save them for later discussions.

That looks like quite big amount of changes, so I would like to split it in several releases. 1) I think for first release it would suffice to implement __search suffix and SearchQuery with SearchVector functions, that will allow basic search for non-performant solutions. 2) Next release should be about SearchVectorField and indexes refactoring 3) Ranking, weights and etc looks like thrird concluding release for that cycle

gdmoore commented 4 years ago

That looks good to me. I think the only thing missing is: how will your SearchVectorField specify which columns to use in the search?

As you said there is no to_[db|python]_value needed. In my code I do have the field unconditionally set null=True since I believe that's needed (as the trigger won't run until after the insert). edit: realized I used BEFORE in the trigger so actually the field being nullable is irrelevant too

I agree that it should be unselectable. It's possible that someone could want it for some reason but it's a very odd requirement to need it in your Python code.

Will your Index class allow for indexes over multiple fields? (Doesn't make sense for all index types but does for eg btree BTreeIndex("field1", "field2") )

xinbinhuang commented 3 years ago

Text search would be a nice feature to add! I am working on a green field project, which requires text search on some of the fields. It would be great if I can use tortoise as an async ORM with text search support, but I think I am stuck with SQLAlchemy for now..

long2ice commented 3 years ago

Could you give a link for SQLAlchemy?

xinbinhuang commented 3 years ago

@long2ice Sorry, I missed your message earlier. Here is the link for SQLAlchemy: https://docs.sqlalchemy.org/en/14/dialects/postgresql.html#full-text-search

long2ice commented 3 years ago

Check https://github.com/tortoise/tortoise-orm/pull/687

tinducvo commented 3 years ago

Hey guys, I spent a couple hours and it looks like currently:

__search can be used with MySQL or Postgres with dynamic ToTsVector and ToTsQuery conversion
TSVectorField support has been added
GinIndex support has been added => single column, unindexed searches on non-TSVectorFields can currently be done

How are we supposed to use TSVectorField?

How to keep the the TSVectorField updated? (triggers, sql, or python?)
How to search from it? (__search always uses ToTsVector, converting unnecessarily => bad results?) => if we can do this then, we can just use the GinIndex for multiple column, indexed searches on TSVectorFields

tinducvo commented 3 years ago

I found that Postgres 12 added auto-generated and saved tsvectors: https://pgdash.io/blog/postgres-12-generated-columns.html. This could keep TsVectorField updated instead of using triggers. Seems like it almost fit into the generated_sql, but needs fields to compute from

niggle commented 1 year ago

Hi everyone,

i started using tortoise orm, now i need to optimize some queries with tsVector for faster search, but i don't find anywhere ho to implement that. Is this already working or still i development?

neonarc4 commented 5 months ago

hey plz add support to vector quiet getting popularity @long2ice cause of ai

tortoise / tortoise-orm

Postgresql tsvector field and query support #323