Open gdmoore opened 4 years ago
Yes, I have been planning to specify index type, especially on text fields where there are many different types of full-text indexes one can do.
The @@
syntax is not something I was aware of beforehand.
Could you propose a way to define custom index types? I would like to see if I could roll this in with MySQL's FULLTEXT
index so one could have a single way of specifying it, but specifying specifically would be perfectly fine too.
Do you ever see a scenario where one would appy different types of indexes on the same column set?
I could think that instead of just index=True
one could do index=indextype.PG_GIN_TSVECTOR
or index=indextype.FULLTEXT
(FULLTEXT could just point to PG_GIN_TSVECTOR or MS_FULLTEXT depending on db?)
Then the question comes on how to extend the index_together=
in Meta
as there is no place for named kwargs? Possibly if the last item in the tuple is an indextype
then it uses that? It's not very intuitive but the entire list-of-list thing isn't very intuitive?
:thinking:
At this time I don't have any good ideas for how to define custom types. I think the best usability for developers would be if you have implemented all of the choices for all of the databases you support. Being able to just reference tortoise.indices
for generic cross-DB ones like BTREE and tortoise.indices.<database>
to reference DB-specific ones . That might be awful for you to support though.
I can definitely see scenarios where you have different types of indexes on the same column set, and possibly multiple overlapping indexes, depending entirely on the queries your app needs to run.
I think for single columns your proposal is exactly what I'd expect. Or perhaps index=True
means to index using whatever the database uses by default, while index=indices.GIN
chooses a specific one, as you suggested. I would definitely not make the FULLTEXT choice for the user though -- in Postgres' case there are two (GIST and GIN) and they have very different performance for insert/query and which to use will depend on your app's needs.
I was previously using PonyORM and it has a different approach for multi-column indices. Instead of a meta field as you have, it's an additional member of the model class: https://docs.ponyorm.org/api_reference.html#composite-indexes
Do you have any thoughts on the responsibility of parsing the model fields to update the tsvector column? It sort of needs to be a "special" column, in that you would not want to have to specify how to build the tsvector document on each model instance created; you'd want something more like a field which auto-tsvectorizes a list of columns defined in the model.
Sort of the same idea as a datetime field being set to automatically have the value NOW()
, you would have a tsvector column for a Book
model which automatically uses the content of the Title
and Body
fields.
edit: forgot to conclude with, you could create a trigger and use that function I previously mentioned to do this, OR generate the SQL to do it yourself during the INSERT. If you generate the SQL yourself you need to send your data twice, where you have the original column data and then repeat it again inside a to_tsvector(...)
for the tsvector column.
Also is it currently possible to execute a database function when returning a value in a field's to_db_value
? If I take the approach of doing everything in Python and not relying on DB triggers then I can do something like:
message = fields.TextField()
text = TsVectorField(search_fields=("message",))
and have this (bad, proof of concept) TsVectorField class:
class TsVectorField(fields.Field):
"""
"""
# non-Postgres: unsupported
SQL_TYPE = None
class _db_postgres:
SQL_TYPE = "TSVECTOR"
def __init__(self, search_fields: Sequence[str] = (), **kwargs: Any) -> None:
self.search_fields = search_fields
super().__init__(**kwargs)
def to_db_value(
self, value: Any, instance: "Union[Type[Model], Model]"
) -> Optional[Any]:
if hasattr(instance, "_saved_in_db"):
content = " ".join(
[getattr(instance, field, "") for field in self.search_fields]
)
# TODO need to to_tsvector(content)
return content
return value
but returning the assembled content won't have Postgres run the tokenization of the data; an explicit to_tsvector("data")
is needed. I don't see any instances of this behaviour for any of the other fields so I'm not clear if it's doable.
There are explicit casts with the function_cast
functionality used by DecimalField
for sqlite, but that's a cast, not a function call. I'm trying to abuse that functionality now to get it to use to_tsvector
instead of generating a cast.
edit: I'm not really clear when this is used, looks like it's used in an F() object?
Compound indexes only work on generic index types, so that would be BTREE or HASH? (and HASH is discouraged use) so I'm thinking that it currently is fine.
index=True
would be the default index type for that field-db combination. Right now it is always a BTREE, but I feel an BTREE on a TextField doesn't make much sense, for example.
I like your idea of a tsvector field. A new class of search fields?
I was also thinking of function_cast
but we need two separate parts here, the query and the document? Also Looking at PyPika it doesn't currently support these extra syntax.
I will have to think as to a best way of doing this. I would like to have something that also works with MySQL's FULLTEXT index (it seems a lot less powerful though).
Trying to have something that supports both Postgres and Mysql is a good idea.
My initial proof of concept here isn't ideal because the search field is a special case. You don't particularly want users putting data directly into the search field, nor do you want them reading it in query results; it's really metadata for the DB, not user data. I was just going to know to ignore it in my application.
For now since there is no nice way to do that to_tsvector
cast, I'm going to use a trigger to create the field's data on insert or update. I'll change the class so that at schema generation time it generates the CREATE TRIGGER
s for all of the search_fields
provided.
Then, I'll start looking at how to get a query generated with the right syntax for search. I'll share my findings here so you have a few examples of use.
Using pypika it should be fairly simple to generate the tsvector @@ tsquery
syntax; an example is here: https://github.com/kayak/pypika/issues/349#issuecomment-599770845
Perhaps an additional filter field called __search
?
Book.filter(searchfield__search="dark stormy night")
would generate
SELECT * FROM Book WHERE searchfield @@ plainto_tsquery('dark stormy night')
note that there are multiple to_tsquery
functions; 'plain' accepts a mostly freeform string. Perhaps multiple filter fields for the various tsquery functions?
A special filter makes sense. I'm quite sick at the moment (and thinking is difficult), so I won't get to anything soon. Possibly @abondar can help in the mean time if you want to attempt it?
Get well soon.
For now I've come up with the following workaround which ticks most boxes for me. This is all very proof-of-concept quality code, but I'm sharing it to give some usage examples.
After initializing my database with await Tortoise.generate_schemas(safe=True)
, I invoke the following:
TsVectorField
I listed previously)search_fields
kwarg)Now whenever data is inserted into the table, the tsvector is automatically updated.
async def create_vectors():
# Get all models and check for any with TSVECTOR columns
described_models = Tortoise.describe_models(serializable=False)
for model_name, model_dict in described_models.items():
tsvector_fields: List[Mapping[str, str]] = []
# Find tsvector fields
for field_dict in model_dict["data_fields"]:
if field_dict["db_field_types"].get("postgres", None) == "TSVECTOR":
tsvector_fields.append(field_dict)
# Manipulate database for each
for tsvector_field in tsvector_fields:
db_column = tsvector_field["db_column"]
field_name = tsvector_field["name"]
table_name = model_dict["table"]
# ugh TODO get database.[fully qualified model name] in a nicer way
model_class = getattr(models, model_name.split(".")[-1])
search_fields = ", ".join(
model_class._meta.fields_map[field_name].search_fields
)
trigger_name = f"tsvector_update_{table_name}_{db_column}"
# Create the index ONLY if it did not already exist.
# Always drop and recreate the trigger; there is no CREATE TRIGGER IF NOT EXISTS.
ddl_stmt = f"""
CREATE INDEX IF NOT EXISTS tsvector_{table_name}_{db_column}_gin
ON {table_name}
USING gin
({db_column});
DROP TRIGGER IF EXISTS {trigger_name} ON {table_name};
CREATE TRIGGER {trigger_name} BEFORE INSERT OR UPDATE
ON {table_name}
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger({db_column}, 'pg_catalog.english', {search_fields});
"""
async with in_transaction() as connection:
await connection.execute_script(ddl_stmt)
I'll need some more time to investigate, but I think, tomorrow, I will come up with design and hopefully implementation
I'll post design here when it will be ready
So, I've been thinking about once again "borrowing" some design from Django. For now, it seems a lot of work to make, but I think it should be like this:
SearchVector
should be inherit from Function
. To allow such inheritance we should refactor Function
to allow multiple expressions passed to __init__
which will be controlled by arity
attribute on function classes.
SearchVectorField
should be done as unselectable field, because there is no real reason for it to be selected alongside with other fields (correct me if I am wrong). So, because it is not selectable I don't think there is any reason to implement to_db_value
and to_python_value
for it.
SearchVectorField
will rise error on index=True
saying that default index is not applicable for this type (or may be we should set GIN
as default?)
Also new field indexes
for Meta
class is introduced. Alongside with it we introduce new base class Index
and it's descendants for each db driver. All old means of creating indexes will be in fact just a shortcuts to directly creating indexes like that:
class Meta:
indexes = [
BTreeIndex("field_1"),
GINIndex("vector_field"),
]
SearchVectorField
should have create_trigger
param to generate trigger on schema create
SearchQuery
is introduced. Should be descendant of Function
. Will have real SQL function grabbed on the go depending on provided search_type
with following mapping:
{
'plain': 'plainto_tsquery',
'phrase': 'phraseto_tsquery',
'raw': 'to_tsquery',
'websearch': 'websearch_to_tsquery',
}
Also __search
suffix for filter
is introduced for dummy search with vectorizing on the go, that will implicitly call to_tsvector
and plainto_tsquery
Also there are some functionality regarding ranking of results, but I think we should save them for later discussions.
That looks like quite big amount of changes, so I would like to split it in several releases.
1) I think for first release it would suffice to implement __search
suffix and SearchQuery
with SearchVector
functions, that will allow basic search for non-performant solutions.
2) Next release should be about SearchVectorField
and indexes refactoring
3) Ranking, weights and etc looks like thrird concluding release for that cycle
That looks good to me. I think the only thing missing is: how will your SearchVectorField
specify which columns to use in the search?
As you said there is no to_[db|python]_value
needed. In my code I do have the field unconditionally set null=True
since I believe that's needed (as the trigger won't run until after the insert). edit: realized I used BEFORE in the trigger so actually the field being nullable is irrelevant too
I agree that it should be unselectable. It's possible that someone could want it for some reason but it's a very odd requirement to need it in your Python code.
Will your Index
class allow for indexes over multiple fields? (Doesn't make sense for all index types but does for eg btree BTreeIndex("field1", "field2")
)
Text search would be a nice feature to add! I am working on a green field project, which requires text search on some of the fields. It would be great if I can use tortoise as an async ORM with text search support, but I think I am stuck with SQLAlchemy for now..
Could you give a link for SQLAlchemy?
@long2ice Sorry, I missed your message earlier. Here is the link for SQLAlchemy: https://docs.sqlalchemy.org/en/14/dialects/postgresql.html#full-text-search
Hey guys, I spent a couple hours and it looks like currently:
How are we supposed to use TSVectorField?
I found that Postgres 12 added auto-generated and saved tsvectors: https://pgdash.io/blog/postgres-12-generated-columns.html. This could keep TsVectorField updated instead of using triggers. Seems like it almost fit into the generated_sql, but needs fields to compute from
Hi everyone,
i started using tortoise orm, now i need to optimize some queries with tsVector for faster search, but i don't find anywhere ho to implement that. Is this already working or still i development?
hey plz add support to vector quiet getting popularity @long2ice cause of ai
Have you considered support for Postgresql's tsvector type? ( https://www.postgresql.org/docs/12/textsearch.html )
I realize this would be entirely Postgresql specific so perhaps it would be more appropriate as a contrib module than a core feature.
I have been looking at implementing it and had a few different ideas; something like this perhaps:
CREATE TRIGGER
to the schema generation for each tsvector field. For each insert to the table, invoke tsvector_update_trigger ( https://www.postgresql.org/docs/12/functions-textsearch.html ) for the new row to create the vector on the fields listed by your meta-field. This is the thing I'm least sure about; should this be implemented in Tortoise, or just use the database trigger to manage it?One thing that Tortoise is missing is that it does not appear to be possible to tell Tortoise what index type you'd like when specifying to index a column; you would want to index the tsvector columns with either GIST or GIN depending on user requirements.