Move responsibility for text splitting configuration to the indexed type

wagtail / wagtail-vector-index

Store Wagtail pages & Django models as embeddings in vector databases

https://wagtail-vector-index.readthedocs.io/en/latest/

MIT License

Move responsibility for text splitting configuration to the indexed type #53

Closed: tomusher closed this 7 months ago

tomusher commented 8 months ago

Previously, the splitting behaviour was defined through settings on an embedding backend. If you wanted to split differently for different indexes, you needed to create multiple embedding backends.

The only thing that needs to be aware of how content is split is the VectorIndexable object itself, so the logic has now moved there and any customisations can be made directly on the indexed type.
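To make the change concrete, here is a minimal, self-contained sketch of the idea of per-type splitting configuration. It does not use the wagtail-vector-index API: `IndexableSketch`, `BlogPage`, `FaqEntry`, `split_text`, `chunk_size`, `chunk_overlap` and `get_chunks` are all hypothetical names chosen for illustration, and the character-based splitter is an assumption rather than the library's own splitting logic.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into character chunks of at most chunk_size, with overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early so the final chunk is not just a repeat of the overlap.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


class IndexableSketch:
    """Hypothetical stand-in for an indexed type (not the library's class)."""

    # Splitting configuration lives on the type, not on an embedding backend.
    chunk_size = 100
    chunk_overlap = 0

    def __init__(self, body: str) -> None:
        self.body = body

    def get_chunks(self) -> list[str]:
        return split_text(self.body, self.chunk_size, self.chunk_overlap)


class BlogPage(IndexableSketch):
    # Long-form content: smaller chunks with some overlap.
    chunk_size = 20
    chunk_overlap = 5


class FaqEntry(IndexableSketch):
    # Short content: usually fits in a single chunk.
    chunk_size = 50
```

With this arrangement, `BlogPage("a" * 40).get_chunks()` and `FaqEntry("a" * 40).get_chunks()` split the same text differently, and a single embedding backend could serve both types.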

tm-kn commented 8 months ago

> If you wanted to split differently for different indexes, you needed to create multiple embedding backends.

In some ways, you say that you might split differently for different indexes, but then you move the responsibility onto the data object, rather than the vector index class. Is where this lives now a long-term solution? I don't actually fully understand where the most useful location for this is, so I am being very naive here.

tomusher commented 8 months ago

> In some ways, you say that you might split differently for different indexes, but then you move the responsibility onto the data object, rather than the vector index class. Is where this lives now a long-term solution? I don't actually fully understand where the most useful location for this is, so I am being very naive here.

As part of https://github.com/wagtail/wagtail-vector-index/pull/54 that logic now lives in the DocumentConverter. Still not entirely sure that's the most logical place but at least it's somewhat composable.
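For readers wondering what "composable" could mean here, the following is a rough sketch of a converter that receives its splitting strategy as a parameter. It is not the `DocumentConverter` from the linked pull request: `ConverterSketch`, `to_documents`, `split_on_paragraphs` and `fixed_width` are hypothetical names, and the dictionary shape of a "document" is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable

# A splitter is any callable that turns one string into a list of chunks.
Splitter = Callable[[str], list[str]]


def split_on_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def fixed_width(width: int) -> Splitter:
    """Build a splitter that cuts text into pieces of at most `width` chars."""
    def split(text: str) -> list[str]:
        return [text[i:i + width] for i in range(0, len(text), width)]
    return split


@dataclass
class ConverterSketch:
    """Hypothetical converter; the splitting strategy is injected."""

    splitter: Splitter

    def to_documents(self, object_id: str, text: str) -> list[dict]:
        # One document per chunk, keeping a reference to the source object.
        return [
            {"object_id": object_id, "chunk": n, "content": chunk}
            for n, chunk in enumerate(self.splitter(text))
        ]
```

Swapping `ConverterSketch(split_on_paragraphs)` for `ConverterSketch(fixed_width(200))` changes the splitting behaviour without touching the indexed type or the embedding backend, which is one way to read the composability mentioned above.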