toluaina / pgsync

Postgres to Elasticsearch/OpenSearch sync
https://pgsync.com
MIT License

PGSync does not allow use of the `knn_vector` type #263

Closed: kingkupps closed this issue 2 years ago

kingkupps commented 2 years ago

PGSync version: 2.2.1

Postgres version: 12.5

Elasticsearch version: Opensearch 1.2.4

Redis version: 6.0.10

Python version: 3.10

Problem Description: PGSync does not allow fields mapped to the knn_vector type described here.

Schema:

[
  {
    "database": "mydb",
    "index": "index_v1",
    "settings": {
      "index": {
        "knn": true
      }
    },
    "nodes": {
      "table": "mytable",
      "schema": "public",
      "columns": [
        "id",
        "embedding",
        "model_tag"
      ],
      "transform": {
        "rename": {
          "id": "embedding_id"
        },
        "mapping": {
          "embedding_id": {
            "type": "long"
          },
          "embedding": {
            "type": "knn_vector",
            "dimension": 512,
            "method": {
              "name": "hnsw",
              "space_type": "cosinesimil",
              "engine": "nmslib"
            }
          },
          "model_tag": {
            "type": "string"
          }
        }
      }
    }
  }
]

Error Message (if any):

  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/bin/bootstrap", line 59, in main
    sync: Sync = Sync(
  File "/usr/local/lib/python3.10/site-packages/pgsync/sync.py", line 95, in __init__
    self.create_setting()
  File "/usr/local/lib/python3.10/site-packages/pgsync/sync.py", line 238, in create_setting
    self.es._create_setting(
  File "/usr/local/lib/python3.10/site-packages/pgsync/elastichelper.py", line 250, in _create_setting
    mapping: dict = self._build_mapping(node, routing)
  File "/usr/local/lib/python3.10/site-packages/pgsync/elastichelper.py", line 279, in _build_mapping
    raise RuntimeError(
RuntimeError: Invalid Elasticsearch type knn_vector

toluaina commented 2 years ago

kingkupps commented 2 years ago

This might be a "works-as-intended" case, but knn_vector is an OpenSearch type (docs are here).

There is also a workaround: you can create the index mapping ahead of time and leave the mappings out of the pgsync schema. It would be really nice, though, to be able to specify the mapping for documents that use knn_vector directly in the schema.
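
The workaround above can be sketched roughly as follows. This is a minimal sketch, not part of PGSync: it builds the body for an OpenSearch create-index request (`PUT /index_v1`) that carries the `knn_vector` mapping, using only the Python standard library. The helper names `build_knn_index_body` and `create_index`, the base URL passed to `create_index`, and the use of `keyword` for `model_tag` are assumptions for illustration; the field names and k-NN settings are taken from the schema in the report.

```python
import json
import urllib.request


def build_knn_index_body(dimension: int = 512) -> dict:
    """Build a create-index body with the k-NN mapping from the report."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding_id": {"type": "long"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "nmslib",
                    },
                },
                # Assumption: "keyword" stands in for the report's "string".
                "model_tag": {"type": "keyword"},
            }
        },
    }


def create_index(base_url: str, index: str, body: dict) -> int:
    """Send PUT <base_url>/<index> and return the HTTP status code."""
    request = urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Only print the body here; call create_index(...) against a real cluster.
    print(json.dumps(build_knn_index_body(), indent=2))
```

With the index created this way, the `mapping` block can be left out of the pgsync schema, as described above, so the type check in `_build_mapping` is never reached for `knn_vector`.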

toluaina commented 2 years ago

Hi, I haven't quite worked out how to handle the differences between OpenSearch and Elasticsearch yet, and there are bound to be more of these differences. I'm happy to add knn_vector for now. Is this sufficient for your immediate use case?

kingkupps commented 2 years ago

Yes that should be good enough 👍

toluaina commented 2 years ago

This has now been added to the master branch.

ImtiazKhanDS commented 2 months ago

How do I define the dimension and method fields for the knn_vector type? I'm getting an error saying that dimension is an invalid Elasticsearch mapping parameter.
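
One way to sidestep the parameter check, following the workaround described earlier in this thread, is to define the `knn_vector` mapping (including `dimension` and `method`) directly in OpenSearch and drop the `mapping` block from the PGSync schema. The following is a minimal sketch, assuming the schema layout shown in the original report. `strip_mapping` is a hypothetical helper, not a PGSync function, and this thread does not establish whether PGSync itself accepts `dimension` and `method` in the schema.

```python
import copy
import json


def strip_mapping(schema: list) -> list:
    """Return a copy of a PGSync schema with transform.mapping removed.

    Only the top-level node of each document is handled (no child nodes).
    """
    stripped = copy.deepcopy(schema)
    for document in stripped:
        transform = document.get("nodes", {}).get("transform", {})
        transform.pop("mapping", None)
    return stripped


# Abbreviated version of the schema from the original report.
schema = [
    {
        "database": "mydb",
        "index": "index_v1",
        "nodes": {
            "table": "mytable",
            "schema": "public",
            "columns": ["id", "embedding", "model_tag"],
            "transform": {
                "rename": {"id": "embedding_id"},
                "mapping": {
                    "embedding": {"type": "knn_vector", "dimension": 512}
                },
            },
        },
    }
]

# The rename is kept; only the mapping block is dropped.
print(json.dumps(strip_mapping(schema), indent=2))
```

The original schema object is left unchanged, so the full mapping remains available for creating the index on the OpenSearch side.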