pgvector / pgvector-python

pgvector support for Python
MIT License

Maybe spurious exception (ValueError: missing dimensions) when using sparsevec with django #80

Closed: QBH3 closed this issue 3 months ago

QBH3 commented 3 months ago

I believe the test in https://github.com/pgvector/pgvector-python/blob/633cbd724380d445f47e405b801964c4b60fba6a/pgvector/utils/sparsevec.py#L16 might not work as intended.

When saving an updated model instance in Django, the constructor is called via https://github.com/pgvector/pgvector-python/blob/master/pgvector/utils/sparsevec.py#L125 and does not receive a dimension as an argument.

Django model class:

from django.db import models
from pgvector.django import SparseVectorField
from pgvector.django import HnswIndex

class Node(models.Model):
    text = models.TextField()
    embedding = SparseVectorField(dimensions=30522, null=True, blank=True)  # for "naver/splade-cocondenser-ensembledistil"

    class Meta:
        indexes = [
            HnswIndex(
                name='my_index',
                fields=['embedding'],
                opclasses=['sparsevec_l2_ops']
            )
        ]

How the table looks in psql:

                                                   Table "public.planetai_django_api_node"
    Column     |       Type       | Collation | Nullable |             Default              | Storage  | Compression | Stats target | Description
---------------+------------------+-----------+----------+----------------------------------+----------+-------------+--------------+-------------
 id            | bigint           |           | not null | generated by default as identity | plain    |             |              |
 text          | text             |           | not null |                                  | extended |             |              |
 embedding     | sparsevec(30522) |           |          |                                  | external |             |              |

You can see that the table column has the same dimension as the field in the model class.

The exception that was thrown:

Traceback (most recent call last):
  File "PGVECTORENV/lib/python3.10/site-packages/asgiref/sync.py", line 518, in thread_handler
    raise exc_info[1]
  File "PGVECTORENV/lib/python3.10/site-packages/django/core/handlers/exception.py", line 42, in inner
    response = await get_response(request)
  File "PGVECTORENV/lib/python3.10/site-packages/django/core/handlers/base.py", line 253, in _get_response_async
    response = await wrapped_callback(
  File "PGVECTORENV/lib/python3.10/site-packages/asgiref/sync.py", line 468, in __call__
    ret = await asyncio.shield(exec_coro)
  File "PGVECTORENV/lib/python3.10/site-packages/asgiref/current_thread_executor.py", line 40, in run
    result = self.fn(*self.args, **self.kwargs)
  File "PGVECTORENV/lib/python3.10/site-packages/asgiref/sync.py", line 522, in thread_handler
    return func(*args, **kwargs)
  File "views/rag/api.py", line 69, in documents_index
    textnode.save()
  File "PGVECTORENV/lib/python3.10/site-packages/django/db/models/base.py", line 822, in save
    self.save_base(
  File "PGVECTORENV/lib/python3.10/site-packages/django/db/models/base.py", line 909, in save_base
    updated = self._save_table(
  File "PGVECTORENV/lib/python3.10/site-packages/django/db/models/base.py", line 1040, in _save_table
    updated = self._do_update(
  File "PGVECTORENV/lib/python3.10/site-packages/django/db/models/base.py", line 1105, in _do_update
    return filtered._update(values) > 0
  File "PGVECTORENV/lib/python3.10/site-packages/django/db/models/query.py", line 1278, in _update
    return query.get_compiler(self.db).execute_sql(CURSOR)
  File "PGVECTORENV/lib/python3.10/site-packages/django/db/models/sql/compiler.py", line 1990, in execute_sql
    cursor = super().execute_sql(result_type)
  File "PGVECTORENV/lib/python3.10/site-packages/django/db/models/sql/compiler.py", line 1549, in execute_sql
    sql, params = self.as_sql()
  File "PGVECTORENV/lib/python3.10/site-packages/django/db/models/sql/compiler.py", line 1953, in as_sql
    val = field.get_db_prep_save(val, connection=self.connection)
  File "PGVECTORENV/lib/python3.10/site-packages/django/db/models/fields/__init__.py", line 1013, in get_db_prep_save
    return self.get_db_prep_value(value, connection=connection, prepared=False)
  File "PGVECTORENV/lib/python3.10/site-packages/django/db/models/fields/__init__.py", line 1006, in get_db_prep_value
    value = self.get_prep_value(value)
  File "/mnt/ssd_nas_homes/chrstianbahls-10073/.cache/pypoetry/virtualenvs/planetai-django-tenderx-IbQbFZXB-py3.10/lib/python3.10/site-packages/pgvector/django/sparsevec.py", line 33, in get_prep_value
    return SparseVector._to_db(value)
  File "PGVECTORENV/lib/python3.10/site-packages/pgvector/utils/sparsevec.py", line 125, in _to_db
    value = cls(value)
  File "PGVECTORENV/lib/python3.10/site-packages/pgvector/utils/sparsevec.py", line 16, in __init__
    raise ValueError('missing dimensions')
ValueError: missing dimensions

ankane commented 3 months ago

Hi @QBH3, that error occurs if you try to create a sparse vector with a dictionary (regardless of whether dimensions are set on the column).

# error
Node(embedding={1:2}).save()

# no error
Node(embedding=SparseVector({1:2}, 30522)).save()
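
The call path in the traceback can be illustrated with a minimal sketch. SparseVectorSketch and try_to_db below are hypothetical stand-ins written for this example; they are not pgvector's actual implementation, and only the 'missing dimensions' message and the constructor call without a dimensions argument are taken from the traceback above. The sketch assumes a dict lists only the non-zero entries, so the total length cannot be inferred from it.

```python
class SparseVectorSketch:
    """Simplified, hypothetical stand-in for pgvector's SparseVector."""

    def __init__(self, value, dimensions=None):
        if isinstance(value, dict):
            # A dict only lists non-zero entries, so the total length
            # cannot be inferred and must be passed explicitly.
            if dimensions is None:
                raise ValueError('missing dimensions')
            self.dimensions = dimensions
            self.elements = {i: v for i, v in value.items() if v != 0}
        else:
            # A dense sequence carries its own length.
            self.dimensions = len(value)
            self.elements = {i: v for i, v in enumerate(value) if v != 0}

    @classmethod
    def _to_db(cls, value):
        # Mirrors the call path in the traceback: a value that is not
        # already an instance is passed to the constructor on its own,
        # without any dimensions argument.
        if not isinstance(value, cls):
            value = cls(value)
        return value


def try_to_db(value):
    """Return 'ok', or the error message raised while preparing the value."""
    try:
        SparseVectorSketch._to_db(value)
        return 'ok'
    except ValueError as exc:
        return str(exc)


dict_result = try_to_db({1: 2})
wrapped_result = try_to_db(SparseVectorSketch({1: 2}, 30522))
dense_result = try_to_db([0, 2, 0])
print(dict_result, wrapped_result, dense_result)
```

In this sketch the column's declared dimension never reaches the constructor, which is why wrapping the dict explicitly before assigning it to the field avoids the error.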

QBH3 commented 3 months ago

I suggest rephrasing the exception message as: raise ValueError('can not be initialized from a dict')