nextcloud / context_chat_backend

GNU Affero General Public License v3.0
4 stars 5 forks source link

[bug]: "Exception in ASGI application" with usernames longer than 56 characters #39

Open telepath opened 3 months ago

telepath commented 3 months ago

Describe the bug I'm using Keycloak for authentication to nextcloud, making the usernames a 64 character hex string. When I try to use the context chat, it fails with the error message AssertionError: Error: invalid user_id format, should consist of alphanumeric characters, hyphen, underscore, dot, and space only. Length should not exceed 56 characters.

To Reproduce Steps to reproduce the behavior:

  1. have context chat and context chat backend installed
  2. log into nextcloud with a user that has a username longer than 56 characterrs
  3. use the context chat
  4. check backend log for error

Expected behavior The bot should give an answer to the query

Server logs (if applicable)

``` ```

Context Chat Backend logs (if applicable, from the docker container)

``` ERROR: Exception in ASGI application Traceback (most recent call last): File "/usr/local/lib/python3.11/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi result = await app( # type: ignore[func-returns-value] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__ return await self.app(scope, receive, send) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/uvicorn/middleware/message_logger.py", line 84, in __call__ raise exc from None File "/usr/local/lib/python3.11/dist-packages/uvicorn/middleware/message_logger.py", line 80, in __call__ await self.app(scope, inner_receive, inner_send) File "/usr/local/lib/python3.11/dist-packages/fastapi/applications.py", line 1115, in __call__ await super().__call__(scope, receive, send) File "/usr/local/lib/python3.11/dist-packages/starlette/applications.py", line 122, in __call__ await self.middleware_stack(scope, receive, send) File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/errors.py", line 184, in __call__ raise exc File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/errors.py", line 162, in __call__ await self.app(scope, receive, _send) File "/app/context_chat_backend/ocs_utils.py", line 75, in __call__ await self.app(scope, receive, send) File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__ raise exc File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__ await self.app(scope, receive, sender) File "/usr/local/lib/python3.11/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__ raise e File "/usr/local/lib/python3.11/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__ await self.app(scope, receive, send) File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 718, in __call__ await route.handle(scope, receive, send) File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 276, in handle await self.app(scope, receive, send) File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 66, in app response = await func(request) ^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/fastapi/routing.py", line 274, in app raw_response = await run_endpoint_function( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/fastapi/routing.py", line 193, in run_endpoint_function return await run_in_threadpool(dependant.call, **values) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/starlette/concurrency.py", line 41, in run_in_threadpool return await anyio.to_thread.run_sync(func, *args) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/anyio/to_thread.py", line 33, in run_sync return await get_asynclib().run_sync_in_worker_thread( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread return await future ^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/anyio/_backends/_asyncio.py", line 807, in run result = context.run(func, *args) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/app/context_chat_backend/utils.py", line 73, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/app/context_chat_backend/controller.py", line 200, in _ result = embed_sources(db, sources) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/app/context_chat_backend/chain/ingest/injest.py", line 167, in embed_sources return _process_sources(vectordb, sources_filtered) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/app/context_chat_backend/chain/ingest/injest.py", line 120, in _process_sources filtered_docs = _filter_documents(user_id, vectordb, documents) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/app/context_chat_backend/chain/ingest/injest.py", line 36, in _filter_documents existing_objects = vectordb.get_objects_from_metadata( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/app/context_chat_backend/vectordb/chroma.py", line 92, in get_objects_from_metadata self.setup_schema(user_id) File "/app/context_chat_backend/vectordb/chroma.py", line 47, in setup_schema self.client.get_or_create_collection(get_collection_name(user_id)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/app/context_chat_backend/vectordb/__init__.py", line 29, in get_collection_name raise AssertionError('Error: invalid user_id format, should consist of alphanumeric characters, hyphen, underscore, dot, and space only. Length should not exceed 56 characters.') # noqa: E501 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ```

Screenshots If applicable, add screenshots to help explain your problem.

Setup Details (please complete the following information):

Additional context I'm not sure if changing the user IDs would help, since even email addresses can get longer than 56 characters.

kyteinsky commented 3 months ago

Hello, yeah that's an issue, we can't even consider 64 characters as the ceiling.
The most crude solution that comes to mind could be to hash the user id with something like md5 to get a 32 char unique derivation of the user id. Maybe we can come up with a better solution since this would definitely not be ideal. If we try to maintain backward compatibility, we would be hashing only usernames that cross the length limit, messing the db up further.

The user id is used to generate a table name with the format 'Vector_\<userid>' btw. Chroma db is throwing the error that you see.

telepath commented 3 months ago

Using a hash seems like a valid solution for a unique table name, but as you say, backward compatibility would be an issue. The clean solution would be a migration to hashes, but that makes downgrading impossible, of course... Without a hash you could have generic table names and a lookup table. In any case, the current method seems to be only valid for specific cases, so some other solution needs to be found, backward compatible or not...

telepath commented 3 months ago

After looking at the code I get the impression that using a hash would have been the easier solution from the start. It removes the need for all the checks, which aren't feasible anyway, since nobody will format their usernames after the requirements of an addon:

--- a/context_chat_backend/vectordb/__init__.py
+++ b/context_chat_backend/vectordb/__init__.py
@@ -1,5 +1,6 @@
 import re
 from importlib import import_module
+from hashlib import sha224

 from .base import BaseVectorDB, MetadataFilter

@@ -25,25 +26,7 @@ def get_collection_name(user_id: str) -> str:
        if user_id in user_id_cache:
                return user_id_cache[user_id]

-       if not re_user_id.match(user_id):
-               raise AssertionError('Error: invalid user_id format, should consist of alphanumeric characters, hyphen, underscore, dot, and space only. Length should not exceed 56 characters.')  # noqa: E501
-
-       # should not end in a special character
-       if user_id[-1] in '_.-@ ':
-               raise AssertionError('Error: user_id should not end in a special character.')
-
-       # replace space with double underscore
-       user_id = user_id.replace(' ', '__')
-       # replace consecutive dots with .n. (n is the number of dots)
-       user_id = re.sub(r'\.{2,}', lambda m: f'.{len(m.group())}.', user_id)
-       # replace @ with .at.
-       user_id = user_id.replace('@', '.at.')
-
-       # recheck length constraints
-       if len(user_id) > 56:
-               raise AssertionError(f'Error: length of cleaned up user_id should not exceed 56 characters, processed username: {user_id}.')  # noqa: E501
-
-       collection_name = f'Vector_{user_id}'
+       collection_name = f'Vector_{sha224(user_id).hexdigest()}'
        user_id_cache[user_id] = collection_name

        return collection_name

Of course this affects get_user_id_from_collection(), but on first glance I can't see that it is used anywhere apart from testing.

So to me it looks like this would be the better way to do it. And regarding backward compatibility, it is still better to break it now than later.

The only remaining issue is the possibility of collisions, but that's only a theoretical problem and will be vastly more rare than the current issue.

kyteinsky commented 3 months ago

Hello, I discussed this issue with senior developers, they mention the same issues you do like the collision and maintenance of reverse lookup table for user ids.

We think it would be nice to use this opportunity to change the schema to support ACLs (and common documents) which would prevent duplication of document chunks and embeddings for shared documents. We plan on making two collections. One for vectors and the other for ACLs. This is due to the fact that Chroma DB does not support list type of metadata and metadata filters don't match on partial strings afaik. We could drop Chroma and just use Weaviate but the latter does not support schema change and we would like to have that for now, given the dynamic nature of this software at this stage. We're looking into the finer details but that is the general direction we think would solve both the issues and be future proof for a while.

Of course this affects get_user_id_from_collection(), but on first glance I can't see that it is used anywhere apart from testing.

It is used to get all the users to delete a metadata filtered set of documents (https://github.com/nextcloud/context_chat_backend/blob/master/context_chat_backend/vectordb/base.py#L212, https://github.com/nextcloud/context_chat_backend/blob/master/context_chat_backend/vectordb/chroma.py#L41). (necessary for custom data provider support)

And regarding backward compatibility, it is still better to break it now than later.

yup

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 4 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.