Calling Chroma.from_documents() returns sqlite3.DatabaseError: database disk image is malformed #28368

Open moore269 opened 4 hours ago

moore269 commented 4 hours ago

### Example Code

Below is the sample code

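import os
from pathlib import Path

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from langchain_chroma import Chroma
from openai import AzureOpenAI

# API_BASE, VERSION, db_directory_path, docs, set_permissions and
# EMBEDDING_FUNCTION are defined elsewhere in the project.

class PromptOperator:

    def embed_documents(self, text):
        # Build an Azure AD token provider (the scope string is left empty in this report).
        try:
            token_provider = get_bearer_token_provider(DefaultAzureCredential(), "")
        except Exception as e:
            print(f"\033[91mError getting token provider: {e}\033[0m")
            raise  # without a token provider the client below cannot be built

        client = AzureOpenAI(
            azure_endpoint = API_BASE,
            api_version = VERSION,
            azure_ad_token_provider = token_provider,
        )
        response = client.embeddings.create(input = text,
                                            model= "text-embedding-3-small")
        # One embedding vector per input text, in input order.
        results = [emb.embedding for emb in response.data]
        return results

    def embed_query(self, text):
        result = self.embed_documents([text])
        return result[0]

po = PromptOperator()
vector_db_path = Path(db_directory_path) / "context_vector_db"

# Remove any previously persisted store before rebuilding it.
if vector_db_path.exists():
    os.system(f"rm -r {vector_db_path}")

set_permissions(str(vector_db_path), 0o755)

Chroma.from_documents(docs, EMBEDDING_FUNCTION, persist_directory=str(vector_db_path))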
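
The directory reset in the snippet above shells out to `rm -r` and then calls a `set_permissions` helper whose definition is not shown. A minimal sketch of the same step using only the standard library, assuming the helper does nothing more than chmod the directory (`reset_directory` is a hypothetical name, not part of the reported code):

```python
import os
import shutil
import tempfile
from pathlib import Path


def reset_directory(path: Path, mode: int = 0o755) -> Path:
    """Delete `path` if it exists, then recreate it empty with `mode`.

    Hypothetical stand-in for the `rm -r` + `set_permissions` step.
    """
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True)
    # The mode passed to mkdir is filtered by the umask, so set it explicitly.
    os.chmod(path, mode)
    return path


if __name__ == "__main__":
    base = Path(tempfile.mkdtemp())
    target = reset_directory(base / "context_vector_db")
    print(target, oct(target.stat().st_mode & 0o777))
```

Unlike `os.system`, `shutil.rmtree` raises if the removal fails, so a half-deleted store from a previous run would surface as an exception rather than being silently reused.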

`docs` looks something like this:

Document(metadata={'table_name': 'account', 'original_column_name': 'account_id', 'column_name': 'account id', 'column_description': 'the id of the account', 'value_description': ''}, page_content='the id of the account'),

Lastly, a dump of the currently installed packages is included under System Info below. I am also on **sqlite3.version=3.45.3**

### Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/jomoor1/code/Users/jomoor/CHESS-main/CHESS-main/./src/", line 59, in <module>
    worker_initializer(args.db_id, args)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/jomoor1/code/Users/jomoor/CHESS-main/CHESS-main/./src/", line 33, in worker_initializer
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/jomoor1/code/Users/jomoor/CHESS-main/CHESS-main/src/database_utils/db_catalog/", line 90, in make_db_context_vec_db
    Chroma.from_documents(docs, EMBEDDING_FUNCTION, persist_directory=str(vector_db_path))
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/langchain_chroma/", line 1128, in from_documents
    return cls.from_texts(
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/langchain_chroma/", line 1061, in from_texts
    chroma_collection = cls(
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/langchain_chroma/", line 313, in __init__
    self._client = chromadb.Client(_client_settings)
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/", line 334, in Client
    return ClientCreator(tenant=tenant, database=database, settings=settings)
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/api/", line 58, in __init__
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/api/", line 19, in __init__
    SharedSystemClient._create_system_if_not_exists(self._identifier, settings)
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/api/", line 32, in _create_system_if_not_exists
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/", line 444, in start
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/telemetry/opentelemetry/", line 150, in wrapper
    return f(*args, **kwargs)
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/db/impl/", line 104, in start
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/db/", line 140, in initialize_migrations
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/telemetry/opentelemetry/", line 150, in wrapper
    return f(*args, **kwargs)
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/db/", line 167, in apply_migrations
    db_migrations = self.db_migrations(dir)
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/telemetry/opentelemetry/", line 150, in wrapper
    return f(*args, **kwargs)
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/db/impl/", line 202, in db_migrations
sqlite3.DatabaseError: database disk image is malformed

### Description

Chroma.from_documents() appears to fail while writing the persisted files. I have tried the latest Python libraries and several Python versions (3.10, 3.11, 3.12), all with the same error. The only thing I am doing differently from a standard setup is defining my own embedding object and passing it in; I made sure it has the required method (embed_documents). I found two files in the persisted output folder. Here are the contents:


```
CREATE TABLE embeddings_queue (
    operation INTEGER NOT NULL,
    topic TEXT NOT NULL,
    vector BLOB,
    encoding TEXT,
    metadata TEXT
);

CREATE TABLE embeddings_queue_config (
    config_json_str TEXT
);
```
### System Info

- OS: Linux
- OS Version: #82~20.04.1-Ubuntu SMP Tue Sep 3 12:27:43 UTC 2024
- Python Version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct 4 2024, 13:27:36) [GCC 11.2.0]

Package Information

- langchain_core: 0.3.21
- langchain: 0.3.8
- langchain_community: 0.3.8
- langsmith: 0.1.146
- langchain_anthropic: 0.3.0
- langchain_chroma: 0.1.4
- langchain_google_genai: 2.0.5
- langchain_google_vertexai: 2.0.7
- langchain_openai: 0.2.10
- langchain_text_splitters: 0.3.2
- langgraph_sdk: 0.1.36

Optional packages not installed


Other Dependencies

- aiohttp: 3.11.7
- anthropic: 0.39.0
- anthropic[vertexai]: Installed. No version info available.
- async-timeout: Installed. No version info available.
- chromadb: 0.5.20
- dataclasses-json: 0.6.7
- defusedxml: 0.7.1
- fastapi: 0.115.5
- google-cloud-aiplatform: 1.73.0
- google-cloud-storage: 2.18.2
- google-generativeai: 0.8.3
- httpx: 0.27.2
- httpx-sse: 0.4.0
- jsonpatch: 1.33
- langchain-mistralai: Installed. No version info available.
- numpy: 1.26.4
- openai: 1.55.1
- orjson: 3.10.12
- packaging: 24.2
- pydantic: 2.9.0
- pydantic-settings: 2.6.1
- PyYAML: 6.0.2
- requests: 2.32.3
- requests-toolbelt: 1.0.0
- SQLAlchemy: 2.0.35
- tenacity: 9.0.0
- tiktoken: 0.8.0
- typing-extensions: 4.12.2

moore269 commented 2 hours ago

I just tried this on Windows. The same code works on Windows but fails on Linux. I'm not sure why, but maybe it has something to do with differing SQLite installs?
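
One way to narrow this down without Chroma in the loop is to create a throwaway SQLite database in a given directory, write to it, and run `PRAGMA integrity_check`. The sketch below is only a diagnostic under stated assumptions: `sqlite_smoke_test` is a hypothetical helper, the persist path in the comment is a placeholder, and the idea that the storage location (rather than the SQLite build) is the variable worth isolating is a guess, not something confirmed in this thread.

```python
import sqlite3
import tempfile
from pathlib import Path


def sqlite_smoke_test(directory: Path) -> str:
    """Create a small database in `directory`, write to it, and return the
    first row of PRAGMA integrity_check ("ok" when the file is healthy)."""
    db_file = directory / "smoke_test.sqlite3"
    conn = sqlite3.connect(db_file)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)")
        conn.executemany("INSERT INTO t (v) VALUES (?)", [("a",), ("b",)])
        conn.commit()
        (result,) = conn.execute("PRAGMA integrity_check").fetchone()
        return result
    finally:
        conn.close()
        # Clean up the throwaway file so the directory is left as it was.
        db_file.unlink(missing_ok=True)


if __name__ == "__main__":
    # Version of the SQLite library that Python is linked against.
    print("SQLite library version:", sqlite3.sqlite_version)
    with tempfile.TemporaryDirectory() as tmp:
        print("local temp dir:", sqlite_smoke_test(Path(tmp)))
    # Placeholder: point this at the directory used as persist_directory.
    # print("persist dir:", sqlite_smoke_test(Path("/path/to/context_vector_db")))
```

If the local temporary directory reports `ok` while the persist directory raises `sqlite3.DatabaseError` or reports anything else, that would point at where the files are stored rather than at the custom embedding class.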