langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Calling Chroma.from_documents() returns sqlite3.DatabaseError: database disk image is malformed #28368

Open moore269 opened 4 hours ago

moore269 commented 4 hours ago

Checked other resources

### Example Code

Below is the sample code:

```python
import os
import sys
import time
from pathlib import Path

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from langchain_chroma import Chroma
from openai import AzureOpenAI

# API_BASE, VERSION, db_directory_path, docs and set_permissions
# are defined elsewhere in the project.


class PromptOperator:

    def embed_documents(self, text):
        # Authenticate against Azure OpenAI with an Entra ID bearer token.
        try:
            token_provider = get_bearer_token_provider(
                DefaultAzureCredential(),
                "https://cognitiveservices.azure.com/.default",
            )
        except Exception as e:
            print(f"\033[91mError getting token provider: {e}\033[0m")
            sys.exit(1)

        client = AzureOpenAI(
            azure_endpoint=API_BASE,
            azure_ad_token_provider=token_provider,
            api_version=VERSION,
        )
        response = client.embeddings.create(input=text,
                                            model="text-embedding-3-small")
        # One embedding vector per input text.
        results = [emb.embedding for emb in response.data]
        return results

    def embed_query(self, text):
        result = self.embed_documents([text])
        return result[0]


po = PromptOperator()
EMBEDDING_FUNCTION = po
vector_db_path = Path(db_directory_path) / "context_vector_db"

# Remove any previous vector store before rebuilding it.
if vector_db_path.exists():
    os.system(f"rm -r {vector_db_path}")
    time.sleep(5)

vector_db_path.mkdir(exist_ok=True)
set_permissions(str(vector_db_path), 0o755)

Chroma.from_documents(docs, EMBEDDING_FUNCTION, persist_directory=str(vector_db_path))
```

`docs` looks something like this:

```python
[Document(metadata={'table_name': 'account', 'original_column_name': 'account_id', 'column_name': 'account id', 'column_description': 'the id of the account', 'value_description': ''}, page_content='the id of the account'),
...]
```
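As a side note on the directory reset in the sample code, here is a minimal sketch of doing the same step with the standard library instead of `os.system("rm -r ...")` plus a fixed sleep. The helper name `reset_directory` is hypothetical, and `path.chmod` stands in for the project's own `set_permissions`, whose behaviour is not shown in this issue. Nothing here is confirmed to fix the reported error; it only removes one source of uncertainty about whether the old directory was fully deleted.

```python
import shutil
from pathlib import Path


def reset_directory(path: Path, mode: int = 0o755) -> Path:
    """Delete `path` if it exists, then recreate it empty with `mode`.

    Hypothetical helper: not part of the original project.
    """
    if path.exists():
        # rmtree raises on failure instead of failing silently like os.system.
        shutil.rmtree(path)
    # parents=True is an addition; the original code only used exist_ok=True.
    path.mkdir(parents=True, exist_ok=True)
    path.chmod(mode)
    return path
```

With this in place, the reset in the sample code would reduce to `reset_directory(vector_db_path)` before calling `Chroma.from_documents(...)`.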

Lastly, here is a dump of the currently installed packages. I am also on **sqlite3.version=3.45.3**.

Package                                  Version
---------------------------------------- -----------
aiohappyeyeballs                         2.4.3
aiohttp                                  3.11.7
aiosignal                                1.3.1
aiosqlite                                0.20.0
annotated-types                          0.7.0
anthropic                                0.39.0
anyio                                    4.6.2.post1
asgiref                                  3.8.1
asttokens                                2.4.1
attrs                                    24.2.0
azure-core                               1.32.0
azure-identity                           1.19.0
azure-storage-blob                       12.24.0
backcall                                 0.2.0
backoff                                  2.2.1
bcrypt                                   4.2.1
beautifulsoup4                           4.12.3
bleach                                   6.2.0
build                                    1.2.2.post1
cachetools                               5.5.0
certifi                                  2024.8.30
cffi                                     1.17.1
charset-normalizer                       3.4.0
chroma-hnswlib                           0.7.6
chromadb                                 0.5.20
click                                    8.1.7
coloredlogs                              15.0.1
comm                                     0.2.2
cryptography                             43.0.3
dataclasses-json                         0.6.7
datasets                                 3.1.0
datasketch                               1.6.5
debugpy                                  1.8.9
decorator                                5.1.1
defusedxml                               0.7.1
Deprecated                               1.2.15
dill                                     0.3.8
distro                                   1.9.0
docopt                                   0.6.2
docstring_parser                         0.16
durationpy                               0.9
executing                                2.1.0
faiss-cpu                                1.9.0.post1
fastapi                                  0.115.5
fastjsonschema                           2.20.0
filelock                                 3.16.1
flatbuffers                              24.3.25
frozenlist                               1.5.0
fsspec                                   2024.9.0
func_timeout                             4.3.5
google-ai-generativelanguage             0.6.10
google-api-core                          2.23.0
google-api-python-client                 2.154.0
google-auth                              2.36.0
google-auth-httplib2                     0.2.0
google-cloud-aiplatform                  1.73.0
google-cloud-bigquery                    3.27.0
google-cloud-core                        2.4.1
google-cloud-resource-manager            1.13.1
google-cloud-storage                     2.18.2
google-crc32c                            1.6.0
google-generativeai                      0.8.3
google-resumable-media                   2.7.2
googleapis-common-protos                 1.66.0
greenlet                                 3.1.1
grpc-google-iam-v1                       0.13.1
grpcio                                   1.68.0
grpcio-status                            1.68.0
h11                                      0.14.0
httpcore                                 1.0.7
httplib2                                 0.22.0
httptools                                0.6.4
httpx                                    0.27.2
httpx-sse                                0.4.0
huggingface-hub                          0.26.2
humanfriendly                            10.0
idna                                     3.10
importlib_metadata                       8.5.0
importlib_resources                      6.4.5
ipykernel                                6.29.5
ipython                                  8.12.3
isodate                                  0.7.2
jedi                                     0.19.2
Jinja2                                   3.1.4
jiter                                    0.7.1
joblib                                   1.4.2
jsonpatch                                1.33
jsonpointer                              3.0.0
jsonschema                               4.23.0
jsonschema-specifications                2024.10.1
jupyter_client                           8.6.3
jupyter_core                             5.7.2
jupyterlab_pygments                      0.3.0
kubernetes                               31.0.0
langchain                                0.3.8
langchain-anthropic                      0.3.0
langchain-chroma                         0.1.4
langchain-community                      0.3.8
langchain-core                           0.3.21
langchain-google-genai                   2.0.5
langchain-google-vertexai                2.0.7
langchain-openai                         0.2.10
langchain-text-splitters                 0.3.2
langgraph                                0.2.53
langgraph-checkpoint                     2.0.6
langgraph-sdk                            0.1.36
langsmith                                0.1.146
markdown-it-py                           3.0.0
MarkupSafe                               3.0.2
marshmallow                              3.23.1
matplotlib-inline                        0.1.7
mdurl                                    0.1.2
mistune                                  3.0.2
mmh3                                     5.0.1
monotonic                                1.6
mpmath                                   1.3.0
msal                                     1.31.1
msal-extensions                          1.2.0
msgpack                                  1.1.0
multidict                                6.1.0
multiprocess                             0.70.16
mypy-extensions                          1.0.0
nbclient                                 0.10.0
nbconvert                                7.16.4
nbformat                                 5.10.4
nest-asyncio                             1.6.0
networkx                                 3.4.2
numpy                                    1.26.4
nvidia-cublas-cu12                       12.4.5.8
nvidia-cuda-cupti-cu12                   12.4.127
nvidia-cuda-nvrtc-cu12                   12.4.127
nvidia-cuda-runtime-cu12                 12.4.127
nvidia-cudnn-cu12                        9.1.0.70
nvidia-cufft-cu12                        11.2.1.3
nvidia-curand-cu12                       10.3.5.147
nvidia-cusolver-cu12                     11.6.1.9
nvidia-cusparse-cu12                     12.3.1.170
nvidia-nccl-cu12                         2.21.5
nvidia-nvjitlink-cu12                    12.4.127
nvidia-nvtx-cu12                         12.4.127
oauthlib                                 3.2.2
onnxruntime                              1.20.1
openai                                   1.55.1
opentelemetry-api                        1.28.2
opentelemetry-exporter-otlp-proto-common 1.28.2
opentelemetry-exporter-otlp-proto-grpc   1.28.2
opentelemetry-instrumentation            0.49b2
opentelemetry-instrumentation-asgi       0.49b2
opentelemetry-instrumentation-fastapi    0.49b2
opentelemetry-proto                      1.28.2
opentelemetry-sdk                        1.28.2
opentelemetry-semantic-conventions       0.49b2
opentelemetry-util-http                  0.49b2
orjson                                   3.10.12
overrides                                7.7.0
packaging                                24.2
pandas                                   2.2.3
pandocfilters                            1.5.1
parso                                    0.8.4
pexpect                                  4.9.0
pickleshare                              0.7.5
pillow                                   11.0.0
pip                                      24.2
pip-chill                                1.0.3
pipreqs                                  0.5.0
platformdirs                             4.3.6
portalocker                              2.10.1
posthog                                  3.7.3
prompt_toolkit                           3.0.48
propcache                                0.2.0
proto-plus                               1.25.0
protobuf                                 5.28.3
psutil                                   6.1.0
ptyprocess                               0.7.0
pure_eval                                0.2.3
pyarrow                                  18.1.0
pyasn1                                   0.6.1
pyasn1_modules                           0.4.1
pycparser                                2.22
pydantic                                 2.9.0
pydantic_core                            2.23.2
pydantic-settings                        2.6.1
Pygments                                 2.18.0
PyJWT                                    2.10.0
pyparsing                                3.2.0
PyPika                                   0.48.9
pyproject_hooks                          1.2.0
python-dateutil                          2.9.0.post0
python-dotenv                            1.0.1
pytz                                     2024.2
PyYAML                                   6.0.2
pyzmq                                    26.2.0
referencing                              0.35.1
regex                                    2024.11.6
requests                                 2.32.3
requests-oauthlib                        2.0.0
requests-toolbelt                        1.0.0
rich                                     13.9.4
rpds-py                                  0.21.0
rsa                                      4.9
safetensors                              0.4.5
scikit-learn                             1.5.2
scipy                                    1.14.1
sentence-transformers                    3.3.1
setuptools                               75.1.0
shapely                                  2.0.6
shellingham                              1.5.4
six                                      1.16.0
sniffio                                  1.3.1
soupsieve                                2.6
SQLAlchemy                               2.0.35
sqlglot                                  25.32.0
sqlvalidator                             0.0.20
stack-data                               0.6.3
starlette                                0.41.3
sympy                                    1.13.1
tenacity                                 9.0.0
threadpoolctl                            3.5.0
tiktoken                                 0.8.0
tinycss2                                 1.4.0
tokenizers                               0.20.4
torch                                    2.5.1
tornado                                  6.4.2
tqdm                                     4.67.1
traitlets                                5.14.3
transformers                             4.46.3
triton                                   3.1.0
typer                                    0.13.1
typing_extensions                        4.12.2
typing-inspect                           0.9.0
tzdata                                   2024.2
uritemplate                              4.1.1
urllib3                                  2.2.3
uvicorn                                  0.32.1
uvloop                                   0.21.0
watchfiles                               1.0.0
wcwidth                                  0.2.13
webencodings                             0.5.1
websocket-client                         1.8.0
websockets                               14.1
wheel                                    0.44.0
wrapt                                    1.17.0
xxhash                                   3.5.0
yarg                                     0.1.9
yarl                                     1.18.0
zipp                                     3.21.0

### Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/jomoor1/code/Users/jomoor/CHESS-main/CHESS-main/./src/preprocess.py", line 59, in <module>
    worker_initializer(args.db_id, args)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/jomoor1/code/Users/jomoor/CHESS-main/CHESS-main/./src/preprocess.py", line 33, in worker_initializer
    make_db_context_vec_db(db_directory_path,
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/jomoor1/code/Users/jomoor/CHESS-main/CHESS-main/src/database_utils/db_catalog/preprocess.py", line 90, in make_db_context_vec_db
    Chroma.from_documents(docs, EMBEDDING_FUNCTION, persist_directory=str(vector_db_path))
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/langchain_chroma/vectorstores.py", line 1128, in from_documents
    return cls.from_texts(
           ^^^^^^^^^^^^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/langchain_chroma/vectorstores.py", line 1061, in from_texts
    chroma_collection = cls(
                        ^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/langchain_chroma/vectorstores.py", line 313, in __init__
    self._client = chromadb.Client(_client_settings)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/__init__.py", line 334, in Client
    return ClientCreator(tenant=tenant, database=database, settings=settings)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/api/client.py", line 58, in __init__
    super().__init__(settings=settings)
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/api/shared_system_client.py", line 19, in __init__
    SharedSystemClient._create_system_if_not_exists(self._identifier, settings)
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/api/shared_system_client.py", line 32, in _create_system_if_not_exists
    new_system.start()
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/config.py", line 444, in start
    component.start()
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 150, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/db/impl/sqlite.py", line 104, in start
    self.initialize_migrations()
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/db/migrations.py", line 140, in initialize_migrations
    self.apply_migrations()
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 150, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/db/migrations.py", line 167, in apply_migrations
    db_migrations = self.db_migrations(dir)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 150, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/db/impl/sqlite.py", line 202, in db_migrations
    cur.execute(
sqlite3.DatabaseError: database disk image is malformed
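The last frames of the traceback are inside `chromadb/db/impl/sqlite.py` while migrations are being read, and they are reached from the `cls(...)` constructor call in `from_texts`, so the failure happens while the database is being opened rather than while embeddings are computed. One way to separate Chroma from the filesystem is to check whether plain `sqlite3` can write and re-read a database in the same directory. The sketch below assumes nothing about Chroma; the function name `probe_sqlite` and the file name `probe.sqlite3` are hypothetical. It is also only an assumption that the location matters here: the paths in the traceback sit under `/mnt/batch/tasks/shared/LS_root/mounts/...`, which may be a network-mounted share, and SQLite is known to be sensitive to file locking on such filesystems.

```python
import sqlite3
from pathlib import Path


def probe_sqlite(directory: Path) -> str:
    """Write and re-read a throwaway SQLite file in `directory`.

    Returns the first row of PRAGMA integrity_check ("ok" when healthy).
    """
    db_file = directory / "probe.sqlite3"  # hypothetical file name
    conn = sqlite3.connect(db_file)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS probe (id INTEGER PRIMARY KEY, note TEXT)"
        )
        conn.execute("INSERT INTO probe (note) VALUES (?)", ("hello",))
        conn.commit()
    finally:
        conn.close()

    # Reopen with a fresh connection before running the check.
    conn = sqlite3.connect(db_file)
    try:
        (result,) = conn.execute("PRAGMA integrity_check").fetchone()
    finally:
        conn.close()
    db_file.unlink()
    return result
```

If `probe_sqlite(vector_db_path)` raises or returns something other than `"ok"` while the same call on a local directory such as `/tmp` succeeds, that would point at the storage location rather than at LangChain or the custom embedding class.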

### Description

`Chroma.from_documents()` appears to fail when it tries to write the persisted files. I have tried the latest versions of the Python libraries and several Python versions (3.10, 3.11, 3.12), all with the same error. The only thing I am doing that differs from a standard setup is defining my own embedding object and passing it in; I made sure it has the required method (`embed_documents`). I found two files in the persisted output folder. Here are their contents:

00001-embeddings.sqlite.sql

```
CREATE TABLE embeddings_queue (
    seq_id INTEGER PRIMARY KEY,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    operation INTEGER NOT NULL,
    topic TEXT NOT NULL,
    id TEXT NOT NULL,
    vector BLOB,
    encoding TEXT,
    metadata TEXT
);
```

00002-embeddings-queue-config.sqlite.sql

```
CREATE TABLE embeddings_queue_config (
    id INTEGER PRIMARY KEY,
    config_json_str TEXT
);
```

### System Info

- OS: Linux
- OS Version: #82~20.04.1-Ubuntu SMP Tue Sep 3 12:27:43 UTC 2024
- Python Version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct 4 2024, 13:27:36) [GCC 11.2.0]

Package Information

- langchain_core: 0.3.21
- langchain: 0.3.8
- langchain_community: 0.3.8
- langsmith: 0.1.146
- langchain_anthropic: 0.3.0
- langchain_chroma: 0.1.4
- langchain_google_genai: 2.0.5
- langchain_google_vertexai: 2.0.7
- langchain_openai: 0.2.10
- langchain_text_splitters: 0.3.2
- langgraph_sdk: 0.1.36

Optional packages not installed

langserve

Other Dependencies

- aiohttp: 3.11.7
- anthropic: 0.39.0
- anthropic[vertexai]: Installed. No version info available.
- async-timeout: Installed. No version info available.
- chromadb: 0.5.20
- dataclasses-json: 0.6.7
- defusedxml: 0.7.1
- fastapi: 0.115.5
- google-cloud-aiplatform: 1.73.0
- google-cloud-storage: 2.18.2
- google-generativeai: 0.8.3
- httpx: 0.27.2
- httpx-sse: 0.4.0
- jsonpatch: 1.33
- langchain-mistralai: Installed. No version info available.
- numpy: 1.26.4
- openai: 1.55.1
- orjson: 3.10.12
- packaging: 24.2
- pydantic: 2.9.0
- pydantic-settings: 2.6.1
- PyYAML: 6.0.2
- requests: 2.32.3
- requests-toolbelt: 1.0.0
- SQLAlchemy: 2.0.35
- tenacity: 9.0.0
- tiktoken: 0.8.0
- typing-extensions: 4.12.2

moore269 commented 2 hours ago

I just tried this on Windows. It works on Windows but not on Linux. I'm not sure why, but maybe it has something to do with differing SQLite installs?
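To compare the two machines on the SQLite question, a small sketch like the following could be run on both and the output pasted side by side. The function name `sqlite_environment` is hypothetical. It reports `sqlite3.sqlite_version`, which is the version of the SQLite library that Python is linked against; whether a version difference is actually responsible for the error is not established in this thread.

```python
import platform
import sqlite3
import sys


def sqlite_environment() -> dict:
    """Collect the platform, Python version, and linked SQLite library version."""
    return {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
        "sqlite_library": sqlite3.sqlite_version,
    }


if __name__ == "__main__":
    for key, value in sqlite_environment().items():
        print(f"{key}: {value}")
```

If both machines report the same SQLite library version, the difference is more likely to lie elsewhere, for example in where the persist directory is stored.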