### Checked other resources
- [x] I added a very descriptive title to this issue.
- [x] I searched the LangChain documentation with the integrated search.
- [x] I used the GitHub search to find a similar question and didn't find it.
- [x] I am sure that this is a bug in LangChain rather than my code.
- [x] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
### Example Code
Below is the sample code:
```python
import os
import sys
import time
from pathlib import Path

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from langchain_chroma import Chroma
from openai import AzureOpenAI

# API_BASE, VERSION, db_directory_path, docs, and set_permissions are defined elsewhere.


class PromptOperator:
    def embed_documents(self, text):
        # Authenticate against Azure OpenAI with an Entra ID token provider.
        try:
            token_provider = get_bearer_token_provider(
                DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
            )
        except Exception as e:
            print(f"\033[91mError getting token provider: {e}\033[0m")
            sys.exit(1)

        client = AzureOpenAI(
            azure_endpoint=API_BASE,
            azure_ad_token_provider=token_provider,
            api_version=VERSION,
        )
        response = client.embeddings.create(input=text, model="text-embedding-3-small")
        results = [emb.embedding for emb in response.data]
        return results

    def embed_query(self, text):
        result = self.embed_documents([text])
        return result[0]


po = PromptOperator()
EMBEDDING_FUNCTION = po

# Recreate the persist directory from scratch, then build the vector store.
vector_db_path = Path(db_directory_path) / "context_vector_db"
if vector_db_path.exists():
    os.system(f"rm -r {vector_db_path}")
    time.sleep(5)
vector_db_path.mkdir(exist_ok=True)
set_permissions(str(vector_db_path), 0o755)
Chroma.from_documents(docs, EMBEDDING_FUNCTION, persist_directory=str(vector_db_path))
```
`docs` looks something like this:
```
[Document(metadata={'table_name': 'account', 'original_column_name': 'account_id', 'column_name': 'account id', 'column_description': 'the id of the account', 'value_description': ''}, page_content='the id of the account'),
...]
```
Lastly, here is a dump of the currently installed packages. I am also on **sqlite3.version=3.45.3**.
Package Version
---------------------------------------- -----------
aiohappyeyeballs 2.4.3
aiohttp 3.11.7
aiosignal 1.3.1
aiosqlite 0.20.0
annotated-types 0.7.0
anthropic 0.39.0
anyio 4.6.2.post1
asgiref 3.8.1
asttokens 2.4.1
attrs 24.2.0
azure-core 1.32.0
azure-identity 1.19.0
azure-storage-blob 12.24.0
backcall 0.2.0
backoff 2.2.1
bcrypt 4.2.1
beautifulsoup4 4.12.3
bleach 6.2.0
build 1.2.2.post1
cachetools 5.5.0
certifi 2024.8.30
cffi 1.17.1
charset-normalizer 3.4.0
chroma-hnswlib 0.7.6
chromadb 0.5.20
click 8.1.7
coloredlogs 15.0.1
comm 0.2.2
cryptography 43.0.3
dataclasses-json 0.6.7
datasets 3.1.0
datasketch 1.6.5
debugpy 1.8.9
decorator 5.1.1
defusedxml 0.7.1
Deprecated 1.2.15
dill 0.3.8
distro 1.9.0
docopt 0.6.2
docstring_parser 0.16
durationpy 0.9
executing 2.1.0
faiss-cpu 1.9.0.post1
fastapi 0.115.5
fastjsonschema 2.20.0
filelock 3.16.1
flatbuffers 24.3.25
frozenlist 1.5.0
fsspec 2024.9.0
func_timeout 4.3.5
google-ai-generativelanguage 0.6.10
google-api-core 2.23.0
google-api-python-client 2.154.0
google-auth 2.36.0
google-auth-httplib2 0.2.0
google-cloud-aiplatform 1.73.0
google-cloud-bigquery 3.27.0
google-cloud-core 2.4.1
google-cloud-resource-manager 1.13.1
google-cloud-storage 2.18.2
google-crc32c 1.6.0
google-generativeai 0.8.3
google-resumable-media 2.7.2
googleapis-common-protos 1.66.0
greenlet 3.1.1
grpc-google-iam-v1 0.13.1
grpcio 1.68.0
grpcio-status 1.68.0
h11 0.14.0
httpcore 1.0.7
httplib2 0.22.0
httptools 0.6.4
httpx 0.27.2
httpx-sse 0.4.0
huggingface-hub 0.26.2
humanfriendly 10.0
idna 3.10
importlib_metadata 8.5.0
importlib_resources 6.4.5
ipykernel 6.29.5
ipython 8.12.3
isodate 0.7.2
jedi 0.19.2
Jinja2 3.1.4
jiter 0.7.1
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
jupyter_client 8.6.3
jupyter_core 5.7.2
jupyterlab_pygments 0.3.0
kubernetes 31.0.0
langchain 0.3.8
langchain-anthropic 0.3.0
langchain-chroma 0.1.4
langchain-community 0.3.8
langchain-core 0.3.21
langchain-google-genai 2.0.5
langchain-google-vertexai 2.0.7
langchain-openai 0.2.10
langchain-text-splitters 0.3.2
langgraph 0.2.53
langgraph-checkpoint 2.0.6
langgraph-sdk 0.1.36
langsmith 0.1.146
markdown-it-py 3.0.0
MarkupSafe 3.0.2
marshmallow 3.23.1
matplotlib-inline 0.1.7
mdurl 0.1.2
mistune 3.0.2
mmh3 5.0.1
monotonic 1.6
mpmath 1.3.0
msal 1.31.1
msal-extensions 1.2.0
msgpack 1.1.0
multidict 6.1.0
multiprocess 0.70.16
mypy-extensions 1.0.0
nbclient 0.10.0
nbconvert 7.16.4
nbformat 5.10.4
nest-asyncio 1.6.0
networkx 3.4.2
numpy 1.26.4
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
oauthlib 3.2.2
onnxruntime 1.20.1
openai 1.55.1
opentelemetry-api 1.28.2
opentelemetry-exporter-otlp-proto-common 1.28.2
opentelemetry-exporter-otlp-proto-grpc 1.28.2
opentelemetry-instrumentation 0.49b2
opentelemetry-instrumentation-asgi 0.49b2
opentelemetry-instrumentation-fastapi 0.49b2
opentelemetry-proto 1.28.2
opentelemetry-sdk 1.28.2
opentelemetry-semantic-conventions 0.49b2
opentelemetry-util-http 0.49b2
orjson 3.10.12
overrides 7.7.0
packaging 24.2
pandas 2.2.3
pandocfilters 1.5.1
parso 0.8.4
pexpect 4.9.0
pickleshare 0.7.5
pillow 11.0.0
pip 24.2
pip-chill 1.0.3
pipreqs 0.5.0
platformdirs 4.3.6
portalocker 2.10.1
posthog 3.7.3
prompt_toolkit 3.0.48
propcache 0.2.0
proto-plus 1.25.0
protobuf 5.28.3
psutil 6.1.0
ptyprocess 0.7.0
pure_eval 0.2.3
pyarrow 18.1.0
pyasn1 0.6.1
pyasn1_modules 0.4.1
pycparser 2.22
pydantic 2.9.0
pydantic_core 2.23.2
pydantic-settings 2.6.1
Pygments 2.18.0
PyJWT 2.10.0
pyparsing 3.2.0
PyPika 0.48.9
pyproject_hooks 1.2.0
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
pytz 2024.2
PyYAML 6.0.2
pyzmq 26.2.0
referencing 0.35.1
regex 2024.11.6
requests 2.32.3
requests-oauthlib 2.0.0
requests-toolbelt 1.0.0
rich 13.9.4
rpds-py 0.21.0
rsa 4.9
safetensors 0.4.5
scikit-learn 1.5.2
scipy 1.14.1
sentence-transformers 3.3.1
setuptools 75.1.0
shapely 2.0.6
shellingham 1.5.4
six 1.16.0
sniffio 1.3.1
soupsieve 2.6
SQLAlchemy 2.0.35
sqlglot 25.32.0
sqlvalidator 0.0.20
stack-data 0.6.3
starlette 0.41.3
sympy 1.13.1
tenacity 9.0.0
threadpoolctl 3.5.0
tiktoken 0.8.0
tinycss2 1.4.0
tokenizers 0.20.4
torch 2.5.1
tornado 6.4.2
tqdm 4.67.1
traitlets 5.14.3
transformers 4.46.3
triton 3.1.0
typer 0.13.1
typing_extensions 4.12.2
typing-inspect 0.9.0
tzdata 2024.2
uritemplate 4.1.1
urllib3 2.2.3
uvicorn 0.32.1
uvloop 0.21.0
watchfiles 1.0.0
wcwidth 0.2.13
webencodings 0.5.1
websocket-client 1.8.0
websockets 14.1
wheel 0.44.0
wrapt 1.17.0
xxhash 3.5.0
yarg 0.1.9
yarl 1.18.0
zipp 3.21.0
### Error Message and Stack Trace (if applicable)
```
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/jomoor1/code/Users/jomoor/CHESS-main/CHESS-main/./src/preprocess.py", line 59, in <module>
    worker_initializer(args.db_id, args)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/jomoor1/code/Users/jomoor/CHESS-main/CHESS-main/./src/preprocess.py", line 33, in worker_initializer
    make_db_context_vec_db(db_directory_path,
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/jomoor1/code/Users/jomoor/CHESS-main/CHESS-main/src/database_utils/db_catalog/preprocess.py", line 90, in make_db_context_vec_db
    Chroma.from_documents(docs, EMBEDDING_FUNCTION, persist_directory=str(vector_db_path))
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/langchain_chroma/vectorstores.py", line 1128, in from_documents
    return cls.from_texts(
           ^^^^^^^^^^^^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/langchain_chroma/vectorstores.py", line 1061, in from_texts
    chroma_collection = cls(
                        ^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/langchain_chroma/vectorstores.py", line 313, in __init__
    self._client = chromadb.Client(_client_settings)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/__init__.py", line 334, in Client
    return ClientCreator(tenant=tenant, database=database, settings=settings)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/api/client.py", line 58, in __init__
    super().__init__(settings=settings)
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/api/shared_system_client.py", line 19, in __init__
    SharedSystemClient._create_system_if_not_exists(self._identifier, settings)
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/api/shared_system_client.py", line 32, in _create_system_if_not_exists
    new_system.start()
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/config.py", line 444, in start
    component.start()
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 150, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/db/impl/sqlite.py", line 104, in start
    self.initialize_migrations()
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/db/migrations.py", line 140, in initialize_migrations
    self.apply_migrations()
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 150, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/db/migrations.py", line 167, in apply_migrations
    db_migrations = self.db_migrations(dir)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 150, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/chess_engine_3.12/lib/python3.12/site-packages/chromadb/db/impl/sqlite.py", line 202, in db_migrations
    cur.execute(
sqlite3.DatabaseError: database disk image is malformed
```
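The traceback ends inside chromadb's SQLite migration step, during client start-up and before any embedding call. A minimal sketch for checking whether plain `sqlite3` can create, write, and verify a database file in a given directory is shown below. The function name `sqlite_roundtrip_ok` and the probe file name are hypothetical, and the demo uses a temporary directory as a stand-in for the real persist directory. This only narrows the problem down; it does not confirm a cause.

```python
import sqlite3
import tempfile
from pathlib import Path


def sqlite_roundtrip_ok(directory: Path) -> bool:
    """Create a throwaway SQLite file in `directory`, write one row,
    and report whether PRAGMA integrity_check returns "ok"."""
    db_file = directory / "sqlite_probe.sqlite3"  # hypothetical probe file name
    try:
        conn = sqlite3.connect(db_file)
        try:
            conn.execute("CREATE TABLE probe (id INTEGER PRIMARY KEY, payload TEXT)")
            conn.execute("INSERT INTO probe (payload) VALUES (?)", ("hello",))
            conn.commit()
            result = conn.execute("PRAGMA integrity_check").fetchone()[0]
        finally:
            conn.close()
        return result == "ok"
    except sqlite3.DatabaseError:
        # Covers "database disk image is malformed" as well as I/O and locking errors.
        return False
    finally:
        # Remove the probe file if it was created.
        db_file.unlink(missing_ok=True)


if __name__ == "__main__":
    # Stand-in directory; point this at the real persist directory to test it.
    with tempfile.TemporaryDirectory() as tmp:
        print(sqlite_roundtrip_ok(Path(tmp)))
```

If this returns `False` for the persist directory but `True` for a local temporary directory, the directory or its filesystem would be a more likely suspect than the LangChain call itself.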
### Description
`Chroma.from_documents()` appears to fail while writing the persisted files. I have tried the latest versions of the Python libraries and several Python versions (3.10, 3.11, 3.12), all with the same error. The only thing I am doing differently from a standard setup is defining my own embedding object and passing it in; I made sure it has the required method (`embed_documents`). I found two files in the persisted output folder. Here are their contents:
00001-embeddings.sqlite.sql
```
CREATE TABLE embeddings_queue (
    seq_id INTEGER PRIMARY KEY,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    operation INTEGER NOT NULL,
    topic TEXT NOT NULL,
    id TEXT NOT NULL,
    vector BLOB,
    encoding TEXT,
    metadata TEXT
);
```
00002-embeddings-queue-config.sqlite.sql
```
CREATE TABLE embeddings_queue_config (
    id INTEGER PRIMARY KEY,
    config_json_str TEXT
);
```
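Since the custom embedding object is the only non-standard part of the setup, one way to rule it out is to swap in a deterministic, offline stand-in that exposes the same two methods as `PromptOperator`. The sketch below is an assumption-laden illustration: the class name `FakeEmbedder`, its `dim` parameter, and the hash-based vectors are all hypothetical and carry no semantic meaning.

```python
import hashlib


class FakeEmbedder:
    """Offline stand-in with the same two methods as PromptOperator.

    Vectors are derived from a SHA-256 digest, so they are deterministic
    but meaningless; this is only for isolating the persistence error.
    """

    def __init__(self, dim: int = 8):
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32 (SHA-256 yields 32 bytes)")
        self.dim = dim

    def _embed(self, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale the first `dim` bytes into the range [0, 1].
        return [byte / 255.0 for byte in digest[: self.dim]]

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self._embed(text) for text in texts]

    def embed_query(self, text: str) -> list[float]:
        return self.embed_documents([text])[0]


if __name__ == "__main__":
    embedder = FakeEmbedder(dim=4)
    print(embedder.embed_documents(["the id of the account"]))
```

Passing `FakeEmbedder()` to `Chroma.from_documents` in place of `EMBEDDING_FUNCTION` should show whether the error depends on the embedding object at all. If the same `sqlite3.DatabaseError` appears, the Azure-backed embedder is probably not involved.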
### System Info
OS: Linux
OS Version: #82~20.04.1-Ubuntu SMP Tue Sep 3 12:27:43 UTC 2024
Python Version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct 4 2024, 13:27:36) [GCC 11.2.0]
I just tried this on Windows. It works on Windows but not on Linux. I'm not sure why, but maybe it has something to do with differing SQLite installs?
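To compare SQLite installs across the two platforms, a short sketch like the following can be run on each machine. Note that `sqlite3.sqlite_version` is the version of the SQLite library Python is linked against, whereas `sqlite3.version` is the version of the Python module itself. The helper name `sqlite_versions` is hypothetical.

```python
import sqlite3
import sys


def sqlite_versions() -> dict[str, str]:
    """Collect the interpreter version and the linked SQLite library version."""
    return {
        "python": sys.version.split()[0],
        # Version of the underlying SQLite C library, not of the sqlite3 module.
        "sqlite_library": sqlite3.sqlite_version,
    }


if __name__ == "__main__":
    for name, value in sqlite_versions().items():
        print(f"{name}: {value}")
```

Running this inside the same environment that raises the error (here, the `chess_engine_3.12` conda environment) matters, because different environments on one machine can link against different SQLite builds.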