Closed YunchengLiang closed 10 months ago
This is the latest release 0.0.30
Can you provide your code to reproduce?
Can you provide your code to reproduce?
Yes for sure!
# Reproduction script: ingest an online PDF into an embedchain OpenSourceApp,
# using a HuggingFace tokenizer as the chunker's length function.
from embedchain import OpenSourceApp

chat_bot = OpenSourceApp()

from transformers import AutoTokenizer, AutoModel

# Tokenizer for all-MiniLM-L6-v2 — chunk sizes below are measured in its tokens.
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')


def huggingface_tokenizer_length(text: str) -> int:
    """Return the number of tokens the HF tokenizer produces for *text*."""
    return len(tokenizer.encode(text))


from embedchain.config import AddConfig, ChunkerConfig

# chunk_size/chunk_overlap are token counts because of the custom length_function.
chunker_config = ChunkerConfig(chunk_size=230, chunk_overlap=20, length_function=huggingface_tokenizer_length)
pdf_url = 'https://www.rogers.com/cms/pdf/en/Consumer_SUG_V20.pdf'  # online resources
chat_bot.add('pdf_file', pdf_url, config=AddConfig(chunker=chunker_config))
I can't reproduce. It works for me.
def use_pysqlite3():
    """
    Swap std-lib sqlite3 with pysqlite3.

    Installs ``pysqlite3-binary`` at runtime and registers it under the
    ``sqlite3`` key in ``sys.modules`` so later importers (e.g. chromadb)
    transparently get the newer SQLite build.
    NOTE(review): modules that imported ``sqlite3`` *before* this call still
    hold the old module object — call this as early as possible.
    """
    import subprocess
    import sys

    # Runtime pip install: workaround for platforms whose system SQLite is too
    # old for chromadb's requirements.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "pysqlite3-binary"])
    __import__("pysqlite3")
    # Re-register the freshly imported module under the stdlib name.
    sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")


# Don't be surprised if this doesn't log as you expect, because the logger is instantiated after the import
use_pysqlite3()
# Reproduction script variant: same as before but with a reset() of the
# persisted vector store before re-creating the app.
from embedchain import OpenSourceApp

chat_bot = OpenSourceApp()
chat_bot.reset()  # wipe any previously persisted vector-DB state
chat_bot = OpenSourceApp()  # re-create the app after the reset

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')


def huggingface_tokenizer_length(text: str) -> int:
    """Token count of *text* under the all-MiniLM-L6-v2 tokenizer."""
    return len(tokenizer.encode(text))


from embedchain.config import AddConfig, ChunkerConfig

chunker_config = ChunkerConfig(chunk_size=230, chunk_overlap=20, length_function=huggingface_tokenizer_length)
pdf_url = 'https://www.rogers.com/cms/pdf/en/Consumer_SUG_V20.pdf'  # online resources
chat_bot.add('pdf_file', pdf_url, config=AddConfig(chunker=chunker_config))
maybe the reset is the key?
I can't reproduce. It works for me.
# Quoted reproduction script, restored from its collapsed single-line form.
def use_pysqlite3():
    """Swap std-lib sqlite3 with pysqlite3."""
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "pysqlite3-binary"])
    __import__("pysqlite3")
    sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")


# Don't be surprised if this doesn't log as you expect, because the logger is instantiated after the import
use_pysqlite3()

from embedchain import OpenSourceApp

chat_bot = OpenSourceApp()
chat_bot.reset()
chat_bot = OpenSourceApp()

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')


def huggingface_tokenizer_length(text: str) -> int:
    """Token count of *text* under the all-MiniLM-L6-v2 tokenizer."""
    return len(tokenizer.encode(text))


from embedchain.config import AddConfig, ChunkerConfig

chunker_config = ChunkerConfig(chunk_size=230, chunk_overlap=20, length_function=huggingface_tokenizer_length)
pdf_url = 'https://www.rogers.com/cms/pdf/en/Consumer_SUG_V20.pdf'  # online resources
chat_bot.add('pdf_file', pdf_url, config=AddConfig(chunker=chunker_config))
maybe the reset is the key?
I tried the reset but seems it is not the problem. I have some questions
TypeError: init_index(): incompatible function arguments. The following argument types are supported:
1. (self: hnswlib.Index, max_elements: int, M: int = 16, ef_construction: int = 200, random_seed: int = 100, allow_replace_deleted: bool = False) -> None
Invoked with: <hnswlib.Index(space='l2', dim=384)>; kwargs: max_elements=1000, ef_construction=100, M=16, is_persistent_index=True, persistence_location='db\\89cdd1e6-5c9c-49f0-9b16-96170d70598a'
Does this error suggest that arguments "is_persistent_index" and "persistence_location" are not supported arguments?
- Are we using the same libraries ( chromadb-0.4.2 embedchain-0.0.30) ?
I'm using the main branch of this repository. I don't think that makes the difference.
The error I get is
Is this the whole error? Where is the traceback?
I guess we need someone else to try to reproduce this; as I said, I can't reproduce your error. I don't know which package is throwing the error, or whether this is even an embedchain issue.
I get a new error now when I run your file
# Maintainer's test script: same reproduction but with DEBUG logging enabled
# and the newer add() signature (URL as the first positional argument).
from embedchain import OpenSourceApp
from embedchain.config import OpenSourceAppConfig

# DEBUG level so embedchain emits chunking/DB details while reproducing.
config = OpenSourceAppConfig(log_level="DEBUG")
chat_bot = OpenSourceApp(config=config)
chat_bot.reset()  # clear persisted Chroma state before re-adding
chat_bot = OpenSourceApp(config=config)

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")


def huggingface_tokenizer_length(text: str) -> int:
    """Token count of *text* under the all-MiniLM-L6-v2 tokenizer."""
    return len(tokenizer.encode(text))


from embedchain.config import AddConfig, ChunkerConfig

chunker_config = ChunkerConfig(chunk_size=230, chunk_overlap=20, length_function=huggingface_tokenizer_length)
pdf_url = "https://www.rogers.com/cms/pdf/en/Consumer_SUG_V20.pdf"  # online resources
chat_bot.add(pdf_url, config=AddConfig(chunker=chunker_config))
error:
ERROR:chromadb.telemetry.posthog:Failed to send telemetry event client_start: module 'chromadb' has no attribute 'get_settings'
Found model file at /home/carl/.cache/gpt4all/orca-mini-3b.ggmlv3.q4_0.bin
llama.cpp: loading model from /home/carl/.cache/gpt4all/orca-mini-3b.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 2862.72 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 650.00 MB
Found model file at /home/carl/.cache/gpt4all/orca-mini-3b.ggmlv3.q4_0.bin
llama.cpp: loading model from /home/carl/.cache/gpt4all/orca-mini-3b.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 2862.72 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 650.00 MB
Traceback (most recent call last):
File "/home/carl/code/embedchain/test.py", line 24, in <module>
chat_bot.add(pdf_url, config=AddConfig(chunker=chunker_config))
File "/home/carl/code/embedchain/embedchain/embedchain.py", line 62, in add
self.load_and_embed(data_formatter.loader, data_formatter.chunker, source, metadata)
File "/home/carl/code/embedchain/embedchain/embedchain.py", line 86, in load_and_embed
existing_docs = self.collection.get(
^^^^^^^^^^^^^^^^^^^^
File "/home/carl/code/embedchain/.venv/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 134, in get
return self._client._get(
^^^^^^^^^^^^^^^^^^
File "/home/carl/code/embedchain/.venv/lib/python3.11/site-packages/chromadb/api/segment.py", line 312, in _get
metadata_segment = self._manager.get_segment(collection_id, MetadataReader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/carl/code/embedchain/.venv/lib/python3.11/site-packages/chromadb/segment/impl/manager/local.py", line 106, in get_segment
segment = next(filter(lambda s: s["type"] in known_types, segments))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
StopIteration
This is using an unmerged PR. I'm just posting this to keep an eye on it, because it works with the normal App. edit: tested in main, same error.
The error I get is
Is this the whole error? Where is the traceback?
I guess we need someone else to try to reproduce this; as I said, I can't reproduce your error. I don't know which package is throwing the error, or whether this is even an embedchain issue.
yeah.... i will ask one of my coworker to do the same and see what happens, will come back to this thread asap
I get a new error now when I run your file
# Quoted test script, restored from its collapsed single-line form.
from embedchain import OpenSourceApp
from embedchain.config import OpenSourceAppConfig

config = OpenSourceAppConfig(log_level="DEBUG")
chat_bot = OpenSourceApp(config=config)
chat_bot.reset()
chat_bot = OpenSourceApp(config=config)

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")


def huggingface_tokenizer_length(text: str) -> int:
    """Token count of *text* under the all-MiniLM-L6-v2 tokenizer."""
    return len(tokenizer.encode(text))


from embedchain.config import AddConfig, ChunkerConfig

chunker_config = ChunkerConfig(chunk_size=230, chunk_overlap=20, length_function=huggingface_tokenizer_length)
pdf_url = "https://www.rogers.com/cms/pdf/en/Consumer_SUG_V20.pdf"  # online resources
chat_bot.add(pdf_url, config=AddConfig(chunker=chunker_config))
error:
ERROR:chromadb.telemetry.posthog:Failed to send telemetry event client_start: module 'chromadb' has no attribute 'get_settings' Found model file at /home/carl/.cache/gpt4all/orca-mini-3b.ggmlv3.q4_0.bin llama.cpp: loading model from /home/carl/.cache/gpt4all/orca-mini-3b.ggmlv3.q4_0.bin llama_model_load_internal: format = ggjt v3 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 2048 llama_model_load_internal: n_embd = 3200 llama_model_load_internal: n_mult = 240 llama_model_load_internal: n_head = 32 llama_model_load_internal: n_layer = 26 llama_model_load_internal: n_rot = 100 llama_model_load_internal: ftype = 2 (mostly Q4_0) llama_model_load_internal: n_ff = 8640 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 3B llama_model_load_internal: ggml ctx size = 0.06 MB llama_model_load_internal: mem required = 2862.72 MB (+ 682.00 MB per state) llama_new_context_with_model: kv self size = 650.00 MB Found model file at /home/carl/.cache/gpt4all/orca-mini-3b.ggmlv3.q4_0.bin llama.cpp: loading model from /home/carl/.cache/gpt4all/orca-mini-3b.ggmlv3.q4_0.bin llama_model_load_internal: format = ggjt v3 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 2048 llama_model_load_internal: n_embd = 3200 llama_model_load_internal: n_mult = 240 llama_model_load_internal: n_head = 32 llama_model_load_internal: n_layer = 26 llama_model_load_internal: n_rot = 100 llama_model_load_internal: ftype = 2 (mostly Q4_0) llama_model_load_internal: n_ff = 8640 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 3B llama_model_load_internal: ggml ctx size = 0.06 MB llama_model_load_internal: mem required = 2862.72 MB (+ 682.00 MB per state) llama_new_context_with_model: kv self size = 650.00 MB Traceback (most recent call last): File "/home/carl/code/embedchain/test.py", line 24, in <module> chat_bot.add(pdf_url, config=AddConfig(chunker=chunker_config)) File 
"/home/carl/code/embedchain/embedchain/embedchain.py", line 62, in add self.load_and_embed(data_formatter.loader, data_formatter.chunker, source, metadata) File "/home/carl/code/embedchain/embedchain/embedchain.py", line 86, in load_and_embed existing_docs = self.collection.get( ^^^^^^^^^^^^^^^^^^^^ File "/home/carl/code/embedchain/.venv/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 134, in get return self._client._get( ^^^^^^^^^^^^^^^^^^ File "/home/carl/code/embedchain/.venv/lib/python3.11/site-packages/chromadb/api/segment.py", line 312, in _get metadata_segment = self._manager.get_segment(collection_id, MetadataReader) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/carl/code/embedchain/.venv/lib/python3.11/site-packages/chromadb/segment/impl/manager/local.py", line 106, in get_segment segment = next(filter(lambda s: s["type"] in known_types, segments)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ StopIteration
This is using an unmerged PR. I'm just posting this to keep an eye on it, because it works with the normal App. edit: tested in main, same error.
It works fine on my coworker's computer... I guess something is wrong with my configuration. I will try using a new conda kernel.
Closing this as its very old.
🐛 Describe the bug