thiswillbeyourgithub / wdoc

Summarize and query from a lot of heterogeneous documents. Any LLM provider, any filetype, scalable, under development
GNU General Public License v3.0

Error while using wdoc in WSL #7

Closed havr-p closed 1 week ago

havr-p commented 1 week ago

Hello, I got an error when I tried to use wdoc in WSL:

tortuga@CONSULTIS-XMG3:/mnt/c/WINDOWS/system32$ wdoc --path="https://situational-awareness.ai/wp-content/uploads/2024/06/situationalawareness.pdf" --task=query --filetype="online_pdf" --query="What does it say about alphago?" --query_retrievers='default_multiquery' --top_k=auto_200_500
WARNING:langchain_community.utils.user_agent:USER_AGENT environment variable not set, consider setting it to identify your requests.
             _
__      ____| | ___   ___
\ \ /\ / / _` |/ _ \ / __|
 \ V  V / (_| | (_) | (__
  \_/\_/ \__,_|\___/ \___|

INFO:httpx:HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
Bypassing model name matching for model 'openai/gpt-4o'
Bypassing model name matching for model 'openai/gpt-4o-mini'
INFO:faiss.loader:Loading faiss with AVX2 support.
INFO:faiss.loader:Successfully loaded faiss with AVX2 support.
/home/tortuga/.local/lib/python3.11/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
Loading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.30doc/s]
Done loading all 1 documents in 0.44s
No document failed to load!
Deduplicating...
Getting all hash
Counting them
No duplicates!
Deduplicating:   0%|                                                                                                                                                                                                                | 0/163 [00:00<?, ?doc/s]
Traceback (most recent call last):
  File "/home/tortuga/.local/bin/wdoc", line 8, in <module>
    sys.exit(cli_launcher())
             ^^^^^^^^^^^^^^
  File "/home/tortuga/.local/lib/python3.11/site-packages/wdoc/__main__.py", line 69, in cli_launcher
    fire.Fire(wdoc)
  File "/home/tortuga/.local/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tortuga/.local/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/tortuga/.local/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "<@beartype(wdoc.wdoc.wdoc.__init__) at 0x7f59d7347740>", line 14, in __init__
  File "/home/tortuga/.local/lib/python3.11/site-packages/wdoc/utils/misc.py", line 701, in new_func
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tortuga/.local/lib/python3.11/site-packages/wdoc/wdoc.py", line 522, in __init__
    self.prepare_query_task()
  File "<@beartype(wdoc.wdoc.wdoc.prepare_query_task) at 0x7f59d73476a0>", line 13, in prepare_query_task
  File "/home/tortuga/.local/lib/python3.11/site-packages/wdoc/wdoc.py", line 924, in prepare_query_task
    self.loaded_embeddings, self.embeddings = load_embeddings(
                                              ^^^^^^^^^^^^^^^^
  File "<@beartype(wdoc.utils.embeddings.load_embeddings) at 0x7f59d74d47c0>", line 172, in load_embeddings
  File "/home/tortuga/.local/lib/python3.11/site-packages/wdoc/utils/embeddings.py", line 181, in load_embeddings
    lfs = LocalFileStore(
          ^^^^^^^^^^^^^^^
  File "/home/tortuga/.local/lib/python3.11/site-packages/wdoc/utils/customs/compressed_embeddings_cache.py", line 59, in __init__
    self.pdi = PersistDict(
               ^^^^^^^^^^^^
  File "<@beartype(PersistDict.PersistDict.PersistDict.__init__) at 0x7f59d7a2aca0>", line 207, in __init__
  File "/home/tortuga/.local/lib/python3.11/site-packages/PersistDict/PersistDict.py", line 100, in __init__
    self.__init_table__()
  File "<@beartype(PersistDict.PersistDict.PersistDict.__init_table__) at 0x7f59d7a2ad40>", line 12, in __init_table__
  File "/home/tortuga/.local/lib/python3.11/site-packages/PersistDict/PersistDict.py", line 147, in __init_table__
    conn = self.__connect__()
           ^^^^^^^^^^^^^^^^^^
  File "/home/tortuga/.local/lib/python3.11/site-packages/PersistDict/PersistDict.py", line 113, in __connect__
    return sqlite3.connect(
           ^^^^^^^^^^^^^^^^
sqlite3.OperationalError: unable to open database file

I followed the standard way of installing wdoc through pip. Do you know what the reason could be? Maybe I can help you add some diagnostic tools to wdoc, such as Sentry.

thiswillbeyourgithub commented 1 week ago

Hi! Thanks a lot for your interest and for taking a look.

It appears I may have been overly enthusiastic when creating PersistDict to work around langchain caching issues that have gone unfixed for months. It's basically just a sqlite db accessed like a Python dict, for convenience when caching things.
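To illustrate the idea of "a sqlite db accessed like a Python dict", here is a minimal sketch. It is not PersistDict's actual code: the class name `TinySqliteDict`, its table layout, and its methods are hypothetical and only show the general pattern.

```python
import sqlite3
from contextlib import closing


class TinySqliteDict:
    """Hypothetical dict-like wrapper around a sqlite file (not PersistDict itself)."""

    def __init__(self, db_path):
        self.db_path = str(db_path)
        # Create the key/value table on first use.
        self._execute(
            "CREATE TABLE IF NOT EXISTS storage (key TEXT PRIMARY KEY, value TEXT)"
        )

    def _execute(self, sql, params=()):
        # Open a short-lived connection, commit on success, always close it.
        with closing(sqlite3.connect(self.db_path)) as conn:
            with conn:
                return conn.execute(sql, params).fetchall()

    def __setitem__(self, key, value):
        self._execute(
            "INSERT OR REPLACE INTO storage (key, value) VALUES (?, ?)", (key, value)
        )

    def __getitem__(self, key):
        rows = self._execute("SELECT value FROM storage WHERE key = ?", (key,))
        if not rows:
            raise KeyError(key)
        return rows[0][0]

    def __contains__(self, key):
        rows = self._execute("SELECT 1 FROM storage WHERE key = ?", (key,))
        return bool(rows)
```

With this sketch, `cache = TinySqliteDict("cache.db"); cache["k"] = "v"` persists the pair to disk. Note that the constructor calls `sqlite3.connect`, so it fails if the parent directory of the database path does not exist.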

If something goes wrong, try deleting the cache folder or disabling the cache (cf. the bottom of the --help). Use --debug to find out where the cache is located.

Please tell me if that helped!

blinkenl1ghts commented 1 week ago

Hi! I ran into a similar issue: wdoc fails to open the sqlite database. Here are the logs:

__      ____| | ___   ___ 
\ \ /\ / / _` |/ _ \ / __|
 \ V  V / (_| | (_) | (__ 
  \_/\_/ \__,_|\___/ \___|

2024-11-02T00:33:43.514550+0100 INFO wdoc 8638483456 93318 printer 92 Bypassing model name matching for model 'openai/gpt-4o'
2024-11-02T00:33:43.514683+0100 INFO wdoc 8638483456 93318 printer 92 Bypassing model name matching for model 'openai/gpt-4o-mini'
2024-11-02T00:33:43.514834+0100 INFO wdoc 8638483456 93318 printer 92 Cache location: /Users/redacted/Library/Caches/wdoc
2024-11-02T00:33:43.514894+0100 INFO wdoc 8638483456 93318 printer 92 Log location: /Users/redacted/Library/Logs/wdoc
2024-11-02T00:33:43.514994+0100 INFO wdoc 8638483456 93318 printer 92 Loading model via litellm
2024-11-02T00:33:45.252845+0100 INFO wdoc 8638483456 93318 printer 92 Loading pdf: 'situationalawareness.pdf'
2024-11-02T00:33:45.320935+0100 INFO wdoc 8638483456 93318 printer 92 Trying to parse situationalawareness.pdf using pymupdf
2024-11-02T00:33:45.463420+0100 INFO wdoc 8638483456 93318 printer 92 Language probability after parsing situationalawareness.pdf: {'pymupdf': 0.8829599149299391}
2024-11-02T00:33:46.009417+0100 INFO wdoc 8638483456 93318 printer 92 Done loading all 1 documents in 0.76s
2024-11-02T00:33:46.009672+0100 INFO wdoc 8638483456 93318 printer 92 No document failed to load!
2024-11-02T00:33:46.009719+0100 INFO wdoc 8638483456 93318 printer 92 Deduplicating...
2024-11-02T00:33:46.009753+0100 INFO wdoc 8638483456 93318 printer 92 Getting all hash
2024-11-02T00:33:46.009799+0100 INFO wdoc 8638483456 93318 printer 92 Counting them
2024-11-02T00:33:46.009920+0100 INFO wdoc 8638483456 93318 printer 92 No duplicates!
2024-11-02T00:33:46.010211+0100 INFO wdoc 8638483456 93318 printer 92 Selected embedding model 'text-embedding-3-small' of backend openai
2024-11-02T00:33:46.020518+0100 DEBUG wdoc 8638483456 93318 _log 323 PersistDict:.__init__
2024-11-02T00:33:46.020735+0100 DEBUG wdoc 8638483456 93318 _log 323 PersistDict:.__init_table__
2024-11-02T00:33:46.020772+0100 DEBUG wdoc 8638483456 93318 _log 323 PersistDict:opening connection
2024-11-02T00:33:46.020881+0100 INFO wdoc 8638483456 93318 printer 92 
--verbose was used so opening debug console at the appropriate frame. Press 'c' to continue to the frame of this print.
2024-11-02T00:33:46.023422+0100 INFO wdoc 8638483456 93318 printer 92   File "/Users/redacted/.local/share/virtualenvs/wdoc_test-ePoOMSPS/bin/wdoc", line 8, in <module>
    sys.exit(cli_launcher())
             ^^^^^^^^^^^^^^

2024-11-02T00:33:46.023472+0100 INFO wdoc 8638483456 93318 printer 92   File "/Users/redacted/.local/share/virtualenvs/wdoc_test-ePoOMSPS/lib/python3.11/site-packages/wdoc/__main__.py", line 69, in cli_launcher
    fire.Fire(wdoc)

2024-11-02T00:33:46.023507+0100 INFO wdoc 8638483456 93318 printer 92   File "/Users/redacted/.local/share/virtualenvs/wdoc_test-ePoOMSPS/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

2024-11-02T00:33:46.023570+0100 INFO wdoc 8638483456 93318 printer 92   File "/Users/redacted/.local/share/virtualenvs/wdoc_test-ePoOMSPS/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^

2024-11-02T00:33:46.023601+0100 INFO wdoc 8638483456 93318 printer 92   File "/Users/redacted/.local/share/virtualenvs/wdoc_test-ePoOMSPS/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^

2024-11-02T00:33:46.023629+0100 INFO wdoc 8638483456 93318 printer 92   File "/Users/redacted/.local/share/virtualenvs/wdoc_test-ePoOMSPS/lib/python3.11/site-packages/wdoc/utils/misc.py", line 701, in new_func
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^

2024-11-02T00:33:46.023660+0100 INFO wdoc 8638483456 93318 printer 92   File "/Users/redacted/.local/share/virtualenvs/wdoc_test-ePoOMSPS/lib/python3.11/site-packages/wdoc/wdoc.py", line 522, in __init__
    self.prepare_query_task()

2024-11-02T00:33:46.023691+0100 INFO wdoc 8638483456 93318 printer 92   File "/Users/redacted/.local/share/virtualenvs/wdoc_test-ePoOMSPS/lib/python3.11/site-packages/wdoc/wdoc.py", line 924, in prepare_query_task
    self.loaded_embeddings, self.embeddings = load_embeddings(
                                              ^^^^^^^^^^^^^^^^

2024-11-02T00:33:46.023721+0100 INFO wdoc 8638483456 93318 printer 92   File "/Users/redacted/.local/share/virtualenvs/wdoc_test-ePoOMSPS/lib/python3.11/site-packages/wdoc/utils/embeddings.py", line 181, in load_embeddings
    lfs = LocalFileStore(
          ^^^^^^^^^^^^^^^

2024-11-02T00:33:46.023748+0100 INFO wdoc 8638483456 93318 printer 92   File "/Users/redacted/.local/share/virtualenvs/wdoc_test-ePoOMSPS/lib/python3.11/site-packages/wdoc/utils/customs/compressed_embeddings_cache.py", line 59, in __init__
    self.pdi = PersistDict(
               ^^^^^^^^^^^^

2024-11-02T00:33:46.023775+0100 INFO wdoc 8638483456 93318 printer 92   File "<@beartype(PersistDict.PersistDict.PersistDict.__init__) at 0x11dc0c2c0>", line 207, in __init__

2024-11-02T00:33:46.023801+0100 INFO wdoc 8638483456 93318 printer 92   File "/Users/redacted/.local/share/virtualenvs/wdoc_test-ePoOMSPS/lib/python3.11/site-packages/PersistDict/PersistDict.py", line 100, in __init__
    self.__init_table__()

2024-11-02T00:33:46.023828+0100 INFO wdoc 8638483456 93318 printer 92   File "<@beartype(PersistDict.PersistDict.PersistDict.__init_table__) at 0x11dc0c400>", line 12, in __init_table__

2024-11-02T00:33:46.023853+0100 INFO wdoc 8638483456 93318 printer 92   File "/Users/redacted/.local/share/virtualenvs/wdoc_test-ePoOMSPS/lib/python3.11/site-packages/PersistDict/PersistDict.py", line 147, in __init_table__
    conn = self.__connect__()
           ^^^^^^^^^^^^^^^^^^

2024-11-02T00:33:46.023879+0100 INFO wdoc 8638483456 93318 printer 92   File "/Users/redacted/.local/share/virtualenvs/wdoc_test-ePoOMSPS/lib/python3.11/site-packages/PersistDict/PersistDict.py", line 113, in __connect__
    return sqlite3.connect(
           ^^^^^^^^^^^^^^^^

2024-11-02T00:33:46.023909+0100 INFO wdoc 8638483456 93318 printer 92 <class 'sqlite3.OperationalError'> : unable to open database file
2024-11-02T00:34:16.888374+0100 INFO wdoc 8638483456 93318 printer 92 You are now in the exception handling frame.

I tried running it with '--disable_llm_cache' and deleting the cache folder, but got the same result. I also tried both the main and the dev branches.

I checked the database file itself and it seems empty? I'm not sure, as I'm not really familiar with langchain:

> sqlite3 /Users/redacted/Library/Caches/wdoc/langchain.db
SQLite version 3.43.2 2023-10-10 13:08:14
Enter ".help" for usage hints.
sqlite> .tables
metadata  storage
sqlite> SELECT * FROM storage;
sqlite>
sqlite> select * from metadata;
version|0.1.3
sqlite>

aivisol commented 1 week ago

I ran into the same error while running the sample script. Creating a directory for the sqlite database did the trick:

mkdir /Users/redacted/Library/Caches/wdoc/CacheEmbeddings

thiswillbeyourgithub commented 1 week ago

Hi. I'm really sorry, but it's an exceptionally busy week for me; I'm normally much more responsive!

I only have my phone on me, but I added one line in the latest dev to mkdir the parents of LocalFileStore's db. That should do it.
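For readers hitting the same error elsewhere, the failure and the fix can be sketched as follows. This is not the actual wdoc patch: `connect_creating_parents` is a hypothetical helper, and the `CacheEmbeddings/cache.db` path inside a temporary directory only mimics the layout mentioned above. The sketch relies on the fact that `sqlite3.connect` creates a missing database file but not its missing parent directories.

```python
import sqlite3
import tempfile
from pathlib import Path


def connect_creating_parents(db_path):
    """Hypothetical helper: make sure the parent directories exist, then connect."""
    db_path = Path(db_path)
    # sqlite3 can create the database file itself, but not the folders above it.
    db_path.parent.mkdir(parents=True, exist_ok=True)
    return sqlite3.connect(db_path)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "CacheEmbeddings" / "cache.db"

        # Without the directory, connecting raises sqlite3.OperationalError.
        try:
            sqlite3.connect(db_path).close()
            failed_without_mkdir = False
        except sqlite3.OperationalError:
            failed_without_mkdir = True

        # With the parents created first, the same path opens normally.
        conn = connect_creating_parents(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS storage (key TEXT PRIMARY KEY, value BLOB)")
        conn.close()
        return failed_without_mkdir, db_path.exists()


if __name__ == "__main__":
    print(demo())
```

Passing `exist_ok=True` keeps the call harmless when the cache folder is already there, so the helper can run on every startup.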

Sorry for the oversight and thanks a lot for bringing this to my attention.

@havr-p thanks for kindly offering to help with Sentry. I've never used it, so I'm not sure it's actually needed, but if you care to explain why I should use it, I'll gladly read that!

Thanks to @aivisol and @blinkenl1ghts, this was much quicker to fix on my phone.

Don't hesitate to reopen!

thiswillbeyourgithub commented 1 week ago

This is included in the latest release.