Open isaac-aburto opened 9 months ago
Looks like unstructured barfs:
Current thread 0x000015cc (most recent call first):
File "<frozen importlib._bootstrap>", line 688 in _load_unlocked
File "<frozen importlib._bootstrap>", line 1006 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 1027 in _find_and_load
File "C:\ProgramData\miniconda3\envs\h2ogpt\lib\site-packages\langchain_community\document_loaders\pdf.py", line 57 in _get_elements
File "C:\ProgramData\miniconda3\envs\h2ogpt\lib\site-packages\langchain_community\document_loaders\unstructured.py", line 87 in load
Can you pip install an older version of unstructured or see if any other changes help?
I also see:
Current thread 0x000015cc (most recent call first):
File "C:\ProgramData\miniconda3\envs\h2ogpt\lib\ctypes\__init__.py", line 374 in __init__
File "C:\ProgramData\miniconda3\envs\h2ogpt\lib\site-packages\magic\loader.py", line 44 in load_lib
File "C:\ProgramData\miniconda3\envs\h2ogpt\lib\site-packages\magic\__init__.py", line 209 in <module>
File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 883 in exec_module
File "<frozen importlib._bootstrap>", line 688 in _load_unlocked
File "<frozen importlib._bootstrap>", line 1006 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 1027 in _find_and_load
File "C:\ProgramData\miniconda3\envs\h2ogpt\lib\site-packages\unstructured\file_utils\filetype.py", line 25 in <module>
File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 883 in exec_module
File "<frozen importlib._bootstrap>", line 688 in _load_unlocked
File "<frozen importlib._bootstrap>", line 1006 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 1027 in _find_and_load
File "C:\ProgramData\miniconda3\envs\h2ogpt\lib\site-packages\unstructured\partition\pdf.py", line 57 in <module>
File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 883 in exec_module
File "<frozen importlib._bootstrap>", line 688 in _load_unlocked
File "<frozen importlib._bootstrap>", line 1006 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 1027 in _find_and_load
File "C:\ProgramData\miniconda3\envs\h2ogpt\lib\site-packages\langchain_community\document_loaders\pdf.py", line 57 in _get_elements
File "C:\ProgramData\miniconda3\envs\h2ogpt\lib\site-packages\langchain_community\document_loaders\unstructured.py", line 87 in load
File "C:\Users\Administrator\h2ogpt\src\gpt_langchain.py", line 3212 in file_to_doc
Maybe the crash is caused by multiple threads running imports or accessing the same native libraries at the same time; there are known Python bugs in that area. Moving the imports earlier might avoid such races.
E.g. you can add these to the top of gpt_langchain.py:
import magic
from unstructured.partition.pdf import partition_pdf
Let me know if that helps, and I can move some imports outside local scopes.
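A minimal sketch of the eager-import idea, assuming the crash comes from worker threads triggering the first import of a native-backed module. The helper name `preload_modules` is hypothetical and not part of h2oGPT; it only imports each module once from the calling thread and records ordinary import errors.

```python
import importlib


def preload_modules(names):
    """Import each module once, from the calling (main) thread.

    Returns a dict mapping module name -> the exception raised on import,
    or None if the import succeeded. Hypothetical helper for illustration.
    """
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # e.g. ImportError, or OSError from a missing DLL
            results[name] = exc
    return results
```

Calling `preload_modules(["magic", "unstructured.partition.pdf"])` before any worker threads start would cover the two imports named above. Note that a fatal access violation kills the process and cannot be caught by `try`/`except`; the point is only to force the failure to happen at startup in a single thread, which helps tell an import race apart from a library-loading problem.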
I tried an older version of unstructured, but it doesn't work. Same result when changing some imports in the code.
Thanks to your response I decided to review certain libraries, especially magic.
The error "Windows fatal exception: access violation" apparently happens in this file: C:\ProgramData\miniconda3\envs\h2ogpt\Lib\site-packages\magic\loader.py
def _lib_candidates():
    yield find_library('magic')
    #print("sys.platform: ", sys.platform)
    if sys.platform == "darwin":
        paths = [
            '/opt/local/lib',
            '/usr/local/lib',
            '/opt/homebrew/lib',
        ] + glob.glob('/usr/local/Cellar/libmagic/*/lib')
        for i in paths:
            yield os.path.join(i, 'libmagic.dylib')
    elif sys.platform in ("win32", "cygwin"):
        #prefixes = ['msys-magic-1', 'libmagic', 'magic1', 'cygmagic-1', 'libmagic-1']
        prefixes = ['libmagic']
        for i in prefixes:
            # find_library searches in %PATH% but not the current directory,
            # so look for both
            yield './%s.dll' % (i,)
            yield find_library(i)
The code was trying to load these DLL files, but they did not exist in the directories being searched. What I did was move the file located at: C:\ProgramData\miniconda3\envs\h2ogpt\Lib\site-packages\magic\libmagic\libmagic.dll
to C:\ProgramData\miniconda3\envs\h2ogpt\Library\usr\bin
and comment out the prefixes in the list that could not be found.
I don't know if it's the best solution, but it's the only one that has helped me.
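For anyone hitting the same thing, here is a small diagnostic sketch, assuming the problem is that no libmagic DLL sits in the directories being searched. The names `find_dll` and `report` are illustrative and not part of python-magic; only `ctypes.util.find_library` is a real library call.

```python
import os
from ctypes.util import find_library


def find_dll(filename, directories):
    """Return the full paths where `filename` exists among `directories`."""
    hits = []
    for d in directories:
        candidate = os.path.join(d, filename)
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits


def report(directories):
    # find_library returns None when nothing is found; on Windows it
    # searches the directories listed in PATH.
    print("find_library('magic')    ->", find_library("magic"))
    print("find_library('libmagic') ->", find_library("libmagic"))
    print("libmagic.dll found at    ->", find_dll("libmagic.dll", directories))
```

On the setup described here, `report` could be called with the two directories from this post: C:\ProgramData\miniconda3\envs\h2ogpt\Lib\site-packages\magic\libmagic and C:\ProgramData\miniconda3\envs\h2ogpt\Library\usr\bin. If `find_library` prints None for both names before the move and a path afterwards, that would be consistent with the explanation above.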
Interesting, thanks. I'll see if I can understand.
I am working on an EC2 instance (g4dn.xlarge).
The installation goes fine. It works perfectly if I upload any other type of file (txt, csv, xml...), but when I try to upload a PDF file I get the error and the application stops.