hwchase17 / chat-your-data

MIT License
build(deps): add bs4; fix typo in README #3

Closed MthwRobinson closed 1 year ago

MthwRobinson commented 1 year ago

Cool application! Raising a small PR to fix a README typo and add beautifulsoup4 to requirements.txt. I get the following ModuleNotFoundError if it's not installed.

Traceback (most recent call last):
  File "ingest_data.py", line 2, in <module>
    from langchain.document_loaders import UnstructuredFileLoader
  File "/Users/mrobinson/.pyenv/versions/langchain/lib/python3.8/site-packages/langchain/document_loaders/__init__.py", line 3, in <module>
    from langchain.document_loaders.azlyrics import AZLyricsLoader
  File "/Users/mrobinson/.pyenv/versions/langchain/lib/python3.8/site-packages/langchain/document_loaders/azlyrics.py", line 5, in <module>
    from langchain.document_loaders.web_base import WebBaseLoader
  File "/Users/mrobinson/.pyenv/versions/langchain/lib/python3.8/site-packages/langchain/document_loaders/web_base.py", line 5, in <module>
    from bs4 import BeautifulSoup
ModuleNotFoundError: No module named 'bs4'
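
For anyone who wants to catch this before running the script, here is a minimal sketch that checks whether the import name can be found. The helper name `missing_modules` is hypothetical and not part of the repo; note that the PyPI package `beautifulsoup4` is imported as `bs4`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names, not PyPI names: beautifulsoup4 is imported as bs4.
required = ["bs4"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
```

If `bs4` is reported as missing, `pip install beautifulsoup4` should resolve the error above.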

jvonreusner commented 1 year ago

Hi,

New to all of this GitHub and Python stuff, but ChatGPT is helping me get started.

I cloned this GitHub repo into PyCharm (again, I'm a n00b). When I run 'python ingest_data.py' as is (with all dependencies installed), I get this error message:

C:\Users\jvonr\PycharmProjects\chat-your-data\venv\Scripts\python.exe ingest_data.py
Traceback (most recent call last):
  File "C:\Users\jvonr\PycharmProjects\chat-your-data\ingest_data.py", line 9, in <module>
    raw_documents = loader.load()
                    ^^^^^^^^^^^^^
  File "C:\Users\jvonr\PycharmProjects\chat-your-data\venv\Lib\site-packages\langchain\document_loaders\unstructured.py", line 35, in load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jvonr\PycharmProjects\chat-your-data\venv\Lib\site-packages\langchain\document_loaders\unstructured.py", line 29, in _get_elements
    from unstructured.partition.auto import partition
  File "C:\Users\jvonr\PycharmProjects\chat-your-data\venv\Lib\site-packages\unstructured\partition\auto.py", line 3, in <module>
    from unstructured.file_utils.filetype import detect_filetype, FileType
  File "C:\Users\jvonr\PycharmProjects\chat-your-data\venv\Lib\site-packages\unstructured\file_utils\filetype.py", line 6, in <module>
    import magic
  File "C:\Users\jvonr\PycharmProjects\chat-your-data\venv\Lib\site-packages\magic\__init__.py", line 209, in <module>
    libmagic = loader.load_lib()
               ^^^^^^^^^^^^^^^^^
  File "C:\Users\jvonr\PycharmProjects\chat-your-data\venv\Lib\site-packages\magic\loader.py", line 49, in load_lib
    raise ImportError('failed to find libmagic. Check your installation')
ImportError: failed to find libmagic. Check your installation

Is there a reason this isn't working straight after installation or am I just dumb and doing something wrong? Thanks!

MthwRobinson commented 1 year ago

@jvonreusner - unstructured uses libmagic for filetype detection. Since it looks like you're on Windows, I think you need to pip install python-magic-bin instead of python-magic. We'll add an issue on the unstructured side to see if we can clean that up for Windows pip installs.
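
As a rough sketch of that suggestion (not something unstructured or this repo ships), the snippet below picks the package name by platform and checks whether `import magic` succeeds. The helper names `magic_package_for` and `libmagic_available` are hypothetical.

```python
import sys


def magic_package_for(platform=sys.platform):
    """Return the PyPI package suggested in this thread for the given platform."""
    # sys.platform is "win32" on Windows, including 64-bit builds.
    return "python-magic-bin" if platform.startswith("win") else "python-magic"


def libmagic_available():
    """Return True if the magic module imports, i.e. libmagic was found."""
    try:
        import magic  # noqa: F401  (raises ImportError when libmagic is missing)
    except ImportError:
        return False
    return True


if not libmagic_available():
    print(f"libmagic not found; try: pip install {magic_package_for()}")
```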

MthwRobinson commented 1 year ago

Docs for that are here

MthwRobinson commented 1 year ago

Added https://github.com/Unstructured-IO/unstructured/issues/234 to address

jvonreusner commented 1 year ago

Wonderful - thank you so much.

I'm also having trouble working out which version of Python I need, and it doesn't seem to be clearly stated in the requirement documents.

I'm currently using the most up-to-date version for Windows, Python 3.11, but am getting error messages about my interpreter being invalid.

ChatGPT tried to give me an answer by telling me 3.6.3, but I think it has no idea what it's talking about lol

MthwRobinson commented 1 year ago

3.6.3 is definitely wrong. unstructured won't work with Python 3.6 and older because of the PyTorch dependency for the PDF partitioning model (though if you don't include the local-inference extra dependencies, it won't pull that in and may work for you). We currently test against 3.8 and have an issue to add later Python versions to CI (https://github.com/Unstructured-IO/unstructured/issues/145). We've gotten it working on 3.10 before, and I wouldn't expect 3.11 to be an issue.
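
To make that concrete, here is a small sketch that classifies the running interpreter using only the versions mentioned in this comment. The constants and the `describe` helper are hypothetical and not part of unstructured; anything other than 3.8 is simply labeled untested, even though 3.10 has worked for us.

```python
import sys

TESTED = (3, 8)          # version currently tested in CI, per this comment
OLDEST_BROKEN = (3, 6)   # 3.6 and older are not expected to work


def describe(version=sys.version_info[:2]):
    """Classify a (major, minor) Python version against the thread's guidance."""
    if version <= OLDEST_BROKEN:
        return "unsupported"
    if version == TESTED:
        return "tested"
    return "untested"


major, minor = sys.version_info[:2]
print(f"Python {major}.{minor}: {describe()}")
```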

If you're running on Windows, I'd also check out these instructions from our docs. Long story short, the detectron2 model we use for PDF partitioning doesn't support Windows, but there's a workaround you can use to get it running. We intend to move to a new model for PDF partitioning in the near future that should be more Windows-friendly.