gkamradt / langchain-tutorials

Overview and tutorial of the LangChain Library

UnstructuredPDFLoader zipfile.BadZipFile: File is not a zip file #18

Closed by yamyamyuo 1 year ago

yamyamyuo commented 1 year ago

Hi there, I was trying the "Ask A Book Questions" tutorial, but I got stuck on the third line, data = loader.load(). Do you have any idea why it says my document is not a zip file? It is actually loading a PDF. Here is the stack trace:

Traceback (most recent call last):
  File "/Users/serena/Documents/langchain-tutorials/data_generation/chatPDF.py", line 5, in <module>
    data = loader.load()
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/unstructured.py", line 61, in load
    elements = self._get_elements()
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/pdf.py", line 27, in _get_elements
    from unstructured.partition.pdf import partition_pdf
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/partition/pdf.py", line 19, in <module>
    from unstructured.partition.text import partition_text
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/partition/text.py", line 16, in <module>
    from unstructured.partition.text_type import (
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/partition/text_type.py", line 21, in <module>
    from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/nlp/tokenize.py", line 32, in <module>
    _download_nltk_package_if_not_present(package_name, package_category)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/nlp/tokenize.py", line 21, in _download_nltk_package_if_not_present
    nltk.find(f"{package_category}/{package_name}")
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/data.py", line 555, in find
    return find(modified_name, paths)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/data.py", line 542, in find
    return ZipFilePathPointer(p, zipentry)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/data.py", line 394, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/data.py", line 935, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/zipfile.py", line 1257, in __init__
    self._RealGetContents()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/zipfile.py", line 1324, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

gkamradt commented 1 year ago

Unstructured gives a ton of people problems. I'm going to edit the code and give people more options.

Thanks for bringing this up. Take a look at the code in a couple of hours and I'll have it up.

gkamradt commented 1 year ago

Just updated https://github.com/gkamradt/langchain-tutorials/blob/main/data_generation/Ask%20A%20Book%20Questions.ipynb
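
A note for anyone landing here with the same trace: the last frames are inside nltk/data.py, where nltk.find opens a zip archive from the NLTK data directory, so the file being rejected is probably an NLTK resource archive rather than the PDF itself. One plausible cause, not confirmed in this thread, is a partial or corrupted NLTK download. Below is a minimal sketch for spotting such archives using only the standard library. The helper name find_bad_zips is hypothetical, and ~/nltk_data is assumed to be where the data lives; your install may use a different path.

```python
import zipfile
from pathlib import Path


def find_bad_zips(data_dir):
    """Return .zip files under data_dir that are not valid zip archives.

    Hypothetical helper for illustration; it is not part of NLTK,
    unstructured, or LangChain.
    """
    root = Path(data_dir).expanduser()
    if not root.is_dir():
        # Nothing to scan if the directory does not exist.
        return []
    # zipfile.is_zipfile returns False for truncated or non-zip files.
    return sorted(p for p in root.rglob("*.zip") if not zipfile.is_zipfile(p))


if __name__ == "__main__":
    # Assumption: NLTK data is stored in the per-user directory ~/nltk_data.
    for path in find_bad_zips("~/nltk_data"):
        print(f"corrupt archive: {path}")
```

If this prints any paths, deleting those files and re-downloading the affected resource with nltk.download may clear the error. Switching to a loader that does not depend on Unstructured avoids the NLTK import path entirely.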