langchain-ai / langchain

πŸ¦œπŸ”— Build context-aware reasoning applications
https://python.langchain.com
MIT License
92.67k stars 14.83k forks

UnpicklingError: pickle data was truncated #6850

Closed heavenkiller2018 closed 11 months ago

heavenkiller2018 commented 1 year ago

System Info

❯ pip list | grep unstructured
unstructured        0.7.9
❯ pip list | grep langchain
langchain           0.0.215
langchainplus-sdk   0.0.17

Who can help?

No response

Information

Related Components

Reproduction

from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("../modules/tk.txt")
document = loader.load()

errors:

UnpicklingError                           Traceback (most recent call last)
Cell In[11], line 3
      1 from langchain.document_loaders import UnstructuredFileLoader
      2 loader = UnstructuredFileLoader("../modules/tk.txt")
----> 3 document = loader.load()

File ~/micromamba/envs/openai/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py:71, in UnstructuredBaseLoader.load(self)
     69 def load(self) -> List[Document]:
     70     """Load file."""
---> 71     elements = self._get_elements()
     72     if self.mode == "elements":
     73         docs: List[Document] = list()

File ~/micromamba/envs/openai/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py:133, in UnstructuredFileLoader._get_elements(self)
    130 def _get_elements(self) -> List:
    131     from unstructured.partition.auto import partition
--> 133     return partition(filename=self.file_path, **self.unstructured_kwargs)

File ~/micromamba/envs/openai/lib/python3.11/site-packages/unstructured/partition/auto.py:193, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, ssl_verify, ocr_languages, pdf_infer_table_structure, xml_keep_tags, data_source_metadata, **kwargs)
    183     elements = partition_image(
    184         filename=filename,  # type: ignore
    185         file=file,  # type: ignore
   (...)
    190         **kwargs,
    191     )
    192 elif filetype == FileType.TXT:
--> 193     elements = partition_text(
    194         filename=filename,
    195         file=file,
    196         encoding=encoding,
    197         paragraph_grouper=paragraph_grouper,
    198         **kwargs,
    199     )
    200 elif filetype == FileType.RTF:
    201     elements = partition_rtf(
    202         filename=filename,
    203         file=file,
    204         include_page_breaks=include_page_breaks,
    205         **kwargs,
    206     )

File ~/micromamba/envs/openai/lib/python3.11/site-packages/unstructured/documents/elements.py:118, in process_metadata.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    116 @wraps(func)
    117 def wrapper(*args, **kwargs):
--> 118     elements = func(*args, **kwargs)
    119     sig = inspect.signature(func)
    120     params = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/micromamba/envs/openai/lib/python3.11/site-packages/unstructured/file_utils/filetype.py:493, in add_metadata_with_filetype.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    491 @wraps(func)
    492 def wrapper(*args, **kwargs):
--> 493     elements = func(*args, **kwargs)
    494     sig = inspect.signature(func)
    495     params = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/micromamba/envs/openai/lib/python3.11/site-packages/unstructured/partition/text.py:92, in partition_text(filename, file, text, encoding, paragraph_grouper, metadata_filename, include_metadata, **kwargs)
     89 ctext = ctext.strip()
     91 if ctext:
---> 92     element = element_from_text(ctext)
     93     element.metadata = metadata
     94     elements.append(element)

File ~/micromamba/envs/openai/lib/python3.11/site-packages/unstructured/partition/text.py:104, in element_from_text(text)
    102 elif is_us_city_state_zip(text):
    103     return Address(text=text)
--> 104 elif is_possible_narrative_text(text):
    105     return NarrativeText(text=text)
    106 elif is_possible_title(text):

File ~/micromamba/envs/openai/lib/python3.11/site-packages/unstructured/partition/text_type.py:86, in is_possible_narrative_text(text, cap_threshold, non_alpha_threshold, language, language_checks)
     83 if under_non_alpha_ratio(text, threshold=non_alpha_threshold):
     84     return False
---> 86 if (sentence_count(text, 3) < 2) and (not contains_verb(text)) and language == "en":
     87     trace_logger.detail(f"Not narrative. Text does not contain a verb:\n\n{text}")  # type: ignore # noqa: E501
     88     return False

File ~/micromamba/envs/openai/lib/python3.11/site-packages/unstructured/partition/text_type.py:189, in contains_verb(text)
    186 if text.isupper():
    187     text = text.lower()
--> 189 pos_tags = pos_tag(text)
    190 return any(tag in POS_VERB_TAGS for _, tag in pos_tags)

File ~/micromamba/envs/openai/lib/python3.11/site-packages/unstructured/nlp/tokenize.py:57, in pos_tag(text)
     55 for sentence in sentences:
     56     tokens = _word_tokenize(sentence)
---> 57     parts_of_speech.extend(_pos_tag(tokens))
     58 return parts_of_speech

File ~/micromamba/envs/openai/lib/python3.11/site-packages/nltk/tag/__init__.py:165, in pos_tag(tokens, tagset, lang)
    140 def pos_tag(tokens, tagset=None, lang="eng"):
    141     """
    142     Use NLTK's currently recommended part of speech tagger to
    143     tag the given list of tokens.
   (...)
    163     :rtype: list(tuple(str, str))
    164     """
--> 165     tagger = _get_tagger(lang)
    166     return _pos_tag(tokens, tagset, tagger, lang)

File ~/micromamba/envs/openai/lib/python3.11/site-packages/nltk/tag/__init__.py:107, in _get_tagger(lang)
    105     tagger.load(ap_russian_model_loc)
    106 else:
--> 107     tagger = PerceptronTagger()
    108 return tagger

File ~/micromamba/envs/openai/lib/python3.11/site-packages/nltk/tag/perceptron.py:169, in PerceptronTagger.__init__(self, load)
    165 if load:
    166     AP_MODEL_LOC = "file:" + str(
    167         find("taggers/averaged_perceptron_tagger/" + PICKLE)
    168     )
--> 169     self.load(AP_MODEL_LOC)

File ~/micromamba/envs/openai/lib/python3.11/site-packages/nltk/tag/perceptron.py:252, in PerceptronTagger.load(self, loc)
    246 def load(self, loc):
    247     """
    248     :param loc: Load a pickled model at location.
    249     :type loc: str
    250     """
--> 252     self.model.weights, self.tagdict, self.classes = load(loc)
    253     self.model.classes = self.classes

File ~/micromamba/envs/openai/lib/python3.11/site-packages/nltk/data.py:755, in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
    753     resource_val = opened_resource.read()
    754 elif format == "pickle":
--> 755     resource_val = pickle.load(opened_resource)
    756 elif format == "json":
    757     import json

UnpicklingError: pickle data was truncated

How can this be fixed?
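A sketch of a likely fix, under the assumption the traceback suggests: an earlier, interrupted nltk.download() left a truncated averaged_perceptron_tagger pickle in ~/nltk_data, so pickle.load sees truncated data. Deleting the cached copy and re-downloading it usually clears the error. The helper names below are illustrative, not part of any library:

```python
import os
import shutil

def tagger_cache_paths(nltk_data_home=None):
    """Candidate cache locations for the perceptron tagger data.

    nltk_data_home defaults to ~/nltk_data, NLTK's usual cache directory.
    """
    home = nltk_data_home or os.path.join(os.path.expanduser("~"), "nltk_data")
    return [
        os.path.join(home, "taggers", "averaged_perceptron_tagger"),
        os.path.join(home, "taggers", "averaged_perceptron_tagger.zip"),
    ]

def redownload_tagger(nltk_data_home=None):
    """Remove a possibly-corrupt tagger cache, then fetch a fresh copy."""
    for path in tagger_cache_paths(nltk_data_home):
        if os.path.isdir(path):
            shutil.rmtree(path)   # the unzipped directory holding the bad pickle
        elif os.path.isfile(path):
            os.remove(path)       # a partially-downloaded zip archive
    import nltk  # imported lazily so the cleanup runs even without nltk installed
    nltk.download("averaged_perceptron_tagger", download_dir=nltk_data_home)
```

After calling redownload_tagger(), re-running loader.load() should partition the text file without hitting the truncated pickle.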

Expected behavior

No error: loader.load() should return the documents.

heavenkiller2018 commented 1 year ago

Also, layoutparser[layoutmodels,tesseract] cannot be installed correctly:


pip install layoutparser[layoutmodels,tesseract]

errors:

...
Collecting iopath (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference])
  Using cached iopath-0.1.10.tar.gz (42 kB)
  Preparing metadata (setup.py) ...
Collecting pdfplumber (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference])
  Downloading pdfplumber-0.9.0-py3-none-any.whl (46 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46.1/46.1 kB 9.4 MB/s eta 0:00:00
Requirement already satisfied: torch in /home/john/micromamba/envs/openai/lib/python3.11/site-packages (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference]) (2.0.1)
Requirement already satisfied: torchvision in /home/john/micromamba/envs/openai/lib/python3.11/site-packages (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference]) (0.15.2)
Collecting effdet (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference])
  Downloading effdet-0.4.1-py3-none-any.whl (112 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 112.5/112.5 kB 19.0 MB/s eta 0:00:00
Collecting pytesseract (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference])
  Downloading pytesseract-0.3.10-py3-none-any.whl (14 kB)
Requirement already satisfied: coloredlogs in /home/john/micromamba/envs/openai/lib/python3.11/site-packages (from onnxruntime->unstructured-inference==0.5.1->unstructured[local-inference]) (15.0.1)
Requirement already satisfied: flatbuffers in /home/john/micromamba/envs/openai/lib/python3.11/site-packages (from onnxruntime->unstructured-inference==0.5.1->unstructured[local-inference]) (23.5.26)
INFO: pip is looking at multiple versions of onnxruntime to determine which version is compatible with other requirements. This could take a while.
Collecting onnxruntime (from unstructured-inference==0.5.1->unstructured[local-inference])
  Downloading onnxruntime-1.15.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.9/5.9 MB 21.3 MB/s eta 0:00:00
Collecting layoutparser[layoutmodels,tesseract] (from unstructured-inference==0.5.1->unstructured[local-inference])
  Downloading layoutparser-0.3.3-py3-none-any.whl (19.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.2/19.2 MB 17.8 MB/s eta 0:00:00
  Downloading layoutparser-0.3.2-py3-none-any.whl (19.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.2/19.2 MB 12.8 MB/s eta 0:00:00
  Downloading layoutparser-0.3.1-py3-none-any.whl (19.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.2/19.2 MB 17.1 MB/s eta 0:00:00
INFO: pip is looking at multiple versions of onnxruntime to determine which version is compatible with other requirements. This could take a while.
  Downloading layoutparser-0.3.0-py3-none-any.whl (19.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.2/19.2 MB 21.1 MB/s eta 0:00:00
  Downloading layoutparser-0.2.0-py3-none-any.whl (19.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.1/19.1 MB 18.1 MB/s eta 0:00:00
WARNING: layoutparser 0.2.0 does not provide the extra 'layoutmodels'
WARNING: layoutparser 0.2.0 does not provide the extra 'tesseract'
WARNING: layoutparser 0.2.0 does not provide the extra 'layoutmodels'
WARNING: layoutparser 0.2.0 does not provide the extra 'tesseract'
  Downloading layoutparser-0.1.3-py3-none-any.whl (19.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.1/19.1 MB 21.8 MB/s eta 0:00:00
WARNING: layoutparser 0.1.3 does not provide the extra 'layoutmodels'
WARNING: layoutparser 0.1.3 does not provide the extra 'tesseract'
Collecting pycocotools==2.0.1 (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference])
  Downloading pycocotools-2.0.1.tar.gz (23 kB)
  Preparing metadata (setup.py) ...
Collecting fvcore==0.1.1.post20200623 (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference])
  Downloading fvcore-0.1.1.post20200623.tar.gz (32 kB)
  Preparing metadata (setup.py) ...
Collecting yacs>=0.1.6 (from fvcore==0.1.1.post20200623->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference])
  Downloading yacs-0.1.8-py3-none-any.whl (14 kB)
Collecting portalocker (from fvcore==0.1.1.post20200623->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference])
  Using cached portalocker-2.7.0-py2.py3-none-any.whl (15 kB)
Collecting termcolor>=1.1 (from fvcore==0.1.1.post20200623->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference])
  Using cached termcolor-2.3.0-py3-none-any.whl (6.9 kB)
Requirement already satisfied: setuptools>=18.0 in /home/john/micromamba/envs/openai/lib/python3.11/site-packages (from pycocotools==2.0.1->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference]) (67.8.0)
Collecting cython>=0.27.3 (from pycocotools==2.0.1->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference])
  Using cached Cython-0.29.35-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (1.9 MB)
Requirement already satisfied: matplotlib>=2.1.0 in /home/john/micromamba/envs/openai/lib/python3.11/site-packages (from pycocotools==2.0.1->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference]) (3.6.3)
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. See https://pip.pypa.io/warnings/backtracking for guidance. If you want to abort this run, press Ctrl + C.
WARNING: layoutparser 0.1.3 does not provide the extra 'layoutmodels'
WARNING: layoutparser 0.1.3 does not provide the extra 'tesseract'
Collecting layoutparser[layoutmodels,tesseract] (from unstructured-inference==0.5.1->unstructured[local-inference])
  Downloading layoutparser-0.1.2-py3-none-any.whl (19.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.1/19.1 MB 11.3 MB/s eta 0:00:00
WARNING: layoutparser 0.1.2 does not provide the extra 'layoutmodels'
WARNING: layoutparser 0.1.2 does not provide the extra 'tesseract'
Collecting pycocotools (from layoutparser[layoutmodels,tesseract]->unstructured-inference==0.5.1->unstructured[local-inference])
  Using cached pycocotools-2.0.6.tar.gz (24 kB)
  Installing build dependencies ...
  Getting requirements to build wheel ...
  Preparing metadata (pyproject.toml) ...
WARNING: layoutparser 0.1.2 does not provide the extra 'layoutmodels'
WARNING: layoutparser 0.1.2 does not provide the extra 'tesseract'
Collecting layoutparser[layoutmodels,tesseract] (from unstructured-inference==0.5.1->unstructured[local-inference])
  Downloading layoutparser-0.1.1-py3-none-any.whl (19.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.1/19.1 MB 12.4 MB/s eta 0:00:00
WARNING: layoutparser 0.1.1 does not provide the extra 'layoutmodels'
WARNING: layoutparser 0.1.1 does not provide the extra 'tesseract'
INFO: pip is looking at multiple versions of layoutparser[layoutmodels,tesseract] to determine which version is compatible with other requirements. This could take a while.
  Downloading layoutparser-0.1.0-py3-none-any.whl (19.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.1/19.1 MB 11.9 MB/s eta 0:00:00
WARNING: layoutparser 0.1.0 does not provide the extra 'layoutmodels'
WARNING: layoutparser 0.1.0 does not provide the extra 'tesseract'
  Downloading layoutparser-0.0.1-py3-none-any.whl (10 kB)
WARNING: layoutparser 0.0.1 does not provide the extra 'layoutmodels'
WARNING: layoutparser 0.0.1 does not provide the extra 'tesseract'
WARNING: layoutparser 0.0.1 does not provide the extra 'layoutmodels'
WARNING: layoutparser 0.0.1 does not provide the extra 'tesseract'
WARNING: layoutparser 0.1.1 does not provide the extra 'layoutmodels'
WARNING: layoutparser 0.1.1 does not provide the extra 'tesseract'
WARNING: layoutparser 0.1.0 does not provide the extra 'layoutmodels'
WARNING: layoutparser 0.1.0 does not provide the extra 'tesseract'
ERROR: Cannot install layoutparser[layoutmodels,tesseract]==0.1.0 and layoutparser[layoutmodels,tesseract]==0.1.1 because these package versions have conflicting dependencies.

The conflict is caused by:
    layoutparser[layoutmodels,tesseract] 0.1.1 depends on torch==1.4
    layoutparser[layoutmodels,tesseract] 0.1.0 depends on torch==1.4

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
Note: you may need to restart the kernel to use updated packages.
zsh:1: no matches found: layoutparser[layoutmodels,tesseract]
Note: you may need to restart the kernel to use updated packages.
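Two separate problems show up in that log, sketched below with hedged fixes: pip backtracked into very old layoutparser releases (which hard-pin torch==1.4) until resolution became impossible, and the final zsh line failed because unquoted square brackets are glob patterns in zsh. Quoting the requirement fixes the latter; a version floor (the >=0.3 bound is an assumption, adjust to your stack) steers pip away from the ancient releases:

```shell
# zsh expands unquoted [...] as a filename glob; with no matching file it
# aborts with "no matches found" before pip ever runs, so quote the extras:
#
#   pip install 'layoutparser[layoutmodels,tesseract]'
#
# Adding a lower bound inside the same quotes keeps the resolver from
# backtracking into releases that hard-pin torch==1.4:
#
#   pip install 'layoutparser[layoutmodels,tesseract]>=0.3'
#
# The quoting rule applies to any requirement with extras, for example:
printf '%s\n' 'unstructured[local-inference]'
```

The same quoting is needed in Jupyter %pip cells when the kernel shells out to zsh.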
dosubot[bot] commented 12 months ago

Hi, @heavenkiller2018! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you are experiencing an UnpicklingError when using the UnstructuredFileLoader from the langchain library. The error message indicates that the pickle data is truncated. Additionally, you mentioned having trouble installing layoutparser[layoutmodels,tesseract] and provided the error message.

Before we close this issue, we wanted to check with you if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation!