Closed botchagalupe closed 3 months ago
Answer generated by a 🤖
Hello,
Based on the error message you provided, it seems like the issue is related to the 'partition_pdf' function not being defined. This function is part of the Pdfminer library, which is used in the LangChain framework.
From the context provided, there was a similar issue reported in the past (Loading pdf files from directory gives the following error). The suggested solution was to switch your Python version from 3.10 to 3.9. This is because Pdfminer, which is used in the LangChain framework, was last updated in 2019 and works on Python 3.7, 3.8, and 3.9.
If you are currently using Python 3.10, I would recommend downgrading to Python 3.9 and see if that resolves the issue. You can do this by creating a new virtual environment with Python 3.9. Here's how you can do it:
python3.9 -m venv env
source env/bin/activate
pip install langchain==0.0.251
If you are already using Python 3.9 or lower and still encountering this issue, it would be helpful if you could provide more information about your environment, such as the exact version of LangChain you are using and any recent changes you made to your code or environment.
I hope this helps! Let me know if you have any other questions or if the issue persists.
This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
It's been running successfully on 3.10 for several weeks without any changes. I'm pretty sure this is an error related to a recent change in Langchain.
Answer generated by a 🤖
Hello,
The error you're encountering, NameError: name 'partition_pdf' is not defined, is likely due to the unstructured library not being installed or not being in the Python path. The partition_pdf function is imported from the unstructured library in the UnstructuredPDFLoader class in your script.
To resolve this issue, you need to ensure that the unstructured library is installed and accessible in your Python environment. You can install it using pip:
pip install unstructured
If the library is already installed, make sure that it's in the Python path. You can check your Python path by running the following command in your Python environment:
import sys
print(sys.path)
If the path to the unstructured library is not in the output, you need to add it. You can do this by appending the path to the unstructured library to the sys.path list:
sys.path.append('/path/to/unstructured')
Replace '/path/to/unstructured' with the actual path to the unstructured library in your system. After doing this, you should be able to import the partition_pdf function and the NameError should be resolved.
To give you a bit more context, the 'partition_pdf' function is used in the 'UnstructuredPDFLoader' class in the LangChain codebase. This function is imported from the 'unstructured.partition.pdf' module. The purpose of this function is to split the PDF document into different elements such as Title and NarrativeText. This is used when the 'UnstructuredPDFLoader' is run in "elements" mode. The function takes the file path of the PDF document as an argument, along with any additional keyword arguments passed to the 'UnstructuredPDFLoader'.
I hope this helps! Let me know if you have any other questions or if there's anything else I can do to assist you.
unstructured is installed.
/usr/lib/python3.10# pip show unstructured
Name: unstructured
Version: 0.9.0
Summary: A library that prepares raw documents for downstream ML tasks.
Home-page: https://github.com/Unstructured-IO/unstructured
Author: Unstructured Technologies
Author-email: devops@unstructuredai.io
License: Apache-2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: chardet, filetype, lxml, nltk, python-magic, requests, tabulate
Required-by:
Hi @botchagalupe,
If you have a full stack trace that would be more helpful (the one included in your previous response is abbreviated).
Based on that stack trace I'd check whether this is an unstructured issue, since the exception originates from unstructured complaining that it cannot resolve a function.
You could try upgrading/downgrading unstructured -- it's possible that unstructured got upgraded while bumping langchain version.
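Before upgrading or downgrading, it may help to record which versions are actually installed in the environment that raises the error. The following is a minimal sketch using only the standard library; the helper name installed_versions is made up for illustration, and the package names are the two discussed in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this interpreter's environment.
            found[name] = None
    return found


# Report the two packages discussed in this thread.
report = installed_versions(["langchain", "unstructured"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this in both the failing and the working environment (for example a notebook kernel and a terminal) should show whether a silent upgrade of unstructured is the difference between them.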
Thanks For looking into this...
Does this help?
Error loading documents
Traceback (most recent call last):
File "
I'm experiencing exactly the same issue. It appeared with no code change, just an update of langchain and, I presume, the unstructured library.
After further experiments, I think the error is caused by the new releases of unstructured. Downgrading to: unstructured==0.7.12 Solved the issue. Waiting for the official solution.
That worked. Thank you.
doing the same.. trying to run a Langchain project.. uninstalling and trying % python3 -m pip install unstructured==0.7.12
Downgrading unstructured worked so far, thx.
Thank you @FirstGalacticEmpire !
Worked for me too. Thanks so much !
Thanks
Also worked for me, thanks!
I was running into this issue. For me the problem was that unstructured has several dependencies for parsing PDFs, but it doesn't expose that very well. The import statement for partition_pdf sits behind an if/else check that pdf2image, pdfminer, and PIL are all installed; if they aren't, it doesn't throw an error, it simply doesn't import the module.
Once I installed pdf2image, pdfminer (and its subpackage pdfminer.six), and Pillow, everything ran as expected.
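A quick way to see which of those modules is missing is to probe for them without importing them. This is only a sketch of the kind of check described above, not the code unstructured itself runs; the helper name missing_modules is hypothetical, and the three import names are the ones mentioned in this comment.

```python
import importlib.util

# Import names from the comment above. PIL is the import name that the
# Pillow package provides.
PDF_IMPORTS = ["pdf2image", "pdfminer", "PIL"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(PDF_IMPORTS)
if missing:
    print("Missing PDF dependencies:", ", ".join(missing))
else:
    print("All PDF dependencies were found.")
```

If anything is reported missing, installing it into the same interpreter that runs the loader (python -m pip install ...) is the step that resolved the error in the comment above.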
this worked for me. Thanks!
I'm getting this error in my jupyter notebook. I tried the above solution of putting in a specific version of unstructured that @FirstGalacticEmpire referenced. I also tried the additional packages that @tarockey suggested. Still nothing.
What's even stranger is there are no PDFs in the directory, so I wonder why it's even spending any time going through this. Below is the error I'm getting:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
Cell In[15], line 2
1 loader = DirectoryLoader("./articles","*.txt")
----> 2 index = VectorstoreIndexCreator().from_loaders([loader])
File ~/Library/Python/3.9/lib/python/site-packages/langchain/indexes/vectorstore.py:81, in VectorstoreIndexCreator.from_loaders(self, loaders)
79 docs = []
80 for loader in loaders:
---> 81 docs.extend(loader.load())
82 return self.from_documents(docs)
File ~/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/directory.py:156, in DirectoryLoader.load(self)
154 else:
155 for i in items:
--> 156 self.load_file(i, p, docs, pbar)
158 if pbar:
159 pbar.close()
File ~/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/directory.py:105, in DirectoryLoader.load_file(self, item, path, docs, pbar)
103 logger.warning(f"Error loading file {str(item)}: {e}")
104 else:
--> 105 raise e
106 finally:
107 if pbar:
File ~/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/directory.py:99, in DirectoryLoader.load_file(self, item, path, docs, pbar)
97 try:
98 logger.debug(f"Processing file: {str(item)}")
---> 99 sub_docs = self.loader_cls(str(item), **self.loader_kwargs).load()
100 docs.extend(sub_docs)
101 except Exception as e:
File ~/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/unstructured.py:86, in UnstructuredBaseLoader.load(self)
84 def load(self) -> List[Document]:
85 """Load file."""
---> 86 elements = self._get_elements()
87 self._post_process_elements(elements)
88 if self.mode == "elements":
File ~/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/unstructured.py:172, in UnstructuredFileLoader._get_elements(self)
169 def _get_elements(self) -> List:
170 from unstructured.partition.auto import partition
--> 172 return partition(filename=self.file_path, **self.unstructured_kwargs)
File ~/Library/Python/3.9/lib/python/site-packages/unstructured/partition/auto.py:361, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, xml_keep_tags, data_source_metadata, metadata_filename, **kwargs)
352 elements = _partition_md(
353 filename=filename,
354 file=file,
(...)
358 **kwargs,
359 )
360 elif filetype == FileType.PDF:
--> 361 _partition_pdf = _get_partition_with_extras("pdf")
362 elements = _partition_pdf(
363 filename=filename, # type: ignore
364 file=file, # type: ignore
(...)
370 **kwargs,
371 )
372 elif (filetype == FileType.PNG) or (filetype == FileType.JPG) or (filetype == FileType.TIFF):
File ~/Library/Python/3.9/lib/python/site-packages/unstructured/partition/auto.py:113, in _get_partition_with_extras(doc_type, partition_with_extras_map)
111 _partition_func = partition_with_extras_map.get(doc_type)
112 if _partition_func is None:
--> 113 raise ImportError(
114 f"partition_{doc_type} is not available. "
115 f"Install the {doc_type} dependencies with "
116 f'pip install "unstructured[{doc_type}]"',
117 )
118 return _partition_func
ImportError: partition_pdf is not available. Install the pdf dependencies with pip install "unstructured[pdf]"
I restarted my kernel and this particular error disappeared (but I'm getting other dependency issues).
I was wondering why this shows up as an error in my Jupyter notebook when I have no issues running the same code in PyCharm.
just FYI latest error is
ImportError: cannot import name 'PDFResourceManager' from 'pdfminer.converter' (/Users/xxxyyyzzz/Library/Python/3.9/lib/python/site-packages/pdfminer/converter.py)
After further experiments, I think the error is caused by the new releases of unstructured. Downgrading to: unstructured==0.7.12 Solved the issue. Waiting for the official solution.
Really helpful. Thanks for your comment
Terrific, works for me too. Please fix the unstructured issue.
I am currently building a chatbot from PDF files and was having the same issue. One workaround I found was to use pypdf: langchain.document_loaders has PyPDFLoader for single files and PyPDFDirectoryLoader for loading multiple files from the same folder. This loads the files without triggering the partition_pdf error. Hope this helps.
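A rough sketch of that workaround, assuming langchain and pypdf are installed. The wrapper name load_pdf_directory and the ./pdfs path are placeholders; only PyPDFDirectoryLoader comes from the comment above.

```python
def load_pdf_directory(directory: str):
    """Load every PDF under `directory` with the pypdf-based loader."""
    # Imported lazily so this sketch can be defined even where LangChain is absent.
    from langchain.document_loaders import PyPDFDirectoryLoader

    loader = PyPDFDirectoryLoader(directory)
    # load() returns a list of Document objects.
    return loader.load()


# Example call (placeholder path):
# documents = load_pdf_directory("./pdfs")
# print("Number of documents:", len(documents))
```

This path does not go through unstructured at all, so it sidesteps the error rather than fixing it; the extracted text may also differ from what the unstructured-based loader would produce.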
It's a problem with LangChain 0.1.3 (latest) when used with Unstructured 0.12.2 (latest).
Everything works correctly when using partition_pdf separately, so it's installed and available.
from unstructured.partition.pdf import partition_pdf
test_elements = partition_pdf("xxx.pdf")
🤖
Thank you for providing the full stack trace. It seems like the issue is indeed related to the unstructured library. The partition_pdf function is not being recognized, which is causing the NameError.
The partition_pdf function is part of the unstructured library and is used to partition PDF documents into different elements. This function is used in the UnstructuredPDFLoader class in LangChain, specifically in the _get_elements method.
Here is the relevant code from the UnstructuredPDFLoader class:
def _get_elements(self) -> List:
    from unstructured.partition.pdf import partition_pdf
    return partition_pdf(filename=self.file_path, **self.unstructured_kwargs)
As you can see, the partition_pdf function is imported from the unstructured.partition.pdf module and is used to partition the PDF document.
Given that the unstructured library is installed in your environment (as confirmed by the output of pip show unstructured), it's possible that there might be an issue with the specific version of the unstructured library you have installed.
You could try downgrading or upgrading the unstructured library to see if that resolves the issue. Here are the commands to do so:
To upgrade:
pip install unstructured --upgrade
To downgrade:
pip install unstructured==<previous_version>
Replace <previous_version> with the version number of the unstructured library that was working for you.
If the issue persists after trying these steps, it would be helpful to know the exact version of LangChain and unstructured you're using. This will help in further diagnosing the issue.
If you want to continue the conversation, start your reply with @dosu-bot.
@dosu-bot As I wrote above, I am using the latest versions, LangChain 0.1.3 and Unstructured 0.12.2. The environment is Python 3.10.5. Because importing from unstructured.partition.pdf import partition_pdf separately (outside of from langchain.document_loaders import DirectoryLoader) works correctly, I tried to explain this behavior.
The error message ImportError: partition_pdf is not available. Install the pdf dependencies with pip install "unstructured[pdf]" comes from the call _partition_pdf = _get_partition_with_extras("pdf") in the unstructured > partition > auto.py code.
def _get_partition_with_extras(
    doc_type: str,
    partition_with_extras_map: Optional[Dict[str, Callable]] = None,
):
    if partition_with_extras_map is None:
        partition_with_extras_map = PARTITION_WITH_EXTRAS_MAP
    _partition_func = partition_with_extras_map.get(doc_type)
    if _partition_func is None:
        raise ImportError(
            f"partition_{doc_type} is not available. "
            f"Install the {doc_type} dependencies with "
            f'pip install "unstructured[{doc_type}]"',
        )
    return _partition_func
and
pdf_imports = ["pdf2image", "pdfminer", "PIL"]
if all(dependency_exists(dep) for dep in pdf_imports):
    from unstructured.partition.pdf import partition_pdf
    PARTITION_WITH_EXTRAS_MAP["pdf"] = partition_pdf
The required dependencies are ["pdf2image", "pdfminer", "PIL"]. One of them, PIL, stands for Python Imaging Library, the original library that enabled Python to deal with images. PIL was discontinued in 2011 and only supported Python 2.
That also explains the problem, because PIL can't be installed in a Python v3 environment. The Pillow library should probably replace it. That dependency condition is not applied when you import from unstructured.partition.pdf import partition_pdf separately.
I don't know the complete history of the unstructured library, but it seems that its latest version doesn't support Python v3.
As I described in my previous comment, it seems that the problem is on the Unstructured library side, and here is the issue I have created in its repository: https://github.com/Unstructured-IO/unstructured/issues/2456. It would be nice of you to support a solution to the problem and express your opinion there. Finally, some consequences affect the LangChain side also.
After further experiments, I think the error is caused by the new releases of unstructured. Downgrading to: unstructured==0.7.12 Solved the issue. Waiting for the official solution.
Getting below error on installing this version.
register_finder(pkgutil.ImpImporter, find_on_path)
^^^^^^^^^^^^^^^^^^^
AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
I am also encountering the same error while installing unstructured==0.7.12. Any suggestions on this?
System Info
I have been running this code for weeks, and today it looks like something changed to break it. I'm using the following code snippets...
from langchain.document_loaders import DirectoryLoader
...
loader = DirectoryLoader(directory_path, glob='**/*.pdf')
documents = loader.load()
This is the error I am getting...
NameError Traceback (most recent call last) in <cell line: 2>()
1 loader = DirectoryLoader(directory_path, glob='**/*.pdf')
----> 2 documents = loader.load()
3 print("Number of documents: ", len(documents))
4
5 timestampit()
5 frames
/usr/local/lib/python3.10/dist-packages/unstructured/partition/auto.py in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, pdf_infer_table_structure, xml_keep_tags, data_source_metadata, **kwargs)
219 )
220 elif filetype == FileType.PDF:
--> 221 elements = partition_pdf(
222 filename=filename, # type: ignore
223 file=file, # type: ignore
NameError: name 'partition_pdf' is not defined
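For readers puzzled by a NameError rather than an ImportError: a conditionally guarded import can leave a name unbound without any complaint at import time, and the failure only shows up when the name is finally called. The sketch below reproduces that pattern in isolation. It assumes, as a comment in this thread reports, that the import of partition_pdf was guarded by a dependency check; the module name some_missing_pdf_backend_xyz is deliberately fictitious.

```python
import importlib.util

# Stand-in for an optional dependency check; this module does not exist.
HAS_PDF_DEPS = importlib.util.find_spec("some_missing_pdf_backend_xyz") is not None

if HAS_PDF_DEPS:
    # Skipped silently when the dependency is absent, so the name is never bound.
    from some_missing_pdf_backend_xyz import partition_pdf


def partition(filename: str):
    # Nothing has failed up to this point; the unbound name is only looked up here.
    return partition_pdf(filename=filename)


try:
    partition("example.pdf")
except NameError as exc:
    message = str(exc)
    print(message)  # name 'partition_pdf' is not defined
```

The later ImportError with the pip install "unstructured[pdf]" hint, seen in newer tracebacks in this thread, looks like the same condition reported more explicitly.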
Who can help?
No response
Information
- [ ] The official example notebooks/scripts
- [x] My own modified scripts
Related Components
- [ ] LLMs/Chat Models
- [ ] Embedding Models
- [ ] Prompts / Prompt Templates / Prompt Selectors
- [ ] Output Parsers
- [x] Document Loaders
- [ ] Vector Stores / Retrievers
- [ ] Memory
- [ ] Agents / Agent Executors
- [ ] Tools / Toolkits
- [ ] Chains
- [ ] Callbacks/Tracing
- [ ] Async
Reproduction
loader = DirectoryLoader(directory_path, glob='**/*.pdf')
documents = loader.load()
Expected behavior
Loaded documents.
pip install unstructured==0.7.12