Getting NameError: name 'partition_pdf' is not defined when running "documents = loader.load()"

botchagalupe commented 1 year ago

System Info

I have been running this code for weeks, and today it looks like something changed to break it. I'm using the following code snippets...

from langchain.document_loaders import DirectoryLoader
...
loader = DirectoryLoader(directory_path, glob='*/.pdf')
documents = loader.load()

This is the error I am getting...

NameError                                 Traceback (most recent call last)
in <cell line: 2>()
      1 loader = DirectoryLoader(directory_path, glob='*/.pdf')
----> 2 documents = loader.load()
      3 print("Number of documents: ", len(documents))
      4
      5 timestampit()

5 frames
/usr/local/lib/python3.10/dist-packages/unstructured/partition/auto.py in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, pdf_infer_table_structure, xml_keep_tags, data_source_metadata, **kwargs)
    219         )
    220     elif filetype == FileType.PDF:
--> 221         elements = partition_pdf(
    222             filename=filename,  # type: ignore
    223             file=file,  # type: ignore

NameError: name 'partition_pdf' is not defined

Who can help?

No response

Information

[ ] The official example notebooks/scripts
[X] My own modified scripts

Related Components

[ ] LLMs/Chat Models
[ ] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[ ] Output Parsers
[X] Document Loaders
[ ] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[ ] Tools / Toolkits
[ ] Chains
[ ] Callbacks/Tracing
[ ] Async

Reproduction

loader = DirectoryLoader(directory_path, glob='*/.pdf')
documents = loader.load()

Expected behavior

Loaded documents.

dosubot[bot] commented 1 year ago

Answer generated by a 🤖

Answer

Hello,

Based on the error message you provided, it seems like the issue is related to the 'partition_pdf' function not being defined. This function is part of the Pdfminer library, which is used in the LangChain framework.

From the context provided, there was a similar issue reported in the past (Loading pdf files from directory gives the following error). The suggested solution was to switch your Python version from 3.10 to 3.9. This is because Pdfminer, which is used in the LangChain framework, was last updated in 2019 and works on Python 3.7, 3.8, and 3.9.

If you are currently using Python 3.10, I would recommend downgrading to Python 3.9 and see if that resolves the issue. You can do this by creating a new virtual environment with Python 3.9. Here's how you can do it:

python3.9 -m venv env
source env/bin/activate
pip install langchain==0.0.251

If you are already using Python 3.9 or lower and still encountering this issue, it would be helpful if you could provide more information about your environment, such as the exact version of LangChain you are using and any recent changes you made to your code or environment.

I hope this helps! Let me know if you have any other questions or if the issue persists.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

botchagalupe commented 1 year ago

It's been running successfully on 3.10 for several weeks without any changes. I'm pretty sure this is an error related to a recent change in Langchain.

dosubot[bot] commented 1 year ago

Answer generated by a 🤖

Answer

Hello,

The error you're encountering, NameError: name 'partition_pdf' is not defined, is likely due to the unstructured library not being installed or not being in the Python path. The partition_pdf function is imported from the unstructured library in the UnstructuredPDFLoader class in your script.

To resolve this issue, you need to ensure that the unstructured library is installed and accessible in your Python environment. You can install it using pip:

pip install unstructured

If the library is already installed, make sure that it's in the Python path. You can check your Python path by running the following command in your Python environment:

import sys
print(sys.path)

If the path to the unstructured library is not in the output, you need to add it. You can do this by appending the path to the unstructured library to the sys.path list:

sys.path.append('/path/to/unstructured')

Replace '/path/to/unstructured' with the actual path to the unstructured library in your system. After doing this, you should be able to import the partition_pdf function and the NameError should be resolved.

To give you a bit more context, the 'partition_pdf' function is used in the 'UnstructuredPDFLoader' class in the LangChain codebase. This function is imported from the 'unstructured.partition.pdf' module. The purpose of this function is to split the PDF document into different elements such as Title and NarrativeText. This is used when the 'UnstructuredPDFLoader' is run in "elements" mode. The function takes the file path of the PDF document as an argument, along with any additional keyword arguments passed to the 'UnstructuredPDFLoader'.

I hope this helps! Let me know if you have any other questions or if there's anything else I can do to assist you.

Sources

libs/langchain/langchain/document_loaders/pdf.py

botchagalupe commented 1 year ago

unstructured is installed.

/usr/lib/python3.10# pip show unstructured
Name: unstructured
Version: 0.9.0
Summary: A library that prepares raw documents for downstream ML tasks.
Home-page: https://github.com/Unstructured-IO/unstructured
Author: Unstructured Technologies
Author-email: devops@unstructuredai.io
License: Apache-2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: chardet, filetype, lxml, nltk, python-magic, requests, tabulate
Required-by:

eyurtsev commented 1 year ago

Hi @botchagalupe,

A full stack trace would be more helpful (the one included in your previous response is abbreviated).

Based on that stack trace, I'd check whether this is an unstructured issue, since the exception originates from unstructured, which is complaining that it cannot resolve a function.

You could try upgrading/downgrading unstructured -- it's possible that unstructured got upgraded while bumping langchain version.
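
A quick way to check whether unstructured changed alongside langchain is to print the installed versions of both. This is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical and not part of either package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Compare these against the versions from the last known-good run.
    for name, ver in installed_versions(["langchain", "unstructured"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Recording this output next to a working run makes it easier to see which package moved when something breaks without a code change.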

botchagalupe commented 1 year ago

Thanks for looking into this...

Does this help?

Error loading documents
Traceback (most recent call last):
  File "", line 5, in <cell line: 3>
    documents = loader.load()
  File "/usr/local/lib/python3.10/dist-packages/langchain/document_loaders/directory.py", line 137, in load
    self.load_file(i, p, docs, pbar)
  File "/usr/local/lib/python3.10/dist-packages/langchain/document_loaders/directory.py", line 94, in load_file
    raise e
  File "/usr/local/lib/python3.10/dist-packages/langchain/document_loaders/directory.py", line 88, in load_file
    sub_docs = self.loader_cls(str(item), **self.loader_kwargs).load()
  File "/usr/local/lib/python3.10/dist-packages/langchain/document_loaders/unstructured.py", line 86, in load
    elements = self._get_elements()
  File "/usr/local/lib/python3.10/dist-packages/langchain/document_loaders/unstructured.py", line 171, in _get_elements
    return partition(filename=self.file_path, **self.unstructured_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unstructured/partition/auto.py", line 221, in partition
    elements = partition_pdf(
NameError: name 'partition_pdf' is not defined

FirstGalacticEmpire commented 1 year ago

I'm experiencing exactly the same issue. It appeared with no code change, just an update of langchain and, I presume, of the unstructured library.

FirstGalacticEmpire commented 1 year ago

After further experiments, I think the error is caused by the new releases of unstructured. Downgrading to unstructured==0.7.12 solved the issue. Waiting for the official solution.

botchagalupe commented 1 year ago

That worked. Thank you.

johnda98 commented 1 year ago

Doing the same while trying to run a LangChain project: uninstalling and trying

% python3 -m pip install unstructured==0.7.12

TheFrancho commented 1 year ago

After further experiments, I think the error is caused by the new releases of unstructured. Downgrading to: unstructured==0.7.12 Solved the issue. Waiting for the official solution.

Downgrading unstructured worked so far, thx.

keiru517 commented 1 year ago

Thank you @FirstGalacticEmpire !

ghost commented 1 year ago

After further experiments, I think the error is caused by the new releases of unstructured. Downgrading to: unstructured==0.7.12 Solved the issue. Waiting for the official solution.

Worked for me too. Thanks so much !

keiru517 commented 1 year ago

Thanks

On Mon, Aug 7, 2023 at 5:38 AM siidny @.***> wrote:

After further experiments, I think the error is caused by the new releases of unstructured. Downgrading to: unstructured==0.7.12 Solved the issue. Waiting for the official solution.

Worked for me too. Thanks so much !

pcschreiber1 commented 1 year ago

Also worked for me, thanks!

bobbyfongprivate commented 1 year ago

After further experiments, I think the error is caused by the new releases of unstructured. Downgrading to: unstructured==0.7.12 Solved the issue. Waiting for the official solution.

Worked for me too. Thanks so much !

tarockey commented 1 year ago

I was running into this issue - for me the problem was that unstructured has several dependencies for cracking PDFs, but it doesn't expose that very well. The import statement for partition_pdf has an if else check to see if pdf2image, pdfminer, and PIL are all installed, but if they aren't it doesn't throw an error, it simply doesn't import the module.

Once I installed pdf2image, pdfminer (and its subpackage pdfminer.six), and pillow, everything ran as expected.
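
The silent skip described above can be checked directly. This is a minimal sketch, assuming the three import names mentioned in the comment (pdf2image, pdfminer, PIL) are the ones unstructured probes for; missing_modules is a hypothetical helper, not part of unstructured.

```python
from importlib.util import find_spec

# Import names taken from the comment above. In Python 3 environments,
# pdfminer is normally provided by pdfminer.six and PIL by Pillow.
PDF_IMPORTS = ["pdf2image", "pdfminer", "PIL"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(PDF_IMPORTS)
    if missing:
        print("Missing PDF dependencies:", ", ".join(missing))
    else:
        print("All PDF dependencies are importable.")
```

Run it with the same interpreter that runs the loader; anything it lists as missing would need to be installed before partition_pdf can be registered.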

lhayana commented 1 year ago

I was running into this issue - for me the problem was that unstructured has several dependencies for cracking PDFs, but it doesn't expose that very well. The import statement for partition_pdf has an if else check to see if pdf2image, pdfminer, and PIL are all installed, but if they aren't it doesn't throw an error, it simply doesn't import the module.

Once I installed pdf2image, pdfminer (and its subpackage pdfminer.six), and pillow, everything ran as expected.

this worked for me. Thanks!

hsnaderi commented 11 months ago

I'm getting this error in my jupyter notebook. I tried the above solution of putting in a specific version of unstructured that @FirstGalacticEmpire referenced. I also tried the additional packages that @tarockey suggested. Still nothing.

What's even stranger is there are no PDFs in the directory, so I wonder why it's even spending any time going through this. Below is the error I'm getting:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[15], line 2
      1 loader = DirectoryLoader("./articles","*.txt")
----> 2 index = VectorstoreIndexCreator().from_loaders([loader])

File ~/Library/Python/3.9/lib/python/site-packages/langchain/indexes/vectorstore.py:81, in VectorstoreIndexCreator.from_loaders(self, loaders)
     79 docs = []
     80 for loader in loaders:
---> 81     docs.extend(loader.load())
     82 return self.from_documents(docs)

File ~/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/directory.py:156, in DirectoryLoader.load(self)
    154 else:
    155     for i in items:
--> 156         self.load_file(i, p, docs, pbar)
    158 if pbar:
    159     pbar.close()

File ~/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/directory.py:105, in DirectoryLoader.load_file(self, item, path, docs, pbar)
    103         logger.warning(f"Error loading file {str(item)}: {e}")
    104     else:
--> 105         raise e
    106 finally:
    107     if pbar:

File ~/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/directory.py:99, in DirectoryLoader.load_file(self, item, path, docs, pbar)
     97 try:
     98     logger.debug(f"Processing file: {str(item)}")
---> 99     sub_docs = self.loader_cls(str(item), **self.loader_kwargs).load()
    100     docs.extend(sub_docs)
    101 except Exception as e:

File ~/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/unstructured.py:86, in UnstructuredBaseLoader.load(self)
     84 def load(self) -> List[Document]:
     85     """Load file."""
---> 86     elements = self._get_elements()
     87     self._post_process_elements(elements)
     88     if self.mode == "elements":

File ~/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/unstructured.py:172, in UnstructuredFileLoader._get_elements(self)
    169 def _get_elements(self) -> List:
    170     from unstructured.partition.auto import partition
--> 172     return partition(filename=self.file_path, **self.unstructured_kwargs)

File ~/Library/Python/3.9/lib/python/site-packages/unstructured/partition/auto.py:361, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, xml_keep_tags, data_source_metadata, metadata_filename, **kwargs)
    352     elements = _partition_md(
    353         filename=filename,
    354         file=file,
   (...)
    358         **kwargs,
    359     )
    360 elif filetype == FileType.PDF:
--> 361     _partition_pdf = _get_partition_with_extras("pdf")
    362     elements = _partition_pdf(
    363         filename=filename,  # type: ignore
    364         file=file,  # type: ignore
   (...)
    370         **kwargs,
    371     )
    372 elif (filetype == FileType.PNG) or (filetype == FileType.JPG) or (filetype == FileType.TIFF):

File ~/Library/Python/3.9/lib/python/site-packages/unstructured/partition/auto.py:113, in _get_partition_with_extras(doc_type, partition_with_extras_map)
    111 _partition_func = partition_with_extras_map.get(doc_type)
    112 if _partition_func is None:
--> 113     raise ImportError(
    114         f"partition_{doc_type} is not available. "
    115         f"Install the {doc_type} dependencies with "
    116         f'pip install "unstructured[{doc_type}]"',
    117     )
    118 return _partition_func

ImportError: partition_pdf is not available. Install the pdf dependencies with pip install "unstructured[pdf]"

hsnaderi commented 11 months ago

I restarted my kernel and this particular error disappeared (but I'm getting other dependency issues).

I was wondering why this shows up as an error in my Jupyter notebook when I have no issues running the same code in PyCharm.
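
One possible explanation (an assumption, not confirmed anywhere in this thread) is that the notebook kernel and PyCharm run different Python interpreters with different site-packages. Running a sketch like the following in both places would show whether that is the case; environment_report is a hypothetical helper name.

```python
import sys
from importlib.util import find_spec


def environment_report(module_name="unstructured"):
    """Describe the running interpreter and where a module would be imported from."""
    spec = find_spec(module_name)
    return {
        "executable": sys.executable,
        "python": sys.version.split()[0],
        "module": module_name,
        # None means the module cannot be found by this interpreter.
        "location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If the executable or location differs between the two environments, packages installed for one are not necessarily visible to the other.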

hsnaderi commented 11 months ago

just FYI latest error is

ImportError: cannot import name 'PDFResourceManager' from 'pdfminer.converter' (/Users/xxxyyyzzz/Library/Python/3.9/lib/python/site-packages/pdfminer/converter.py)

truongquoc commented 9 months ago

After further experiments, I think the error is caused by the new releases of unstructured. Downgrading to: unstructured==0.7.12 Solved the issue. Waiting for the official solution.

Really helpful. Thanks for your comment

dionis commented 9 months ago

Terrific, works for me too. Please fix the unstructured issue.

SaishSaw commented 8 months ago

I am currently building a chatbot from PDF files and was having the same issue. One workaround I found is to use pypdf: langchain.document_loaders has PyPDFLoader for single files and PyPDFDirectoryLoader for loading multiple files from the same folder. This loads the files without triggering the partition_pdf error. Hope this helps.
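
A minimal sketch of that workaround, assuming langchain and pypdf are installed. load_pdf_folder is a hypothetical helper and "./docs" is a placeholder path; in newer LangChain releases the same class may need to be imported from langchain_community.document_loaders instead.

```python
def load_pdf_folder(folder):
    """Load every PDF in `folder` with pypdf instead of unstructured."""
    # Imported lazily so this module can be imported without langchain installed.
    from langchain.document_loaders import PyPDFDirectoryLoader

    loader = PyPDFDirectoryLoader(folder)
    # The pypdf-based loaders return one Document per PDF page.
    return loader.load()


# Example usage (placeholder directory):
#   documents = load_pdf_folder("./docs")
#   print("Number of documents:", len(documents))
```

Because this path never calls unstructured's partition function, it avoids the partition_pdf lookup entirely, at the cost of pypdf's plainer text extraction.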

mlc-mlapis commented 8 months ago

The problem occurs with LangChain 0.1.3 (latest) when using Unstructured 0.12.2 (latest). Everything works correctly when partition_pdf is used on its own, so it is installed and available.

from unstructured.partition.pdf import partition_pdf
test_elements = partition_pdf("xxx.pdf")

dosubot[bot] commented 8 months ago

🤖

Thank you for providing the full stack trace. It seems like the issue is indeed related to the unstructured library. The partition_pdf function is not being recognized, which is causing the NameError.

The partition_pdf function is part of the unstructured library and is used to partition PDF documents into different elements. This function is used in the UnstructuredPDFLoader class in LangChain, specifically in the _get_elements method.

Here is the relevant code from the UnstructuredPDFLoader class:

def _get_elements(self) -> List:
    from unstructured.partition.pdf import partition_pdf
    return partition_pdf(filename=self.file_path, **self.unstructured_kwargs)

As you can see, the partition_pdf function is imported from the unstructured.partition.pdf module and is used to partition the PDF document.

Given that the unstructured library is installed in your environment (as confirmed by the output of pip show unstructured), it's possible that there might be an issue with the specific version of the unstructured library you have installed.

You could try downgrading or upgrading the unstructured library to see if that resolves the issue. Here are the commands to do so:

To upgrade:

pip install unstructured --upgrade

To downgrade:

pip install unstructured==<previous_version>

Replace <previous_version> with the version number of the unstructured library that was working for you.

If the issue persists after trying these steps, it would be helpful to know the exact version of LangChain and unstructured you're using. This will help in further diagnosing the issue.

mlc-mlapis commented 8 months ago

@dosu-bot As I wrote above, I am using the latest versions: LangChain 0.1.3 and Unstructured 0.12.2. The environment is Python 3.10.5. Importing partition_pdf directly (from unstructured.partition.pdf import partition_pdf), outside of the DirectoryLoader code path, works correctly, so I tried to explain that behavior.

The error message ImportError: partition_pdf is not available. Install the pdf dependencies with pip install "unstructured[pdf]" comes from the call _partition_pdf = _get_partition_with_extras("pdf") of the unstructured > partition > auto.py code.

def _get_partition_with_extras(
    doc_type: str,
    partition_with_extras_map: Optional[Dict[str, Callable]] = None,
):
    if partition_with_extras_map is None:
        partition_with_extras_map = PARTITION_WITH_EXTRAS_MAP
    _partition_func = partition_with_extras_map.get(doc_type)
    if _partition_func is None:
        raise ImportError(
            f"partition_{doc_type} is not available. "
            f"Install the {doc_type} dependencies with "
            f'pip install "unstructured[{doc_type}]"',
        )
    return _partition_func

and

pdf_imports = ["pdf2image", "pdfminer", "PIL"]
if all(dependency_exists(dep) for dep in pdf_imports):
    from unstructured.partition.pdf import partition_pdf

    PARTITION_WITH_EXTRAS_MAP["pdf"] = partition_pdf

The required dependencies are ["pdf2image", "pdfminer", "PIL"]. One of them, PIL, stands for Python Imaging Library, the original library that enabled Python to deal with images. PIL was discontinued in 2011 and only supported Python 2.

That also explains the problem because PIL can't be installed in the Python v3 environment. The Pillow library should probably replace it. That dependency condition is not applied when you import from unstructured.partition.pdf import partition_pdf separately.

I don't know the complete history of the unstructured library, but it seems that its latest version doesn't support Python v3.

mlc-mlapis commented 8 months ago

As I described in my previous comment, it seems that the problem is on the Unstructured library side, and here is the issue I have created in its repository: https://github.com/Unstructured-IO/unstructured/issues/2456. It would be nice of you to support a solution to the problem and express your opinion there. Finally, some consequences affect the LangChain side also.

raksrivastava commented 6 months ago

After further experiments, I think the error is caused by the new releases of unstructured. Downgrading to: unstructured==0.7.12 Solved the issue. Waiting for the official solution.

Getting the error below when installing this version.

register_finder(pkgutil.ImpImporter, find_on_path)
                ^^^^^^^^^^^^^^^^^^^
AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

infogain-rishabh commented 6 months ago

After further experiments, I think the error is caused by the new releases of unstructured. Downgrading to: unstructured==0.7.12 Solved the issue. Waiting for the official solution.

Getting the error below when installing this version.

register_finder(pkgutil.ImpImporter, find_on_path)
                ^^^^^^^^^^^^^^^^^^^
AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

I am also encountering the same error while installing unstructured==0.7.12. Any suggestions on this?
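
A possible cause, offered as an assumption rather than a confirmed diagnosis: pkgutil.ImpImporter was removed in Python 3.12, and older setuptools/pkg_resources code still references it, which can break source builds of old pinned releases. This is a minimal sketch to check the running interpreter; has_imp_importer is a hypothetical helper.

```python
import pkgutil
import sys


def has_imp_importer():
    """Return True if this interpreter still ships pkgutil.ImpImporter."""
    return hasattr(pkgutil, "ImpImporter")


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    if has_imp_importer():
        print("pkgutil.ImpImporter is available.")
    else:
        print("pkgutil.ImpImporter is missing; old build tooling may fail here.")
```

If it is missing, installing in an environment with an older interpreter (the original report used Python 3.10) is one thing to try.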

wahidpanda commented 1 month ago

pip install unstructured==0.7.12
