Closed ruanwz closed 7 months ago
🤖
Hello,
The error you're encountering occurs because `extract_images_in_pdf` is not a valid argument for the `UnstructuredYoloXModel.initialize()` method. In the provided context, however, the `extract_images_in_pdf` argument is passed to the `partition_pdf` function from the `unstructured.partition.pdf` module, not to `UnstructuredYoloXModel.initialize()`.
Here is the relevant code snippet:
```python
from unstructured.partition.pdf import partition_pdf

# Get elements
raw_pdf_elements = partition_pdf(
    filename=path + "LLaVA.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)
```
The `extract_images_in_pdf` argument specifies whether to extract images from the PDF document. If you're encountering an error related to this argument, the version of the `unstructured` library you're using may not support it.
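Since the advice below hinges on which version of `unstructured` is installed, a quick standard-library check (no specific version threshold assumed here) can confirm what your environment actually has before you upgrade:

```python
# Check the installed version of unstructured; an older release would reject
# newer partition_pdf arguments with a TypeError like the one in this issue.
from importlib import metadata

try:
    installed = metadata.version("unstructured")
except metadata.PackageNotFoundError:
    installed = None

print(installed or "unstructured is not installed")
```

On Colab you can run this in a cell before the `partition_pdf` call to see whether `pip install -U unstructured` actually changed the active version.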
To resolve this issue, try updating the `unstructured` library to the latest version. If the problem persists, check the documentation or source code of the `unstructured` library to see whether the `partition_pdf` function supports the `extract_images_in_pdf` argument. If it doesn't, remove that argument from your function call.
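If you need the same notebook to run against multiple library versions, one defensive pattern is to filter your kwargs against the target function's signature before calling it. This is a sketch, not part of unstructured's API; `fake_partition_pdf` below is a hypothetical stand-in for an older `partition_pdf`:

```python
import inspect

def accepted_kwargs(func, kwargs):
    """Drop any kwargs that func's signature does not accept."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)  # func takes **kwargs, so everything is accepted
    return {k: v for k, v in kwargs.items() if k in params}

# Hypothetical stand-in for a partition_pdf that predates extract_images_in_pdf:
def fake_partition_pdf(filename, infer_table_structure=False):
    return {"filename": filename, "infer_table_structure": infer_table_structure}

wanted = {"extract_images_in_pdf": True, "infer_table_structure": True}
safe = accepted_kwargs(fake_partition_pdf, wanted)
print(safe)  # → {'infer_table_structure': True}
```

With the real import in place you would call `partition_pdf(filename=..., **accepted_kwargs(partition_pdf, wanted))`, silently dropping arguments the installed version cannot handle.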
Regarding the `partition_pdf` function in the LangChain codebase: it is used to split a PDF document into elements such as `Title` and `NarrativeText`. The function takes a filename plus additional keyword arguments for different unstructured settings, and it is called in the `_get_elements` method of the `UnstructuredPDFLoader` class.
Here is the relevant code snippet:
```python
class UnstructuredPDFLoader(UnstructuredFileLoader):
    """Load `PDF` files using `Unstructured`.

    You can run the loader in one of two modes: "single" and "elements".
    If you use "single" mode, the document will be returned as a single
    langchain Document object. If you use "elements" mode, the unstructured
    library will split the document into elements such as Title and NarrativeText.

    You can pass in additional unstructured kwargs after mode to apply
    different unstructured settings.

    Examples
    --------
    from langchain.document_loaders import UnstructuredPDFLoader

    loader = UnstructuredPDFLoader(
        "example.pdf", mode="elements", strategy="fast",
    )
    docs = loader.load()

    References
    ----------
    https://unstructured-io.github.io/unstructured/bricks.html#partition-pdf
    """

    def _get_elements(self) -> List:
        from unstructured.partition.pdf import partition_pdf

        return partition_pdf(filename=self.file_path, **self.unstructured_kwargs)
```
In the context of the LangChain codebase, the `partition_pdf` function is used to split a PDF document into different elements when the `UnstructuredPDFLoader` is run in "elements" mode.
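This forwarding also explains why the unsupported kwarg surfaces as a `TypeError` deep inside unstructured rather than in LangChain's own code. Here is a self-contained sketch of that behavior; `partition_pdf_stub` and `LoaderSketch` are hypothetical stand-ins for the real function and class, not their actual implementations:

```python
def partition_pdf_stub(filename, **kwargs):
    # Stand-in for unstructured.partition.pdf.partition_pdf: echo what it receives.
    return {"filename": filename, **kwargs}

class LoaderSketch:
    """Hypothetical stand-in mirroring UnstructuredPDFLoader's kwarg forwarding."""

    def __init__(self, file_path, mode="single", **unstructured_kwargs):
        self.file_path = file_path
        self.mode = mode  # consumed by the loader itself, never forwarded
        self.unstructured_kwargs = unstructured_kwargs  # forwarded untouched

    def _get_elements(self):
        # A kwarg the underlying function does not accept raises a TypeError
        # here, inside the partition call, not in the loader's own code.
        return partition_pdf_stub(filename=self.file_path, **self.unstructured_kwargs)

loader = LoaderSketch("example.pdf", mode="elements", strategy="fast")
print(loader._get_elements())  # → {'filename': 'example.pdf', 'strategy': 'fast'}
```

Note that `mode` stays with the loader while `strategy` (and, in the failing notebook, `extract_images_in_pdf`) travels all the way to the partition function.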
I hope this helps! If you have any other questions, feel free to ask.
Best, Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
Hi, @ruanwz,
I'm helping the LangChain team manage their backlog and am marking this issue as stale. The issue you raised was about a TypeError when running the "Semi_structured_and_multi_modal_RAG.ipynb" notebook on Google Colab, specifically related to the UnstructuredYoloXModel and the "extract_images_in_pdf" parameter. It seems that the issue is still unresolved, and there hasn't been any recent activity or updates on it.
Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you for your understanding and contribution to LangChain!
When running on Colab, it produces the following error:
https://github.com/langchain-ai/langchain/blob/5dbe456aae755e3190c46316102e772dfcb6e148/cookbook/Semi_structured_and_multi_modal_RAG.ipynb#L103