langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

How to run Semi_structured_and_multi_modal_RAG.ipynb #12096

Closed ruanwz closed 7 months ago

ruanwz commented 11 months ago

When running on colab, it has following error:

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-3092bb1466a5> in <cell line: 7>()
      5 
      6 # Get elements
----> 7 raw_pdf_elements = partition_pdf(filename=path+"LLaVA.pdf",
      8                                  # Using pdf format to find embedded image blocks
      9                                  extract_images_in_pdf=True,

8 frames
/usr/local/lib/python3.10/dist-packages/unstructured_inference/models/base.py in get_model(model_name, **kwargs)
     70     else:
     71         raise UnknownModelException(f"Unknown model type: {model_name}")
---> 72     model.initialize(**initialize_params)
     73     models[model_name] = model
     74     return model

TypeError: UnstructuredYoloXModel.initialize() got an unexpected keyword argument 'extract_images_in_pdf'

https://github.com/langchain-ai/langchain/blob/5dbe456aae755e3190c46316102e772dfcb6e148/cookbook/Semi_structured_and_multi_modal_RAG.ipynb#L103

dosubot[bot] commented 11 months ago

🤖

Hello,

The error occurs because the extract_images_in_pdf keyword argument is not accepted by the UnstructuredYoloXModel.initialize() method. In your code, extract_images_in_pdf is passed to the partition_pdf function from the unstructured.partition.pdf module; as the traceback shows, that argument is then forwarded down to the model's initialize() call inside unstructured_inference, which is where the TypeError is raised.

Here is the relevant code snippet:

from unstructured.partition.pdf import partition_pdf

# Get elements
raw_pdf_elements = partition_pdf(filename=path+"LLaVA.pdf",
                                 extract_images_in_pdf=True,
                                 infer_table_structure=True, 
                                 chunking_strategy="by_title",
                                 max_characters=4000, 
                                 new_after_n_chars=3800, 
                                 combine_text_under_n_chars=2000,
                                 image_output_dir_path=path)

The extract_images_in_pdf argument specifies whether to extract embedded images from the PDF document. If you're encountering an error related to this argument, the most likely cause is that the version of the unstructured library (or of unstructured-inference) you have installed does not support it.
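As a quick diagnostic, you can print the installed versions of both packages using only the standard library (package names assumed to match their PyPI distributions):

```python
import importlib.metadata

for pkg in ("unstructured", "unstructured-inference"):
    try:
        print(pkg, importlib.metadata.version(pkg))
    except importlib.metadata.PackageNotFoundError:
        print(pkg, "not installed")
```

Comparing these against the versions the notebook was written for is usually the fastest way to confirm a mismatch.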

To resolve this issue, first try updating the unstructured library (and the unstructured-inference package it depends on) to the latest versions; a mismatch between the two can lead to keyword arguments being forwarded to a layer that does not accept them. If the problem persists, check the documentation or source code of the unstructured library to confirm that the partition_pdf function supports the extract_images_in_pdf argument in your installed version. If it is not supported, remove the argument from your function call.
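If you do need to strip unsupported arguments before a call, one general defensive pattern is to filter a kwargs dict against the target callable's signature. Note the caveat: this only helps when the target's signature is explicit (partition_pdf itself accepts **kwargs and forwards them, so upgrading remains the preferred fix). filter_supported_kwargs below is a hypothetical helper, not part of either library:

```python
import inspect

def filter_supported_kwargs(func, kwargs):
    """Return only the entries of kwargs that func's signature accepts."""
    params = inspect.signature(func).parameters
    # If func itself declares **kwargs, everything passes through unchanged.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in params}
```

For example, filter_supported_kwargs(partition_pdf, my_kwargs) would drop any keys partition_pdf's signature does not name, while leaving everything intact if it takes **kwargs.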

Regarding the partition_pdf function in the LangChain codebase, it is used to split a PDF document into different elements such as Title and NarrativeText. The function takes a filename and additional arguments for different unstructured settings. The function is called in the _get_elements method of the UnstructuredPDFLoader class.

Here is the relevant code snippet:

class UnstructuredPDFLoader(UnstructuredFileLoader):
    """Load `PDF` files using `Unstructured`.

    You can run the loader in one of two modes: "single" and "elements".
    If you use "single" mode, the document will be returned as a single
    langchain Document object. If you use "elements" mode, the unstructured
    library will split the document into elements such as Title and NarrativeText.
    You can pass in additional unstructured kwargs after mode to apply
    different unstructured settings.

    Examples
    --------
    from langchain.document_loaders import UnstructuredPDFLoader

    loader = UnstructuredPDFLoader(
        "example.pdf", mode="elements", strategy="fast",
    )
    docs = loader.load()

    References
    ----------
    https://unstructured-io.github.io/unstructured/bricks.html#partition-pdf
    """

    def _get_elements(self) -> List:
        from unstructured.partition.pdf import partition_pdf

        return partition_pdf(filename=self.file_path, **self.unstructured_kwargs)

In the context of the LangChain codebase, the partition_pdf function is used to split a PDF document into different elements when the UnstructuredPDFLoader is run in "elements" mode.
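This forwarding is why the original TypeError can surface so far from the call site: every extra keyword given to the loader travels down to partition_pdf untouched. A minimal stand-in (hypothetical names, no unstructured dependency) sketches the plumbing:

```python
def fake_partition_pdf(filename, **kwargs):
    """Stand-in for unstructured's partition_pdf: echoes what it received."""
    return {"filename": filename, **kwargs}

class FakePDFLoader:
    """Minimal sketch of UnstructuredPDFLoader's kwargs forwarding."""

    def __init__(self, file_path, mode="single", **unstructured_kwargs):
        self.file_path = file_path
        self.mode = mode
        self.unstructured_kwargs = unstructured_kwargs

    def _get_elements(self):
        # Mirrors the real loader: everything after `mode` is forwarded verbatim.
        return fake_partition_pdf(filename=self.file_path, **self.unstructured_kwargs)

loader = FakePDFLoader("example.pdf", mode="elements", extract_images_in_pdf=True)
print(loader._get_elements())
```

Because extract_images_in_pdf travels all the way down, whether it is valid depends entirely on the installed unstructured version, not on anything LangChain does.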

I hope this helps! If you have any other questions, feel free to ask.

Best, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


If you want to continue the conversation, start your reply with @dosu-bot.

dosubot[bot] commented 8 months ago

Hi, @ruanwz,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. The issue you raised was about a TypeError when running the "Semi_structured_and_multi_modal_RAG.ipynb" notebook on Google Colab, specifically related to the UnstructuredYoloXModel and the "extract_images_in_pdf" parameter. It seems that the issue is still unresolved, and there hasn't been any recent activity or updates on it.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you for your understanding and contribution to LangChain!