Got error when running partition_pdf

sigurn2 commented 6 months ago

I ran your notebook, 01_semi_structured_data.ipynb, in Colab:

from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="statement_of_changes.pdf",
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=".",
)

and got the following error:

WARNING:unstructured:This function will be deprecated in a future release and `unstructured` will simply use the DEFAULT_MODEL from `unstructured_inference.model.base` to set default model name
---------------------------------------------------------------------------
UnidentifiedImageError                    Traceback (most recent call last)
[<ipython-input-10-c47946c825bc>](https://localhost:8080/#) in <cell line: 6>()
      4 from unstructured.partition.pdf import partition_pdf
      5 
----> 6 raw_pdf_elements = partition_pdf(
      7     filename="statement_of_changes.pdf",
      8     extract_images_in_pdf=False,

10 frames
[/usr/local/lib/python3.10/dist-packages/PIL/Image.py](https://localhost:8080/#) in open(fp, mode, formats)
   3281         fp.seek(0)
   3282     except (AttributeError, io.UnsupportedOperation):
-> 3283         fp = io.BytesIO(fp.read())
   3284         exclusive_fp = True
   3285 

UnidentifiedImageError: cannot identify image file '/tmp/tmpt9l2pd51/88be9f82-5a19-4ec0-baa1-a029cf45dfc4-1.ppm'

I have no idea how to resolve it.
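
A note for anyone debugging the same traceback: the `.ppm` file under `/tmp` looks like an intermediate page image written while the PDF is rasterized, which in this stack is typically done by Poppler's `pdftoppm`. The thread does not confirm the cause, so the following is only a diagnostic sketch. It assumes a missing or broken Poppler/Tesseract install is worth ruling out first. The helper name `check_pdf_tooling` is hypothetical.

```python
import shutil


def check_pdf_tooling(tools=("pdftoppm", "pdfinfo", "tesseract")):
    """Return a mapping of tool name -> resolved path, or None if not on PATH.

    The default tool list is an assumption about what a hi-res PDF
    partitioning run needs; adjust it for your environment.
    """
    return {name: shutil.which(name) for name in tools}


if __name__ == "__main__":
    for name, path in check_pdf_tooling().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

If any tool is reported as not found, installing it and restarting the runtime is a reasonable first step before looking at Pillow itself.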

zh-Wang286 commented 4 months ago

I actually ran your code: 01_semi_structured_data.ipynb in VSCode

raw_pdf_elements = partition_pdf(
    filename="statement_of_changes.pdf",
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=".",
)

and got the following error:

LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

I am using a server located in mainland China and cannot directly access Hugging Face. I noticed that the default model for detecting tables is "unstructuredio/yolo_x_layout".

Since infer_table_structure=True requires this model, I downloaded it manually and saved it to ~/.cache/huggingface/hub/models--unstructuredio--yolo_x_layout/blobs/yolox_l0.05.onnx. However, running the program still results in a LocalEntryNotFoundError.

I also tried specifying the local path manually:

from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="unstructuredio/yolo_x_layout",
    filename="yolox_l0.05.onnx",
    local_dir="/home/adminsiyu/.cache/huggingface/hub/models--unstructuredio--yolo_x_layout/blobs/yolox/yolox_l0.05",
)

but still encountered the LocalEntryNotFoundError.

Could you please advise on how to resolve this issue?
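
One possible reason the manual copy is not picked up: as far as I understand the Hugging Face cache layout, an offline lookup does not search `blobs/` by filename. It reads a commit hash from `refs/<revision>` and then looks for `snapshots/<commit>/<filename>`. The sketch below mimics that lookup using only the standard library, so you can see whether a hand-placed file would be found. It is an approximation of the behaviour, not the library's own code, and `resolve_from_cache` is a hypothetical helper name.

```python
from pathlib import Path


def resolve_from_cache(cache_root, repo_id, filename, revision="main"):
    """Approximate how a cached model file is located without network access.

    Assumed layout: <cache_root>/models--<org>--<name>/refs/<revision> holds a
    commit hash, and the file lives at snapshots/<commit>/<filename>.
    Returns the path if present, otherwise None.
    """
    repo_dir = Path(cache_root) / ("models--" + repo_id.replace("/", "--"))
    ref_file = repo_dir / "refs" / revision
    if not ref_file.is_file():
        return None  # no ref recorded, so the revision cannot be resolved
    commit = ref_file.read_text().strip()
    candidate = repo_dir / "snapshots" / commit / filename
    return candidate if candidate.exists() else None


if __name__ == "__main__":
    hub = Path.home() / ".cache" / "huggingface" / "hub"
    print(resolve_from_cache(hub, "unstructuredio/yolo_x_layout", "yolox_l0.05.onnx"))
```

If this prints `None`, a file dropped directly into `blobs/` is unlikely to be found. Letting the download run once through a reachable endpoint so the library builds the cache itself is usually simpler than recreating the layout by hand.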

sigurn2 commented 4 months ago

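Try running it on an Ubuntu machine. His code is a bit old. Use the hf-mirror mirror to speed up access to Hugging Face; a quick search will turn up tutorials. This code does run.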
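
A minimal sketch of the mirror suggestion, assuming the mirror is served at `https://hf-mirror.com` (the reply names only "hf-mirror", so the URL is an assumption) and that `huggingface_hub` reads the `HF_ENDPOINT` environment variable when it is imported. The helper name `use_hf_mirror` is hypothetical.

```python
import os


def use_hf_mirror(endpoint="https://hf-mirror.com"):
    """Point Hugging Face downloads at a mirror endpoint (assumed URL).

    Call this before importing huggingface_hub or unstructured, because the
    endpoint is typically read once at import time.
    """
    os.environ["HF_ENDPOINT"] = endpoint
    return os.environ["HF_ENDPOINT"]


use_hf_mirror()

# Import only after the variable is set, for example:
# from unstructured.partition.pdf import partition_pdf
```

In a notebook, this needs to run in the first cell, and the kernel should be restarted if the libraries were already imported.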

sugarforever / Advanced-RAG

Got error when running partition_pdf #2