huridocs / pdf-document-layout-analysis

A Docker-powered service for PDF document layout analysis. It segments and classifies the parts of each PDF page, identifying elements such as text, titles, pictures, and tables.
Apache License 2.0
115 stars, 13 forks

windows file upload curl error: 'NoneType' object has no attribute 'file_name' #13

Closed: dragon0311 closed this issue 4 weeks ago

dragon0311 commented 4 months ago

Labels: bug


Describe the bug

2024-05-31 13:50:50 pdf-document-layout-analysis | Page-27
2024-05-31 13:50:50 pdf-document-layout-analysis | Page-28
2024-05-31 13:50:50 pdf-document-layout-analysis | Page-29
2024-05-31 13:50:50 pdf-document-layout-analysis | Page-30
2024-05-31 13:50:50 pdf-document-layout-analysis | Page-31
2024-05-31 13:50:50 pdf-document-layout-analysis | 2024-05-31 05:50:50,320 [ERROR] Error
2024-05-31 13:50:50 pdf-document-layout-analysis | Traceback (most recent call last):
2024-05-31 13:50:50 pdf-document-layout-analysis |   File "/app/src/app.py", line 24, in run
2024-05-31 13:50:50 pdf-document-layout-analysis |     return analyze_pdf(file.file.read())
2024-05-31 13:50:50 pdf-document-layout-analysis |   File "/app/src/analyze_pdf.py", line 55, in analyze_pdf
2024-05-31 13:50:50 pdf-document-layout-analysis |     pdf_images_list: list[PdfImages] = [PdfImages.from_pdf_path(pdf_path)]
2024-05-31 13:50:50 pdf-document-layout-analysis |   File "/app/src/PdfImages.py", line 45, in from_pdf_path
2024-05-31 13:50:50 pdf-document-layout-analysis |     pdf_features.file_name = pdf_name
2024-05-31 13:50:50 pdf-document-layout-analysis | AttributeError: 'NoneType' object has no attribute 'file_name'
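The last two traceback lines show that pdf_features was None when from_pdf_path tried to set file_name on it, so whatever produced that object must have failed without raising. Below is a minimal sketch of how a guard could turn this into a clearer error. PdfFeaturesStub, parse_pdf_stub, and load_features are hypothetical names made up for illustration; they are not the project's actual classes or functions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PdfFeaturesStub:
    """Hypothetical stand-in for the parsed-PDF object seen in the traceback."""
    file_name: str = ""


def parse_pdf_stub(pdf_path: str) -> Optional[PdfFeaturesStub]:
    """Hypothetical parser that returns None when it cannot handle the input."""
    return PdfFeaturesStub() if pdf_path.endswith(".pdf") else None


def load_features(pdf_path: str, pdf_name: str) -> PdfFeaturesStub:
    pdf_features = parse_pdf_stub(pdf_path)
    if pdf_features is None:
        # Fail with a descriptive message instead of an AttributeError later on.
        raise ValueError(f"Could not extract features from {pdf_path!r}")
    pdf_features.file_name = pdf_name
    return pdf_features
```

With a guard like this, an unparseable file would be reported by name rather than as an attribute error on None.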

To Reproduce Steps to reproduce the behavior:

  1. Upload a PDF on Windows using a Postman request (shown here as curl):
     curl --location 'localhost:5060' \
       --form 'file=@"/D:/文件/华信百科/30-中国通服华信设计〔2021〕118号 关于印发《华信咨询设计研究院有限公司青年英才计划实施方案(试行稿)》的通知.pdf"'
  2. See error


dragon0311 commented 4 months ago

@gabriel-piles I could use your advice on this.

ali6parmak commented 4 months ago

Hello,

This error usually relates to the PDF file itself: the file may be corrupted or broken, or there may be a problem with its path. So,

Please inform us about the outcomes!
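One quick way to rule out a missing or obviously broken file before uploading is to check that the path exists and that the file begins with the PDF header marker. This is only a rough sketch: looks_like_pdf is a hypothetical helper, and a file can pass this check and still fail to parse.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the path is an existing file that starts with '%PDF-'."""
    candidate = Path(path)
    if not candidate.is_file():
        return False
    with candidate.open("rb") as handle:
        # PDF files normally begin with the bytes '%PDF-' followed by a version.
        return handle.read(5) == b"%PDF-"
```

If this returns False, the problem is with the path or the file itself rather than with the service.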

dragon0311 commented 4 months ago

I've replaced it with another PDF file, but it still reports the same error. Could the cause be that Windows file path formats differ from Linux ones?

ali6parmak commented 4 months ago

Hi,

I tested the service on a Windows machine using "regular.pdf" from the test_pdfs directory, and I was able to get results without any issues. Can you try running the service as shown in the README, using this command:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060

You might encounter an error like:

Invoke-WebRequest : A parameter cannot be found that matches parameter name 'X'.

If this happens, please use the full path of "curl" as follows:

C:\Windows\System32\curl.exe -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060
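If shell quoting or the PowerShell curl alias keeps getting in the way, the same upload can be sketched in Python using only the standard library. The form field name file and the address localhost:5060 come from the curl command above; build_multipart and post_pdf are hypothetical helper names, and this is an untested-against-the-service sketch rather than an official client.

```python
import uuid
import urllib.request
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_pdf(pdf_path: str, url: str = "http://localhost:5060") -> bytes:
    """POST the PDF to the service and return the raw response body."""
    path = Path(pdf_path)
    body, content_type = build_multipart("file", path.name, path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


# Example call (requires the service to be running):
# print(post_pdf("test_pdfs/regular.pdf").decode("utf-8"))
```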

dragon0311 commented 4 months ago

When I replace the PDF file with test_pdfs/regular.pdf, the request returns correctly, which means there is nothing wrong with my request method or file path. So my previous PDF may differ in format somehow. Is it possible that there are many different types of PDFs?

ali6parmak commented 4 months ago

Hi,

To resolve the issue, we have updated one of the libraries within our project's requirements. To benefit from it, please pull the changes and run the service again. We believe this update should address the issue you encountered. If, after updating, you still encounter the same issue, please feel free to share the relevant PDF file and we can investigate it further to ensure everything is working as expected.

ssocean commented 3 months ago

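I haven't looked closely, but could this be a problem with Chinese characters in the path? Especially if the author uses a library such as opencv, which does not handle Chinese paths well.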
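One way to test the non-ASCII path hypothesis is to upload a copy of the same PDF stored under an ASCII-only name and see whether the error persists. The sketch below is only a diagnostic aid: ascii_safe_copy is a hypothetical helper, and nothing in this thread confirms that the path was the cause. Note that the temporary directory itself may still contain non-ASCII characters on some systems, for example in a Windows user name.

```python
import shutil
import tempfile
from pathlib import Path


def ascii_safe_copy(pdf_path: str) -> Path:
    """Return the original path if it is pure ASCII, else a copy named upload.pdf."""
    source = Path(pdf_path)
    if str(source).isascii():
        return source
    # Copy the bytes unchanged into a fresh temp directory under a plain name.
    target = Path(tempfile.mkdtemp(prefix="pdf_upload_")) / "upload.pdf"
    shutil.copyfile(source, target)
    return target
```

If the copy fails in the same way as the original, the file content is the more likely culprit.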