huridocs / pdf-text-extraction

Apache License 2.0

PDF Text Extraction

A Docker-powered service for extracting text from PDF documents


This project extracts text from PDF files using the output of the pdf-document-layout-analysis service. It relies on that service's segmentation and classification of page content to automate text extraction.

The pdf-document-layout-analysis service is available here:

https://github.com/huridocs/pdf-document-layout-analysis

Quick Start

Start the service:

# With GPU support
make start

# Without GPU support [if you do not have a GPU on your system]
make start_no_gpu

Get the segments from a PDF:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080

Get only the text:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text

To stop the server:

make stop

Contents

Dependencies

Requirements

Usage

As mentioned in the Quick Start, you can call the service like this:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080

This returns the analysis results from the pdf-document-layout-analysis service directly. The output is a list of SegmentBox elements, each with this shape:

    {
        "left": Left position of the segment
        "top": Top position of the segment
        "width": Width of the segment
        "height": Height of the segment
        "page_number": Page number the segment belongs to
        "text": Text inside the segment
        "type": Type of the segment (one of the types listed below)
    }
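For readers consuming this output from code, here is a minimal Python sketch that loads the response into typed records. It assumes the service returns the list as a JSON array whose objects carry exactly the fields shown above. The `SegmentBox` dataclass and the `parse_segments` helper are illustrative names, not part of the service, and the sample values are made up.

```python
import json
from dataclasses import dataclass


@dataclass
class SegmentBox:
    """One segment, mirroring the fields listed above."""
    left: float
    top: float
    width: float
    height: float
    page_number: int
    text: str
    type: str


def parse_segments(payload: str) -> list[SegmentBox]:
    """Parse a JSON array of segment objects (assumed response format)."""
    return [SegmentBox(**item) for item in json.loads(payload)]


# Made-up sample payload, for illustration only.
sample = (
    '[{"left": 72.0, "top": 90.5, "width": 450.0, "height": 24.0, '
    '"page_number": 1, "text": "Annual Report", "type": "Title"}]'
)

segments = parse_segments(sample)
print(segments[0].type, segments[0].text)
```

If the real response carries additional fields, `SegmentBox(**item)` will raise a `TypeError`, so the dataclass would need to be extended to match.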

You can also pass the types of the SegmentBoxes you want to extract:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080 -F "types=text title section_header list_item"

These are the types you can pass:

   "Caption"
   "Footnote"
   "Formula"
   "List_Item"
   "Page_Footer"
   "Page_Header"
   "Picture"
   "Section_Header"
   "Table"
   "Text"
   "Title"
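As a rough sketch, the `types` form value could be assembled and checked in Python before sending a request. The helper name `build_types_field` is hypothetical. Lowercasing the names follows the curl examples in this document; whether the service also accepts other casings is not stated here, so treat that as an assumption.

```python
# Type names from the list above, lowercased to match the curl examples
# (assumption: the service expects the lowercase, space-separated form).
ALLOWED_TYPES = {
    "caption", "footnote", "formula", "list_item", "page_footer",
    "page_header", "picture", "section_header", "table", "text", "title",
}


def build_types_field(types):
    """Return the space-separated value for the `types` form field."""
    normalized = [t.strip().lower() for t in types]
    unknown = [t for t in normalized if t not in ALLOWED_TYPES]
    if unknown:
        raise ValueError(f"Unknown segment types: {unknown}")
    return " ".join(normalized)


# prints: text title section_header list_item
print(build_types_field(["Text", "Title", "Section_Header", "List_Item"]))
```

Rejecting unknown names locally catches typos before the PDF is uploaded.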

If you only want the text content as a single string, use this command:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text

This returns only the text content. As before, you can pass the types you want to extract:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text -F "types=text title section_header list_item"

To get results faster, at the cost of slightly less accurate results, run this command:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/fast

To get only the text content with the fast method:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text_fast

You can pass the types to these endpoints too:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/fast -F "types=text section_header list_item"

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text_fast -F "types=text title"
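The four endpoints differ along two options: segments or plain text, and default or fast. The Python sketch below shows one way to wrap them. The function names, the `BASE_URL` constant, and the timeout value are assumptions made for illustration; the upload itself uses the third-party `requests` package, which is not part of this project.

```python
BASE_URL = "http://localhost:5080"  # host and port used in the curl examples


def endpoint_url(text_only=False, fast=False, base_url=BASE_URL):
    """Map the two options onto the four endpoints shown above."""
    if text_only:
        path = "/text_fast" if fast else "/text"
    else:
        path = "/fast" if fast else ""
    return base_url + path


def extract(pdf_path, types=None, text_only=False, fast=False):
    """POST a PDF to the service and return the raw response body."""
    import requests  # third-party; imported lazily, assumed to be installed

    # Same space-separated `types` value as in the curl examples.
    data = {"types": " ".join(types)} if types else None
    with open(pdf_path, "rb") as pdf_file:
        response = requests.post(
            endpoint_url(text_only, fast),
            files={"file": pdf_file},
            data=data,
            timeout=600,  # arbitrary generous limit for large documents
        )
    response.raise_for_status()
    return response.text


# prints: http://localhost:5080/text_fast
print(endpoint_url(text_only=True, fast=True))
```

With the service running, `extract("/PATH/TO/PDF/pdf_name.pdf", types=["text", "title"], text_only=True, fast=True)` would correspond to the last curl command above.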

For more information about the models and the fast method, see the pdf-document-layout-analysis documentation.

To stop the server, run:

make stop