ismail-mebsout / Parsing-PDFs-using-YOLOV3

Parsing pdf tables using YOLOV3

Parsing PDFs using YOLOV3

Many Python libraries can parse PDFs, and Camelot is one of the best. Although it performs well on text, it struggles with tables, especially those embedded inside paragraphs.
Camelot offers the option of specifying the regions to parse through the argument table_areas, a list of strings of the form "x1,y1,x2,y2", where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in PDF coordinate space. When this argument is provided, the results improve significantly.
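As a minimal sketch of how a region can be passed to Camelot, the snippet below builds the "x1,y1,x2,y2" string and hands it to camelot.read_pdf. The helper names format_table_area and read_tables_in_area are hypothetical, the "stream" flavor is an assumption (Camelot also accepts table_areas with its default "lattice" flavor), and the coordinates in the usage note are arbitrary.

```python
def format_table_area(x1, y1, x2, y2):
    """Build the 'x1,y1,x2,y2' string Camelot expects (top-left, bottom-right)."""
    return f"{x1},{y1},{x2},{y2}"


def read_tables_in_area(pdf_path, page, area, flavor="stream"):
    """Parse only the given region of one page (hypothetical helper)."""
    import camelot  # imported lazily so the formatting helper works without Camelot

    # table_areas is a list, so several regions can be passed for the same page
    return camelot.read_pdf(
        pdf_path,
        pages=str(page),
        flavor=flavor,
        table_areas=[area],
    )


area = format_table_area(316, 499, 566, 337)
print(area)
```

With Camelot installed, read_tables_in_area("pdfs/boeings.pdf", 2, area) would return a list of tables whose .df attribute holds each table as a pandas DataFrame.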

Explaining the basic idea

One way to automate the parsing of tables is to train an algorithm that returns the coordinates of the bounding box surrounding each table, as detailed in the following pipeline:

If the original PDF page is image-based, we can use ocrmypdf to turn it into a text-based one so that the text inside the table can be extracted. We then carry out the following operations:
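The OCR step can be sketched as a thin wrapper around the ocrmypdf command-line tool. The function names and file names here are hypothetical, and the use of --skip-text (which leaves pages that already contain text untouched) is an assumption about what suits this pipeline, not necessarily what the repository does.

```python
import subprocess


def build_ocr_command(input_pdf, output_pdf):
    """Return the ocrmypdf command that adds a text layer to an image-based PDF."""
    # --skip-text: do not OCR pages that already have text
    return ["ocrmypdf", "--skip-text", input_pdf, output_pdf]


def make_searchable(input_pdf, output_pdf):
    """Run ocrmypdf; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_ocr_command(input_pdf, output_pdf), check=True)


print(build_ocr_command("scan.pdf", "scan_ocr.pdf"))
```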


When a table is detected in the image of a PDF page, we expand the bounding box to guarantee that the table is fully included, as follows:
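A minimal sketch of this expansion, followed by the conversion from image pixels to PDF coordinates, could look like the code below. The function names, the fixed pixel margin, and the page sizes in the example are assumptions made for illustration. The conversion assumes the image origin is at the top-left and the PDF origin at the bottom-left, which is why the y axis is flipped.

```python
def expand_box(box, margin, img_w, img_h):
    """Grow (x1, y1, x2, y2) by `margin` pixels on each side, clamped to the image."""
    x1, y1, x2, y2 = box
    return (
        max(0, x1 - margin),
        max(0, y1 - margin),
        min(img_w, x2 + margin),
        min(img_h, y2 + margin),
    )


def image_box_to_pdf(box, img_w, img_h, pdf_w, pdf_h):
    """Map an image-space box (origin top-left) to PDF space (origin bottom-left)."""
    x1, y1, x2, y2 = box
    sx, sy = pdf_w / img_w, pdf_h / img_h  # pixels -> PDF points
    return (x1 * sx, pdf_h - y1 * sy, x2 * sx, pdf_h - y2 * sy)


expanded = expand_box((10, 20, 100, 200), margin=15, img_w=110, img_h=300)
pdf_box = image_box_to_pdf(expanded, img_w=110, img_h=300, pdf_w=55, pdf_h=150)
print(expanded, pdf_box)
```

The resulting tuple is ordered as (left, top, right, bottom) in PDF space, which is the order that table_areas expects.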

Tables detection

The algorithm used to detect tables is simply YOLOv3; I advise you to read my previous article about object detection. We fine-tune the algorithm to detect tables and retrain the whole architecture. To do so, we carry out the following steps:

Requirements

All Python requirements are listed in the file packages.txt; all you need to do is run the following command:

pip install -r packages.txt

Prediction

To run a prediction on a PDF page, use the following command:

python predict_table.py --pdf_path pdfs/boeings.pdf --page 2

It takes two arguments: --pdf_path, the path to the PDF file, and --page, the number of the page to process.
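For illustration, the two flags shown in the command above could be parsed with argparse as in the sketch below. This is not the repository's actual predict_table.py; the function name build_parser, the help strings, and the choice to make both flags required are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags used in the command above."""
    parser = argparse.ArgumentParser(
        description="Detect and parse tables on one page of a PDF."
    )
    parser.add_argument("--pdf_path", required=True, help="path to the PDF file")
    parser.add_argument("--page", type=int, required=True, help="page number to process")
    return parser


# Parse the same arguments as in the example command
args = build_parser().parse_args(["--pdf_path", "pdfs/boeings.pdf", "--page", "2"])
print(args.pdf_path, args.page)
```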

Examples

NB: following the same steps, we can train the algorithm to detect any other object in a PDF page, such as graphics and images, which can then be extracted from the page image.