`RPA.PDF`: Extracting tables from text PDF (new keyword)

cmin764 commented 1 year ago

With the current library, the `Get Text From PDF` keyword rarely outputs text in a layout that matches what we actually see in the PDF.

A new keyword that extracts tables with Camelot would cover the edge cases where this is needed.

Potential keyword name: Get Table From PDF
This will work with text-based PDFs only (not scanned images -- see the alternative below for images)
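A rough sketch of what such a keyword could wrap, assuming camelot-py's documented `read_pdf` function and the `data` property of its tables. The helper name `get_tables_from_pdf` and its `reader` parameter are hypothetical and not part of `RPA.PDF`; the `reader` hook only exists so the logic can be exercised without Camelot installed.

```python
from typing import Callable, List, Optional


def get_tables_from_pdf(
    path: str,
    pages: str = "1",
    flavor: str = "lattice",
    reader: Optional[Callable] = None,
) -> List[List[List[str]]]:
    """Hypothetical helper: return every detected table as rows of cell strings."""
    if reader is None:
        # Imported lazily so the heavy dependency is only needed when used.
        import camelot  # camelot-py, not bundled with rpaframework

        reader = camelot.read_pdf
    tables = reader(path, pages=pages, flavor=flavor)
    # Each Camelot table exposes its cells as a list of rows through `.data`.
    return [table.data for table in tables]
```

With Camelot installed, `get_tables_from_pdf("invoice.pdf", flavor="stream")` would try the whitespace-based parser, which Camelot offers for tables drawn without ruling lines; the file name here is only a placeholder.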

~As an additional improvement (& alternative), we can add OCR support to Get Text From PDF and if ocr=${True} is passed, then a screenshot of the PDF is captured (or simply converted to an image), then with the help of our rpaframework-recognition we'll extract the text so the final output will be in the same format as seen.~

Slack thread

cmin764 commented 1 year ago

Challenges on adding camelot-py into the library

- The camelot library uses the old, discontinued PyPDF2 (1.x), while we are now compatible with the new pypdf (3.x) only. Because of security issues we have already solved, we won't go back to the outdated one (nor use both).
- Both the development and base installations rely on a non-existent version of pdftopng (>=0.2.3), which has no wheels on PyPI (see Issue; and related).
- The camelot library requires cv2, numpy and pandas in order to parse PDFs, and we include cv2 (opencv-python) in rpaframework-recognition only. All these dependencies might be too heavy to introduce into rpaframework-pdf, which is included automatically in rpaframework.
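One way to check the dependency footprint described above is to read the requirements a distribution declares in its installed metadata. This is only a sketch using the standard library's `importlib.metadata`; the helper name `declared_requirements` is hypothetical, and it reports nothing unless the distribution (here assumed to be named `camelot-py`) is installed in the current environment.

```python
from importlib.metadata import PackageNotFoundError, requires
from typing import List


def declared_requirements(distribution: str) -> List[str]:
    """Return the requirement strings a distribution declares, or [] if absent."""
    try:
        # `requires` returns None when the metadata lists no requirements.
        return list(requires(distribution) or [])
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return []


if __name__ == "__main__":
    for requirement in declared_requirements("camelot-py"):
        print(requirement)
```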

Potential solution

- Fork the library into our own robocorp-camelot: proper semver requirement pins, a package on PyPI, and maintenance in sync with upstream.
- Refactor the fork to use the latest pypdf dependency and to disable the pdftopng usage, as it appears to be used in testing only.
- Drop the numpy and pandas dependencies if we don't really rely on them to get the raw table data out (needs Camelot codebase understanding, research and testing in order to decouple them from usage).

This might take at least one week, as I can't be sure about all the challenges before developing and testing.

cmin764 commented 1 year ago

Due to the complexity of integrating this directly into the library, it was decided not to do so. At most, we'll introduce this as an extension through a Portal example: https://github.com/robocorp/rpaframework/issues/790

robocorp / rpaframework

`RPA.PDF`: Extracting tables from text PDF (new keyword) #721
