robocorp / rpaframework

Collection of open-source libraries and tools for Robotic Process Automation (RPA), designed to be used with both Robot Framework and Python
https://www.rpaframework.org/
Apache License 2.0
1.17k stars, 227 forks

`RPA.PDF`: Extracting tables from text PDF (new keyword) #721

Closed: cmin764 closed this issue 1 year ago

cmin764 commented 1 year ago

With the current library, the `Get Text From PDF` keyword rarely outputs text in a format resembling what we actually see in the PDF.

A new keyword that extracts tables with Camelot would cover the edge cases where users ask for this.
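As a rough illustration of what such a keyword might wrap, here is a minimal Python sketch built on Camelot's `read_pdf` call. The function names `extract_tables` and `tables_to_rows` are hypothetical and not part of `RPA.PDF`; the sketch assumes `camelot-py` and its dependencies are installed and that the PDF is text-based rather than scanned.

```python
from typing import Any, List


def tables_to_rows(tables: Any) -> List[List[List[str]]]:
    """Convert Camelot tables into plain nested lists, one list of rows per table.

    Each Camelot table exposes its content as a pandas DataFrame under `.df`.
    """
    return [table.df.values.tolist() for table in tables]


def extract_tables(pdf_path: str, pages: str = "1", flavor: str = "stream"):
    """Hypothetical helper: extract tables from a text PDF with Camelot.

    `flavor="stream"` targets tables separated by whitespace, while
    `flavor="lattice"` targets tables drawn with ruling lines.
    """
    # Imported lazily so this module still loads when Camelot is not installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    return tables_to_rows(tables)
```

Calling `extract_tables("invoice.pdf", pages="1-2")` (with a placeholder file name) would return one nested list per detected table, which keeps the row and column layout that plain text extraction loses.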


~As an additional improvement (& alternative), we can add OCR support to Get Text From PDF and if ocr=${True} is passed, then a screenshot of the PDF is captured (or simply converted to an image), then with the help of our rpaframework-recognition we'll extract the text so the final output will be in the same format as seen.~

Slack thread

cmin764 commented 1 year ago

Challenges on adding camelot-py into the library

  1. The camelot library depends on the old, discontinued PyPDF2 (1.x), while we are now compatible only with the new pypdf (3.x). Given the security issues we already solved by migrating, we won't go back to the outdated package, nor will we use both.
  2. Both the development and base installations rely on pdftopng>=0.2.3, a non-existent version that has no wheels on PyPI. (see Issue; and related)
  3. The camelot library requires cv2, numpy and pandas in order to parse PDFs, and we currently include cv2 (opencv-python) in rpaframework-recognition only. These dependencies might be too heavy to introduce into rpaframework-pdf, which is installed automatically with rpaframework.

Potential solution

This might take at least one week, as I can't be sure about all the challenges prior to developing and testing.

cmin764 commented 1 year ago

Due to the complexity of integrating this directly into the library, it was decided not to do so. At most, we'll introduce this as an extension through a Portal example: https://github.com/robocorp/rpaframework/issues/790
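One way an extension of this kind could keep the heavy dependencies optional is to check for them at call time instead of at install time. The sketch below is only an assumption about how that might look: `missing_modules` and `require_table_support` are hypothetical names, and the module list simply mirrors the packages mentioned in this thread.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names of the optional packages discussed above (an illustrative list).
OPTIONAL_TABLE_DEPS = ("camelot", "cv2", "numpy", "pandas")


def missing_modules(names: Iterable[str] = OPTIONAL_TABLE_DEPS) -> List[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


def require_table_support() -> None:
    """Raise a descriptive error if any optional table-extraction module is absent."""
    missing = missing_modules()
    if missing:
        raise ImportError(
            "Table extraction needs extra packages; missing modules: "
            + ", ".join(missing)
        )
```

With a guard like this, the base installation stays light, and only robots that actually call the table extraction code need the extra packages in their environment.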