Closed cmin764 closed 1 year ago
- `camelot` library uses the old discontinued `PyPDF2` (1.x) and we're now compatible with the new `pypdf` (3.x) only; due to security issues we already solved, we won't go back to the outdated one (nor use both).
- `pdftopng>=0.2.3` which doesn't have wheels in PyPI. (see Issue; and related)
- `camelot` library requires `cv2`, `numpy` and `pandas` in order to be able to parse PDFs (and we include `cv2` (opencv-python) in `rpaframework-recognition` only). All these deps might be too heavy to introduce into `rpaframework-pdf`, which is included automatically in `rpaframework`.
- `robocorp-camelot`: proper semver requirements pins, package in PyPI, maintenance with the upstream
- `pypdf` dep and to disable the `pdftopng` usage as it looks like it is used in testing only.
- `numpy` and `pandas` dependencies if we really don't rely on those to get the raw table data out. (needs Camelot codebase understanding, research and testing in order to decouple those from usage)

Might take at least one week, as I can't be sure about all the challenges prior to developing and testing.
Due to the complexity of integrating this directly into the library, it was decided not to do so. At most we'll be introducing this as an extension by a Portal example: https://github.com/robocorp/rpaframework/issues/790
With the current library, the `Get Text From PDF` keyword barely outputs correct text with a similar format to what we actually see in the PDF. So a new keyword for extracting the tables with Camelot will cover those edge cases asking for this.

`Get Table From PDF`
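
To make the idea more concrete, here is a minimal sketch of how such a keyword could be prototyped as a standalone extension outside `rpaframework-pdf`. It assumes Camelot's `camelot.read_pdf` function and the `.df` attribute of the tables it returns. The function name `get_tables_from_pdf` and the `reader` parameter are hypothetical, introduced only for illustration, and are not part of Camelot or rpaframework.

```python
from typing import Callable, List, Optional

Rows = List[List[str]]


def get_tables_from_pdf(
    path: str,
    pages: str = "1",
    reader: Optional[Callable] = None,
) -> List[Rows]:
    """Return every table found in the PDF as a list of rows of cell values.

    `reader` is a hypothetical injection point (an assumption of this
    sketch); when omitted it falls back to `camelot.read_pdf`.
    """
    if reader is None:
        # Imported lazily so the heavy dependencies (cv2, numpy, pandas)
        # are only loaded when table extraction is actually requested.
        import camelot

        reader = camelot.read_pdf

    tables = reader(path, pages=pages)
    # Each Camelot table exposes its content as a pandas DataFrame in `.df`;
    # convert it to plain nested lists so callers don't depend on pandas.
    return [table.df.values.tolist() for table in tables]
```

Keeping the Camelot import inside the function means the extra dependencies stay optional, which fits the decision to ship this as an extension rather than inside the core library.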
~As an additional improvement (& alternative), we can add OCR support to `Get Text From PDF` and if `ocr=${True}` is passed, then a screenshot of the PDF is captured (or simply converted to an image), then with the help of our `rpaframework-recognition` we'll extract the text so the final output will be in the same format as seen.~

Slack thread