lebebr01 / pdfsearch

Search pdf files for keywords.

enhancing this package with "OCR" and "translation" #21

Open behrica opened 3 years ago

behrica commented 3 years ago

We have experimented with two commercial Azure APIs to OCR scanned PDFs and to translate them into English when they are in another language. In my opinion they work "well enough" for doing keyword search in the results.

I will try to integrate this functionality into this package for our purposes and work on it in this fork: https://github.com/openefsa/pdfsearch

I could potentially contribute this as open source here, if you are interested. Of course, a potential user would need to bring their own Azure API token in order to use the functionality.

In case you do not want to couple the package too closely to a commercial provider such as Azure, it might be useful to have two extension points in this package, which allow to "plugin"

lebebr01 commented 3 years ago

Thanks for looking at this. I think if integrated I'd prefer the latter approach to be more agnostic. For example, could use azure or tesseract OCR (which has been open-sourced). I have not followed as closely with potential translations that are open source, but if there are options, being flexible would be useful I think.

If you want to submit a PR that implements the azure piece that would be incredibly helpful. I could add the tesseract approach and generalize the implementation to use whichever the user wishes.

behrica commented 3 years ago

I have a working implementation.

There is one piece of code which could be made agnostic, having these concepts:

For the Azure APIs (two: one for OCR, one for translation), I need to pass in credentials in some form.

I have the credentials "hardcoded" as function parameters, but we should do this differently.
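
One common alternative to hardcoding is to read the keys from environment variables. The following is only a minimal sketch of that idea: `get_credential` and the variable names `AZURE_OCR_KEY` and `AZURE_TRANSLATE_KEY` are hypothetical and not something the fork defines.

```r
# Read a secret from an environment variable and fail early if it is missing.
# The helper name and the variable names below are assumptions for illustration.
get_credential <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop(sprintf("Environment variable '%s' is not set", name), call. = FALSE)
  }
  value
}

# Hypothetical usage:
# ocr_key       <- get_credential("AZURE_OCR_KEY")
# translate_key <- get_credential("AZURE_TRANSLATE_KEY")
```

With this approach the keys never appear in function calls or in the package source; each user sets them once, for example in their `.Renviron` file.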

As my implementation calls slow / expensive APIs, I also implemented caching via memoization (but this is an implementation detail of the Azure backend).
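
To illustrate what caching via memoization can look like, here is a small base-R sketch. It is not the code from the fork: `memoize_by_key` is a hypothetical helper that assumes the wrapped function takes a single non-empty string (such as a file path) as its argument. The `memoise` package on CRAN offers a more general version of the same idea.

```r
# Wrap a one-argument function so that repeated calls with the same key
# reuse the stored result instead of calling the slow function again.
# Assumption: the argument is a single, non-empty string such as a file path.
memoize_by_key <- function(fn) {
  cache <- new.env(parent = emptyenv())
  function(key) {
    stopifnot(is.character(key), length(key) == 1, nzchar(key))
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, fn(key), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# Hypothetical usage with a stand-in for an expensive OCR call:
# cached_ocr <- memoize_by_key(function(path) slow_ocr(path))
```

This cache lives only for the current R session; persisting results to disk would be a separate design decision.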

behrica commented 3 years ago

I am not an expert in R. Is there a standard concept of "extension points" in R? Is it just a matter of passing a function into another function?
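
For reference, functions are ordinary values in R, so passing one as an argument with a sensible default is a common way to get this kind of extension point. A minimal sketch, where `search_text` and `translate_fn` are hypothetical names and not part of pdfsearch:

```r
# The caller can swap in any function that maps text to text.
# The default, identity(), leaves the text unchanged.
search_text <- function(text, keyword, translate_fn = identity) {
  translated <- translate_fn(text)
  grepl(keyword, translated, fixed = TRUE)
}

# Default behaviour: no translation step.
search_text("Hallo Welt", "world")

# Plugging in a (toy) translator supplied by the caller.
toy_translate <- function(x) sub("Hallo Welt", "hello world", x, fixed = TRUE)
search_text("Hallo Welt", "world", translate_fn = toy_translate)
```

The package then only depends on the shape of the function (text in, text out), not on any particular provider.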

behrica commented 3 years ago

I am now "ready" for our internal usage. My colleagues (non-technicians) now have a very easy way to search in PDFs, regardless of whether they have extractable text, are scanned, and/or are non-English.

I changed the existing code of keyword_search slightly, in three directions:

  1. an extension point to plug in an "OCR" function
  2. an extension point to plug in a "translate" function
  3. some simple logic to decide whether 1) and 2) should be called, depending on:
    • whether pdf_text returns "empty" text (< 100 characters)
    • a language detector (franc package), which decides whether the current text is already in the target language
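
The decision logic in the list above could be sketched roughly as follows. This is not the code from the fork: `prepare_text` and all of its arguments are hypothetical, the 100-character threshold is assumed to apply to the total over all pages, and the language code "eng" assumes a detector that returns ISO 639-3 codes.

```r
# All *_fn arguments are extension points supplied by the caller.
prepare_text <- function(path, extract_fn, ocr_fn, translate_fn, detect_lang_fn,
                         target_lang = "eng", min_chars = 100) {
  text <- extract_fn(path)  # e.g. pdftools::pdf_text, one string per page

  # Treat near-empty extraction as a scanned PDF and fall back to OCR.
  if (sum(nchar(text)) < min_chars) {
    text <- ocr_fn(path)
  }

  # Translate only when the detected language differs from the target.
  lang <- detect_lang_fn(paste(text, collapse = " "))
  if (!identical(lang, target_lang)) {
    text <- translate_fn(text)
  }
  text
}
```

Because the OCR and translation steps are plain function arguments, an Azure-based or a tesseract-based implementation can be passed in without changing this logic.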

The code now also has an "Azure-based" implementation of the two extension points. This is "quick and dirty", but very useful for us. Its "biggest task" is to chunk the text into pieces small enough for the Azure API to accept. The OCR API is a push-task-and-poll-status type of API, so I also implemented the "waiting for a result".
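
As a rough illustration of the chunking and polling mentioned above (not the fork's actual code), the two pieces might look like this in base R. The helper names, the `max_chars` default, and the status strings "succeeded" and "failed" are all assumptions; the real limits and status values depend on the API being called.

```r
# Split text on whitespace into chunks of at most max_chars characters.
# A single word longer than max_chars is kept as its own oversized chunk.
chunk_text <- function(text, max_chars = 5000) {
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]
  chunks <- character(0)
  current <- ""
  for (w in words) {
    candidate <- if (nzchar(current)) paste(current, w) else w
    if (nchar(candidate) > max_chars && nzchar(current)) {
      chunks <- c(chunks, current)  # close the current chunk
      current <- w                  # start a new one with this word
    } else {
      current <- candidate
    }
  }
  if (nzchar(current)) chunks <- c(chunks, current)
  chunks
}

# Call check_fn() repeatedly until it reports a final status.
# check_fn is expected to return a list with a `status` element.
poll_until_done <- function(check_fn, interval = 1, max_tries = 30) {
  for (i in seq_len(max_tries)) {
    res <- check_fn()
    if (identical(res$status, "succeeded")) return(res)
    if (identical(res$status, "failed")) stop("Job reported failure", call. = FALSE)
    Sys.sleep(interval)
  }
  stop("Timed out waiting for the job to finish", call. = FALSE)
}
```

Bounding the number of polling attempts keeps a stuck job from blocking a search indefinitely.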

I would be happy to collaborate with you further to move this into the upstream version of the package.

behrica commented 3 years ago

@lebebr01 please let me know if you want me to do anything on #23

lebebr01 commented 3 years ago

Thanks, @behrica. I'll take a look more closely soon. Likely won't be for at least a week or so, I need to get through the end of the semester here first.