rufuspollock / ideas

Ideas for (tech) stuff to research, build or work on.
https://rufuspollock.com/

Opensource OCR Service (PDF / TIFF / Scan to Text Conversion Service) #88

Closed rufuspollock closed 8 years ago

rufuspollock commented 10 years ago

Originally: http://ideas.okfn.org/ideas/106/pdf-tiff-scan-to-text-conversion-service/

Note: for generic PDF to text (including but not necessarily OCR) - see #52 (simple pdf to text service)

Quote from Tim:

Last weekend, I created an OCR pipeline with OCRopus, Tesseract & Celery/RabbitMQ. I need to do a little bit of work to make it available as a web service.

OCRopus does layout analysis, splitting the image into lines/words. These split files are then sent to Tesseract for OCR and reassembled to create hOCR output. Celery is used for ad-hoc clustering, making it trivial to add more processing capacity.
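A minimal sketch of the "OCR each line, then reassemble as hOCR" step described in the quote, assuming the layout analysis has already produced bounding boxes and cropped line images. The names `build_hocr` and `ocr_line` are hypothetical and are not taken from Tim's code; in a Celery setup each `ocr_line` call would presumably run as a task on a worker rather than inline.

```python
from html import escape
from typing import Callable, Iterable, Tuple

# One segmented line: (x0, y0, x1, y1) bounding box in pixels, plus the cropped image bytes.
Line = Tuple[Tuple[int, int, int, int], bytes]


def build_hocr(lines: Iterable[Line], ocr_line: Callable[[bytes], str]) -> str:
    """Run OCR on each segmented line and reassemble the results as one hOCR page."""
    spans = []
    for index, (bbox, image) in enumerate(lines, start=1):
        text = ocr_line(image)  # hypothetical stand-in for the Tesseract call
        x0, y0, x1, y1 = bbox
        spans.append(
            f'<span class="ocr_line" id="line_{index}" '
            f'title="bbox {x0} {y0} {x1} {y1}">{escape(text)}</span>'
        )
    body = "\n".join(spans)
    return f'<div class="ocr_page" id="page_1">\n{body}\n</div>'
```

Keeping the OCR call behind a plain callable means the reassembly logic can be tried out with a fake recogniser before any OCR engine is installed.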

pudo commented 10 years ago

Other libs:

PDF2 Text

mattfullerton commented 10 years ago

I'm going to start building this next week. This is my attempt to pull together the various suggestions/ideas so far.

The question is what to start with, as there is so much out there. But as my primary interest is setting something up that can extract text from as many formats as possible, and also be accessible from multiple projects, it seems like wrapping textract in a web service is the best place to start. And it seems that nobody has done this yet. It might be good to start with celery from the outset so we can build capacity later given that we will want parallel jobs in any case.

@pudo I'll try to bear in mind the philosophy in centipede and flesh out the API as I go. @pudo There are thoughts on integrating Tika into textract as an additional processing method (https://github.com/deanmalmgren/textract/issues/12)

@cleder If detecting no text and handing off the images to tesseract is not handled well by textract or tricky to implement, the code from the Plone work might be really helpful.

I think at a later stage we can add OCRopus as per @timClicks work to improve quality. @timClicks if you can contribute any of your code, that would be great.

Textract/Language choice

There is a textract for Python, which @pudo mentions. I like Python. Although I've worked with Flask, I do not have a lot of experience building web services with Python. There seems to be good support for using Flask or Django with celery, I'm sure the same goes for Pyramid. There is also a separate but identically named node.js module: https://github.com/dbashford/textract. I have worked with nodejs/express before and was tempted to fork webshot (https://github.com/okfn/webshot) as a starting point which is also nodejs (but see stuff below: I'm tempted to go with Python if webshot is the wrong starting point). Any strong opinions either way? The list of supported formats (https://github.com/dbashford/textract#currently-extracts vs. http://textract.readthedocs.org/en/latest/#currently-supporting) is similar (the big difference being text from sound formats, but do we need that!?). Neither is handling OCR for PDFs, but both offer it for images, so we may have to tweak that part for the case that pdf conversion returns no intelligible text, or offer it as an option (see comment above).

Slow REST

That being said about webshot, we are probably going to need something that returns a reference to a job that can be queried giving the status of the conversion, and potentially (embracing @pudo's efforts to frame services as part of a pipeline) multiple job stages (@pudo, correct me if I'm misunderstanding), so it may not be the best starting point. Do we know of any nice Python or node projects that implement such an asynchronous API that could be used as a starting point? I've seen some nice references on patterns in general but haven't looked for an example project yet.
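To make the job-reference idea concrete, here is a rough sketch of the pattern using only the Python standard library. It is not an existing project: `JobStore`, `submit` and `status` are made-up names, and a thread pool stands in for a real queue such as celery. A web layer would return the job id from the upload request and expose `status` on a second endpoint.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class JobStore:
    """In-memory job registry: submit work, get an id back, poll for status."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, convert: Callable[[bytes], str], document: bytes) -> str:
        """Queue a conversion and return a reference the client can query later."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(convert, document)
        return job_id

    def status(self, job_id: str) -> dict:
        """Report the state of a job without blocking on it."""
        future = self._jobs.get(job_id)
        if future is None:
            return {"state": "unknown"}
        if not future.done():
            return {"state": "pending"}
        error = future.exception()
        if error is not None:
            return {"state": "failed", "error": str(error)}
        return {"state": "done", "text": future.result()}
```

Multiple job stages could be modelled by having `status` report which stage is currently running, but that is left out here.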

Security

Is our primary aim to produce a stack that people have to install themselves and can secure whatever way they like (seems almost a pity when we go to the effort to create a web service) or a publicly accessible service (like webshot)? The latter will require some request limiting and maybe the distribution of API keys, given how resource-intensive the operations could be.
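For the public-service option, request limiting per API key could look roughly like the token bucket below. This is only a sketch under assumptions: `KeyRateLimiter` is a hypothetical name, the limits are arbitrary, and state is kept in memory, so it would not be shared across several server processes.

```python
import time
from typing import Callable, Dict, Tuple


class KeyRateLimiter:
    """Token bucket per API key: up to `capacity` requests, refilled at `rate` per second."""

    def __init__(
        self,
        capacity: int,
        rate: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.rate = rate
        self._clock = clock
        # api_key -> (tokens left, time of last request)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, api_key: str) -> bool:
        """Return True and consume a token if this key may make a request now."""
        now = self._clock()
        tokens, last = self._buckets.get(api_key, (float(self.capacity), now))
        # Refill in proportion to the time elapsed, never above capacity.
        tokens = min(float(self.capacity), tokens + (now - last) * self.rate)
        if tokens < 1:
            self._buckets[api_key] = (tokens, now)
            return False
        self._buckets[api_key] = (tokens - 1, now)
        return True
```

The injectable `clock` is there so the behaviour can be checked without waiting in real time.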

mattfullerton commented 10 years ago

OK, there was one very important suggestion I missed, that we just set up an instance of the Data Science Toolkit (https://github.com/petewarden/dstk)

It is Ruby, and doesn't support the wealth of formats that textract does. Scalability would be done by load balancing to multiple instances.

Specifically it's the 'file2text' API that would interest us, handled here: https://github.com/petewarden/dstk/blob/595e4b51261db715af4e71a5be0f37e0ecd75ab6/dstk_server.rb#L1114

rufuspollock commented 10 years ago

@mattfullerton i think nodejs has some nice benefits (e.g. the async setup means when you deploy on e.g. heroku you can serve many clients at once - one request won't block the system) which could be esp relevant here.

However, my guess is that this will be driven by the libraries available. Given that you have got textract in node (though it is just a wrapper on command line utilities) it may be worth going with that - though note we'll have to deploy on a "proper" machine not heroku if we want all those utilities (but we can use labs machines).

Lastly: am I right that textract python and textract node are pretty much identical in functionality? (My guess is that python one may be slightly stabler and better (??))

I have to say webshot might be quite nice as inspiration for the UI and API.

Last 2c: is it worth trying to write out some explicit user stories - even if obvious. I always find this invaluable :-)

pudo commented 10 years ago

Just to be clear: we're talking about the OCR bit only, right? It'd be very cool to use this to work out bits of the API for the centipede API (ping @malev). One thing in particular is this: is it more useful to hand around the actual documents, or a link to the documents on an S3 bucket? Obviously pushing around the documents is simpler, but using references makes it more lightweight.

I'm very interested to see good implementation of slow REST (e.g. job references) vs. long waits (e.g. node running stuff synchronously and letting you wait on the line) - both have merits, I'd like to know which one is nicer in practice :)

mattfullerton commented 10 years ago

We're talking about file to text, including OCR if necessary, and not necessarily about general pipeline/document management.

I had already come to the conclusion that I want to try this with Python/Flask+Celery/Redis when I saw you (@pudo) have already started with that combination for centipede. I've forked that repo to get started and will try and build on the existing ideas for the API to create something useful also for other document operations.

I guess both slow-REST and long waits inflict some pain on the 'user' (client side developer). But if we're going with Python and wouldn't have been using Heroku for node anyway, I think we should try slow-REST first.

chrismattmann commented 9 years ago

Hi Guys, just FYI on this. Apache Tika provides a wrapped version of Tesseract, as a web service. See: http://wiki.apache.org/tika/TikaOCR

rufuspollock commented 9 years ago

@mattfullerton any updates here from your end?

mattfullerton commented 9 years ago

I was working on extending https://github.com/OpenNewsLabs/centipede, but ran out of time due to other project priorities. I like the pipeline/task concept.

But I think it would be easier if I just set up an instance of tika-server for us to test. Ping me again in a week if I haven't done that yet. It looks great: http://wiki.apache.org/tika/TikaJAXRS#Tika_Resource (link doesn't work for me) http://webcache.googleusercontent.com/search?q=cache:MC8ekfYmifcJ:wiki.apache.org/tika/TikaJAXRS+&cd=1&hl=en&ct=clnk&gl=de

mattfullerton commented 9 years ago

I have a working version of Tika dev (1.8) with tesseract here: http://beta.offenedaten.de:9998/tika

Test by doing things like:

curl -T multipage_tiff_example.tif http://beta.offenedaten.de:9998/tika

Fuller instructions here: https://wiki.apache.org/tika/TikaOCR
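For anyone who prefers Python to curl, the same upload can be sketched with the standard library. `curl -T` sends the file as the body of a PUT request, which is what the code below builds. The function names are made up for illustration, and the `Accept: text/plain` header is an assumption about how to ask the server for plain text output.

```python
import urllib.request

# Assumption: a Tika server is listening here, as in the curl example above.
TIKA_URL = "http://beta.offenedaten.de:9998/tika"


def build_tika_request(document: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build the PUT request that `curl -T` would send; nothing is sent yet."""
    return urllib.request.Request(
        url,
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},  # assumed: ask for plain text back
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Upload a local file to the Tika server and return the response body as text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), url)
    # OCR on multi-page scans can be slow, so allow a generous timeout.
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `extract_text("multipage_tiff_example.tif")` would then be the counterpart of the curl command, provided the server is reachable.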

You can run your own using Docker by doing:

sudo docker build -t tika github.com/mattfullerton/tika-tesseract-docker
sudo docker run -d -p 9998:9998 tika

I'm very open to improvements to the Docker build files, I am no expert there.

What is lacking now (AFAIK) is detection that standard text extraction from a PDF 'failed' with a fallback to tesseract. We should look into that.
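One client-side way to approximate that detection is a heuristic on the extracted text: if almost nothing comes back, or most tokens contain no letters, hand the file to OCR instead. The sketch below is only a guess at such a check. The function name and both thresholds are invented for illustration and are not part of Tika or tesseract.

```python
def looks_like_failed_extraction(
    text: str,
    pages: int = 1,
    min_chars_per_page: int = 50,
    min_word_ratio: float = 0.6,
) -> bool:
    """Guess whether plain text extraction from a PDF came back empty or garbled."""
    stripped = text.strip()
    if len(stripped) < min_chars_per_page * max(pages, 1):
        return True  # (almost) nothing extracted: probably a scanned document
    tokens = stripped.split()
    if not tokens:
        return True
    # Count tokens that contain at least one letter; junk output is often mostly symbols.
    wordlike = sum(1 for token in tokens if any(ch.isalpha() for ch in token))
    return wordlike / len(tokens) < min_word_ratio
```

A caller would run normal extraction first and retry with OCR only when this returns True. The thresholds would need tuning against real documents.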

chrismattmann commented 9 years ago

Hey @mattfullerton good work - we're still working through MultiCompositeParsers in Tika (having multiple for a single type instead of our AutoDetect algorithm which picks the best one). We did a work around in Tika 1.7 and 1.8-dev (so far) to combine the ImageExtractor for metadata and then call Tesseract on images. However, for PDF if you want Tesseract to be called, you can always override the declared Mime types for the parser and/or sub-class it and rebuild Tika to get it to work on PDFs.

mattfullerton commented 9 years ago

@chrismattmann Thanks for the tips. Concretely, does that mean that with some passed config there will be support for using tesseract on PDFs instead of the default PDF parser (i.e. client detects if OCR is needed)? Or do you intend to go further and detect the lack of text in the PDF internally (i.e. server detects if OCR is needed)?

rufuspollock commented 9 years ago

@mattfullerton just want to say this really excellent - and do ping the labs list to let them know of your progress (and would you like to do a quick blog post?)

pudo commented 9 years ago

Hi all, just wanted to share a quick update on the document processing pipeline I've been working on, which consists of docpipe (a document processing tool with configurable pipelines) and barn (an OFS knock-off with a slightly more comprehensive API, also used in the openspending S3 data storage branch).

I've invested quite a lot of time into both recently, making sure barn runs against S3 which should be good in terms of the original centipede idea of having pipeline components run on different hosts but access the same virtual data store. At the same time, I've hacked up docpipe to have full support for textract (which does roughly the same thing as Tika, in Python).

All of this is the backend to an app called aleph which I'm using to allow journos to search and tag documents. The whole pipeline is a bit slow, but getting there.

Would be cool to see if there are any docking points?

mattfullerton commented 9 years ago

@rgrp I made a post to the list at the time: https://lists.okfn.org/pipermail/okfn-labs/2015-January/001548.html I'll work on a blog post

@pudo That's great that things are moving forward with the pipeline approach and that it includes textract. Am I right that what is still missing is the web api? Or maybe I missed it.

mattfullerton commented 9 years ago

Blog post: http://okfnlabs.org/blog/2015/02/21/documents-to-text.html

rufuspollock commented 9 years ago

This is fantastic @mattfullerton - will tweet out more monday!

I'd also like to offer a nice url for the service e.g. tika.okfnlabs.org (if we can think of an even cooler subdomain let me know!). This would not require a move of server - just configuring apache/nginx at your end and setting up DNS for the subdomain with open knowledge sysadmins.

wdyt?

todrobbins commented 9 years ago

Does Tika offer JP2 support? Just curious about other archival image types.

mattfullerton commented 9 years ago

@rgrp Good idea, as long as @ddie has no objections. The alternative of course is to try out the docker image on a labs machine. ATM there is nothing to set up at our end (although nginx would let us drop the port). tika.okfnlabs is fairly clear, but we could also go for something fun like text. or givemetext. or x2text...

@todrobbins Yes: http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml (search for jpeg or jp2)

rufuspollock commented 9 years ago

givemetext.okfnlabs.org sounds great. a docker image also sounds really great - but requires other work so let's start with dns.

rufuspollock commented 9 years ago

@mattfullerton shall we ask @nigelbabu to set up the dns for this and you put in the ServerName alias in Apache/Nginx? I'm sure @ddie has no objections re the domain name.

nigelbabu commented 9 years ago

This has already been requested and setup.

rufuspollock commented 9 years ago

@nigelbabu awesome! @mattfullerton you need to set the server alias - http://givemetext.okfnlabs.org/ is still offenedaten ;-)

mattfullerton commented 9 years ago

There's not a whole lot I can do about that given that there is no web front end (yet). On the Tika port it works.

rufuspollock commented 9 years ago

@mattfullerton not sure I understand. You just need to do a reverse proxy from givemetext.okfnlabs.org on port 80 to your thing running on whatever port you have. Is the site using nginx or apache as main webserver? if you let us know we can help.

mattfullerton commented 9 years ago

I think it's nginx - I was just doubting the logic of putting the thing on port 80. Right now these are the two possibilities for showing people when they arrive at givemetext.okfnlabs.org: http://beta.offenedaten.de:9998/ http://beta.offenedaten.de:9998/tika

I haven't promoted the service anywhere as anything other than a listener on that port for PUT requests. And if someone is using it in that way it's probably irrelevant what port is in use. If you think it adds value I can add the proxy, but I would rather serve up a simple one-page app on port 80 that allows uploading a file to the service and shows the returned text.

rufuspollock commented 9 years ago

Serving it on port 80 is great if you are happy to do that - makes life easier. I'd also serve /tika at base location if that's where the action is.

mattfullerton commented 9 years ago

That would be logical, just a pity the text there is so boring at present ;-)

mattfullerton commented 9 years ago

OK, Done

rufuspollock commented 9 years ago

@mattfullerton any further thoughts about the nicer front page?

mattfullerton commented 9 years ago

I started building a little Angular App to do the upload and show the result a while ago, and then I got busy :) Will get back to it soon...

chrismattmann commented 9 years ago

FYI @tpalsulich built a Tika REST upload page

tbpalsulich commented 9 years ago

See http://tpalsulich.github.io/TikaExamples/. You can upload a file and see what text Tika pulls out.

rufuspollock commented 9 years ago

@chrismattmann / @tpalsulich that's great - @mattfullerton has already built http://givemetext.okfnlabs.org/ (see above part of the thread). Perhaps we can join forces?

mattfullerton commented 9 years ago

I will use this instead. Have to update our tika instance so that it supports POSTing instead of PUTing.

@tpalsulich - Does your instance also include tesseract/ocr, and are you looking for traffic? We could include it as a backup server.

mattfullerton commented 9 years ago

Done. The Docker image is now on Tika 1.9, and I added CORS for the service as well so that other web apps can use it.

http://givemetext.okfnlabs.org/ - Web UI, proxied from http://mattfullerton.github.io/TikaExamples/

http://givemetext.okfnlabs.org/tika - proxied from http://givemetext.okfnlabs.org:9998/tika (different from before, where / was proxied to this)

http://givemetext.okfnlabs.org:9998 is open as before but without CORS header

Some kind of friendly instructions on the page like on http://webshot.okfnlabs.org/ would be nice to have.

rufuspollock commented 9 years ago

awesome - and we are about to have a standard labs bootstrap theme we can apply to make it look swish ;-)

mattfullerton commented 9 years ago

There's a typo in there somewhere. We do have one or we don't?

rufuspollock commented 9 years ago

fixed - we are about to have one (literally a couple of days).

tbpalsulich commented 9 years ago

@mattfullerton, no, I don't think it has Tesseract installed (just tried parsing the Google logo -- nothin').

No, I'm not looking for a lot of traffic. I intended the site as more of a quick demonstration of what Tika can do. I'm happy you guys found it useful!

See https://issues.apache.org/jira/browse/TIKA-1585 for a little more detail. The Tika server is running on a server donated by Rackspace. We use it for testing Tika against large corpuses. So, I don't want to overload it with requests.

chrismattmann commented 9 years ago

@tpalsulich we could probably contact Rackspace and ask them what they think about the traffic, etc. @rgrp @mattfullerton would be happy to join forces! :) FYI too I just completed http://github.com/chrismattmann/tika-python/ which entirely relies now on the REST server and exposes Translation, Language Detection and the full suite of things to make it really usable entirely as a Python library to Tika. So, great timing.

@tpalsulich if worst comes to worst, can't they just fork your code and run it on their OKFN servers?

mattfullerton commented 9 years ago

@chrismattmann We already had a Tika instance running (actually generously hosted by OKF Germany), just without an HTML upload button. @tpalsulich's frontend does that and I forked it yesterday: http://givemetext.okfnlabs.org/.

The Python stuff sounds amazing! If I ever get to the point of using the server for what I actually wanted it for (full text search for a CKAN instance), I might be able to make good use of it.

tbpalsulich commented 9 years ago

@mattfullerton, awesome! I'm happy you like it. But, I'm getting a DNS lookup error when loading http://www.givemetext.okfnlabs.org/.

mattfullerton commented 9 years ago

Oops! No www.

tbpalsulich commented 9 years ago

That was it. Working now. :+1:

chrismattmann commented 9 years ago

woot :+1: great work @mattfullerton @tpalsulich !

rufuspollock commented 9 years ago

@mattfullerton theme is now ready for you to try - see http://okfnlabs.org/app-theme/ and https://github.com/okfn/app-theme. This is generic and you can adapt as you want.

rufuspollock commented 9 years ago

@mattfullerton some quick thoughts / suggestions:

Website tweaks:

mattfullerton commented 9 years ago

@rgrp All good ideas, especially the examples bit I wanted to do. I'm not really finished with the theme either, that was done very quickly. Just a bit swamped at the moment.

Regarding the source repo, in case it's not clear - my only contribution here is creating a Dockerfile that gets the slightly complicated Tika, and specifically Tika-server, up and running with OCR support built in. Tika is an Apache project: https://tika.apache.org/ My small contribution is here: https://github.com/mattfullerton/tika-tesseract-docker

Hence getting support for file URLs within the API (if not already there, I'll have to look) would require modifying Tika itself. I'll look into it but the Tika developers would have the final say; the alternative of course would be our own little micro man-in-the-middle service to download the link in the background.

rufuspollock commented 9 years ago

@mattfullerton got you re time and thanks for flagging the repo - i can open issues there going forward - and people can contribute there (e.g. i assume the front page code is there).

Re the url point: do you have to modify Tika? I thought Tika could take a "stream" like object - and you can just open a url as a stream. Worst case you cache the URL contents to disk as file and then load.
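A rough sketch of the "cache the URL contents to disk and then load" fallback, done outside Tika in a small wrapper. `text_from_url` and its `extract` argument are hypothetical names: `extract` stands for whatever actually turns an open file into text, for example a function that uploads the stream to a Tika server.

```python
import shutil
import tempfile
import urllib.request
from typing import BinaryIO, Callable


def text_from_url(url: str, extract: Callable[[BinaryIO], str]) -> str:
    """Download `url` to a temporary file, then pass the open file to `extract`."""
    with urllib.request.urlopen(url, timeout=60) as remote, \
            tempfile.TemporaryFile() as cache:
        # Stream to disk in chunks rather than holding the whole document in memory.
        shutil.copyfileobj(remote, cache)
        cache.seek(0)
        return extract(cache)
```

A public service doing this would also want a size limit and some restriction on which URLs it is willing to fetch. Both are left out of the sketch.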