Closed rufuspollock closed 8 years ago
Other libs:
PDF2 Text
I'm going to start building this next week. This is my attempt to pull together the various suggestions/ideas so far.
The question is what to start with, as there is so much out there. But as my primary interest is setting something up that can extract text from as many formats as possible, and is also accessible from multiple projects, wrapping textract in a web service seems like the best place to start. It also seems that nobody has done this yet. It might be good to start with celery from the outset so we can build capacity later, given that we will want parallel jobs in any case.
@pudo I'll try to bear in mind the philosophy in centipede and flesh out the API as I go. @pudo There are thoughts on integrating Tika into textract as an additional processing method (https://github.com/deanmalmgren/textract/issues/12)
@cleder If detecting no text and handing off the images to tesseract is not handled well by textract or tricky to implement, the code from the Plone work might be really helpful.
I think at a later stage we can add OCRopus as per @timClicks's work to improve quality. @timClicks if you can contribute any of your code, that would be great.
There is a textract for Python, which @pudo mentions. I like Python. Although I've worked with Flask, I do not have a lot of experience building web services with Python. There seems to be good support for using Flask or Django with celery, I'm sure the same goes for Pyramid. There is also a separate but identically named node.js module: https://github.com/dbashford/textract. I have worked with nodejs/express before and was tempted to fork webshot (https://github.com/okfn/webshot) as a starting point which is also nodejs (but see stuff below: I'm tempted to go with Python if webshot is the wrong starting point). Any strong opinions either way? The list of supported formats (https://github.com/dbashford/textract#currently-extracts vs. http://textract.readthedocs.org/en/latest/#currently-supporting) is similar (the big difference being text from sound formats, but do we need that!?). Neither is handling OCR for PDFs, but both offer it for images, so we may have to tweak that part for the case that pdf conversion returns no intelligible text, or offer it as an option (see comment above).
That being said about webshot, we are probably going to need something that returns a reference to a job that can be queried giving the status of the conversion, and potentially (embracing @pudo's efforts to frame services as part of a pipeline) multiple job stages (@pudo, correct me if I'm misunderstanding), so it may not be the best starting point. Do we know of any nice Python or node projects that implement such an asynchronous API that could be used as a starting point? I've seen some nice references on patterns in general but haven't looked for an example project yet.
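To make the "job reference" idea concrete, here is a minimal sketch in plain Python, with no web framework and nothing taken from an existing project. The `JobStore` class and its method names are made up for illustration; a real service would put HTTP endpoints in front of something like this and most likely swap the thread pool for celery.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class JobStore:
    """Hypothetical in-memory registry: submit work, get an ID, poll for status."""

    def __init__(self, max_workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, convert: Callable[[bytes], str], payload: bytes) -> str:
        """Queue a conversion and return a job reference straight away."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(convert, payload)
        return job_id

    def status(self, job_id: str) -> dict:
        """Report the job state: unknown, pending, failed or done."""
        future = self._jobs.get(job_id)
        if future is None:
            return {"state": "unknown"}
        if not future.done():
            return {"state": "pending"}
        error = future.exception()
        if error is not None:
            return {"state": "failed", "error": str(error)}
        return {"state": "done", "text": future.result()}
```

The client keeps the returned ID and polls `status()` until the state is no longer `pending`. Multiple job stages could be modelled by storing one future per stage under the same ID.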
Is our primary aim to produce a stack that people have to install themselves and can secure whatever way they like (seems almost a pity when we go to the effort to create a web service) or a publicly accessible service (like webshot)? The latter will require some request limiting and maybe the distribution of API keys, given how resource-intensive the operations could be.
OK, there was one very important suggestion I missed, that we just set up an instance of the Data Science Toolkit (https://github.com/petewarden/dstk)
It is Ruby, and doesn't support the wealth of formats that textract does. Scalability would be done by load balancing to multiple instances.
Specifically it's the 'file2text' API that would interest us, handled here: https://github.com/petewarden/dstk/blob/595e4b51261db715af4e71a5be0f37e0ecd75ab6/dstk_server.rb#L1114
@mattfullerton i think nodejs has some nice benefits (e.g. the async setup means when you deploy on e.g. heroku you can serve many clients at once - one request won't block the system) which could be esp relevant here.
However, my guess is that this will be driven by the libraries available. Given that you have got textract in node (though it is just a wrapper on command line utilities) it may be worth going with that - though note we'll have to deploy on a "proper" machine not heroku if we want all those utilities (but we can use labs machines).
Lastly: am I right that textract python and textract node are pretty much identical in functionality? (My guess is that python one may be slightly stabler and better (??))
I have to say webshot might be quite nice as inspiration for both the UI and the API.
Last 2c: is it worth trying to write out some explicit user stories - even if obvious. I always find this invaluable :-)
Just to be clear: we're talking about the OCR bit only, right? It'd be very cool to use this to work out bits of the API for the centipede API (ping @malev). One thing in particular is this: is it more useful to hand around the actual documents, or a link to the documents on an S3 bucket? Obviously pushing around the documents is simpler, but using references makes it more lightweight.
I'm very interested to see good implementation of slow REST (e.g. job references) vs. long waits (e.g. node running stuff synchronously and letting you wait on the line) - both have merits, I'd like to know which one is nicer in practice :)
We're talking about file to text, including OCR if necessary, and not necessarily about general pipeline/document management.
I had already come to the conclusion that I want to try this with Python/Flask+Celery/Redis when I saw you (@pudo) have already started with that combination for centipede. I've forked that repo to get started and will try and build on the existing ideas for the API to create something useful also for other document operations.
I guess both slow-REST and long waits inflict some pain on the 'user' (client side developer). But if we're going with Python and wouldn't have been using Heroku for node anyway, I think we should try slow-REST first.
Hi Guys, just FYI on this. Apache Tika provides a wrapped version of Tesseract, as a web service. See: http://wiki.apache.org/tika/TikaOCR
@mattfullerton any updates here from your end?
I was working on extending https://github.com/OpenNewsLabs/centipede, but ran out of time due to other project priorities. I like the pipeline/task concept.
But I think it would be easier if I just set up an instance of tika-server for us to test. Ping me again in a week if I haven't done that yet. It looks great: http://wiki.apache.org/tika/TikaJAXRS#Tika_Resource (link doesn't work for me) http://webcache.googleusercontent.com/search?q=cache:MC8ekfYmifcJ:wiki.apache.org/tika/TikaJAXRS+&cd=1&hl=en&ct=clnk&gl=de
I have a working version of Tika dev (1.8) with tesseract here: http://beta.offenedaten.de:9998/tika
Test by doing things like:
curl -T multipage_tiff_example.tif http://beta.offenedaten.de:9998/tika
Fuller instructions here: https://wiki.apache.org/tika/TikaOCR
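For anyone who would rather call the service from Python than from curl, a rough equivalent using only the standard library might look like the sketch below. The function names are made up for illustration, and the `Accept: text/plain` header is an assumption on my part (it asks for plain text rather than the default output); drop it if your Tika build responds differently.

```python
import urllib.request

# Endpoint quoted in this thread; point it at your own instance if needed.
TIKA_URL = "http://beta.offenedaten.de:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request carrying the file bytes, much like `curl -T`."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},  # assumption: ask for plain text back
    )


def extract_text(path: str, url: str = TIKA_URL, timeout: float = 120.0) -> str:
    """Upload the file at `path` and return whatever text the server sends back."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would be something like `print(extract_text("multipage_tiff_example.tif"))`. OCR on a multi-page scan can take a while, hence the generous default timeout.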
You can run your own using Docker by doing:
sudo docker build -t tika github.com/mattfullerton/tika-tesseract-docker
sudo docker run -d -p 9998:9998 tika
I'm very open to improvements to the Docker build files, I am no expert there.
What is lacking now (AFAIK) is detection that standard text extraction from a PDF 'failed' with a fallback to tesseract. We should look into that.
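One way this fallback could be done on the client side is sketched below. This is only a heuristic and not something Tika provides: the helper names, the thresholds, and the two callables (`extract` for the normal parser, `ocr` for the tesseract route) are all hypothetical.

```python
from typing import Callable


def looks_empty(text: str, min_chars: int = 20, min_alnum_ratio: float = 0.5) -> bool:
    """Guess whether extraction 'failed': too little text, or mostly junk characters."""
    stripped = "".join(text.split())  # drop all whitespace, including form feeds
    if len(stripped) < min_chars:
        return True
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) < min_alnum_ratio


def extract_with_fallback(
    data: bytes,
    extract: Callable[[bytes], str],
    ocr: Callable[[bytes], str],
) -> str:
    """Try normal text extraction first; hand the document to OCR if it looks empty."""
    text = extract(data)
    if looks_empty(text):
        return ocr(data)
    return text
```

The thresholds are guesses and would need tuning against real scanned PDFs. A PDF that is legitimately almost empty would be sent to OCR unnecessarily, which is slow but harmless.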
Hey @mattfullerton good work - we're still working through MultiCompositeParsers in Tika (having multiple for a single type instead of our AutoDetect algorithm which picks the best one). We did a work around in Tika 1.7 and 1.8-dev (so far) to combine the ImageExtractor for metadata and then call Tesseract on images. However, for PDF if you want Tesseract to be called, you can always override the declared Mime types for the parser and/or sub-class it and rebuild Tika to get it to work on PDFs.
@chrismattmann Thanks for the tips. Concretely, does that mean that with some passed config there will be support for using tesseract on PDFs instead of the default PDF parser (i.e. client detects if OCR is needed)? Or do you intend to go further and detect the lack of text in the PDF internally (i.e. server detects if OCR is needed)?
@mattfullerton just want to say this is really excellent - and do ping the labs list to let them know of your progress (and would you like to do a quick blog post?)
Hi all, just wanted to share a quick update on the document processing pipeline I've been working on, which consists of docpipe (a document processing tool with configurable pipelines) and barn (an OFS knock-off with a slightly more comprehensive API, also used in the openspending S3 data storage branch).
I've invested quite a lot of time into both recently, making sure barn runs against S3 which should be good in terms of the original centipede idea of having pipeline components run on different hosts but access the same virtual data store. At the same time, I've hacked up docpipe to have full support for textract (which does roughly the same thing as Tika, in Python).
All of this is the backend to an app called aleph which I'm using to allow journos to search and tag documents. The whole pipeline is a bit slow, but getting there.
Would be cool to see if there are any docking points?
@rgrp I made a post to the list at the time: https://lists.okfn.org/pipermail/okfn-labs/2015-January/001548.html I'll work on a blog post
@pudo That's great that things are moving forward with the pipeline approach and that it includes textract. Am I right that what is still missing is the web api? Or maybe I missed it.
This is fantastic @mattfullerton - will tweet out more monday!
I'd also like to offer a nice url for the service e.g. tika.okfnlabs.org (if we can think of an even cooler subdomain let me know!). This would not require a move of server - just configuring apache/nginx at your end and setting up DNS for the subdomain with open knowledge sysadmins.
wdyt?
Does Tika offer JP2 support? Just curious about other archival image types.
@rgrp Good idea, as long as @ddie has no objections. The alternative of course is to try out the docker image on a labs machine. ATM there is nothing to set up at our end (although nginx would let us drop the port). tika.okfnlabs is fairly clear, but we could also go for something fun like text. or givemetext. or x2text...
@todrobbins Yes: http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml (search for jpeg or jp2)
givemetext.okfnlabs.org sounds great. a docker image also sounds really great - but requires other work so let's start with dns.
@mattfullerton shall we ask @nigelbabu to set up the dns for this and you put in the ServerName alias in Apache/Nginx? I'm sure @ddie has no objections re the domain name.
This has already been requested and setup.
@nigelbabu awesome! @mattfullerton you need to set the server alias - http://givemetext.okfnlabs.org/ is still offenedaten ;-)
There's not a whole lot I can do about that given that there is no web front end (yet). On the Tika port it works.
@mattfullerton not sure I understand. You just need to do a reverse proxy from givemetext.okfnlabs.org on port 80 to your thing running on whatever port you have. Is the site using nginx or apache as main webserver? if you let us know we can help.
I think it's nginx - I was just doubting the logic of putting the thing on port 80. Right now these are the two possibilities for showing people when they arrive at givemetext.okfnlabs.org: http://beta.offenedaten.de:9998/ http://beta.offenedaten.de:9998/tika
I haven't promoted the service anywhere as anything other than a listener on that port for PUT requests. And if someone is using it in that way it's probably irrelevant what port is in use. If you think it adds value I can add the proxy, but I would rather serve up a simple one-page app on port 80 that allows uploading a file to the service and displays the returned text.
Serving it on port 80 is great if you are happy to do that - makes life easier. I'd also serve /tika at base location if that's where the action is.
That would be logical, just a pity the text there is so boring at present ;-)
OK, Done
@mattfullerton any further thoughts about the nicer front page?
I started building a little Angular App to do the upload and show the result a while ago, and then I got busy :) Will get back to it soon...
FYI @tpalsulich built a Tika REST upload page
See http://tpalsulich.github.io/TikaExamples/. You can upload a file and see what text Tika pulls out.
@chrismattmann / @tpalsulich that's great - @mattfullerton has already built http://givemetext.okfnlabs.org/ (see above part of the thread). Perhaps we can join forces?
I will use this instead. Have to update our tika instance so that it supports POSTing instead of PUTing.
@tpalsulich - Does your instance also include tesseract/ocr, and are you looking for traffic? We could include it as a backup server.
Done. The Docker image is now on Tika 1.9, and I added CORS for the service as well so that other web apps can use it.
http://givemetext.okfnlabs.org/ - Web UI, proxied from http://mattfullerton.github.io/TikaExamples/
http://givemetext.okfnlabs.org/tika - proxied from http://givemetext.okfnlabs.org:9998/tika (different from before, where / was proxied to this)
http://givemetext.okfnlabs.org:9998 is open as before but without CORS header
Some kind of friendly instructions on the page like on http://webshot.okfnlabs.org/ would be nice to have.
awesome - and we are about to have a standard labs bootstrap theme we can apply to make it look swish ;-)
There's a typo in there somewhere. We do have one or we don't?
fixed - we are about to have one (literally a couple of days).
@mattfullerton, no, I don't think it has Tesseract installed (just tried parsing the Google logo -- nothin').
No, I'm not looking for a lot of traffic. I intended the site as more of a quick demonstration of what Tika can do. I'm happy you guys found it useful!
See https://issues.apache.org/jira/browse/TIKA-1585 for a little more detail. The Tika server is running on a server donated by Rackspace. We use it for testing Tika against large corpuses. So, I don't want to overload it with requests.
@tpalsulich we could probably contact Rackspace and ask them what they think about the traffic, etc. @rgrp @mattfullerton would be happy to join forces! :) FYI too I just completed http://github.com/chrismattmann/tika-python/ which entirely relies now on the REST server and exposes Translation, Language Detection and the full suite of things to make it really usable entirely as a Python library to Tika. So, great timing.
@tpalsulich if worst comes to worst, can't they just fork your code and run it on their OKFN servers?
@chrismattmann We already had a Tika instance running (actually generously hosted by OKF Germany), just without an HTML upload button. @tpalsulich's frontend does that and I forked it yesterday: http://givemetext.okfnlabs.org/.
The Python stuff sounds amazing! If I ever get to the point of using the server for what I actually wanted it for (full text search for a CKAN instance), I might be able to make good use of it.
@mattfullerton, awesome! I'm happy you like it. But, I'm getting a DNS lookup error when loading http://www.givemetext.okfnlabs.org/.
Oops! No www.
That was it. Working now. :+1:
woot :+1: great work @mattfullerton @tpalsulich !
@mattfullerton theme is now ready for you to try - see http://okfnlabs.org/app-theme/ and https://github.com/okfn/app-theme. This is generic and you can adapt as you want.
@mattfullerton some quick thoughts / suggestions:
Website tweaks:
@rgrp All good ideas, especially the examples bit I wanted to do. I'm not really finished with the theme either, that was done very quickly. Just a bit swamped at the moment.
Regarding the source repo, in case it's not clear - my only contribution here is creating a Dockerfile that gets the slightly complicated Tika, and specifically Tika-server, up and running with OCR support built in. Tika is an Apache project: https://tika.apache.org/ My small contribution is here: https://github.com/mattfullerton/tika-tesseract-docker
Hence getting support for file URLs within the API (if not already there, I'll have to look) would require modifying Tika itself. I'll look into it but the Tika developers would have the final say; the alternative of course would be our own little micro man-in-the-middle service to download the link in the background.
@mattfullerton got you re time and thanks for flagging the repo - i can open issues there going forward - and people can contribute there (e.g. i assume the front page code is there).
Re the url point: do you have to modify Tika? I thought Tika could take a "stream" like object - and you can just open a url as a stream. Worst case you cache the URL contents to disk as a file and then load.
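The "cache the URL contents to disk" option could look roughly like this on the client or man-in-the-middle side, without touching Tika at all. This is a standard-library sketch; `cache_url_to_disk` is a made-up name and the `.download` suffix is arbitrary.

```python
import shutil
import tempfile
import urllib.request


def cache_url_to_disk(url: str, timeout: float = 60.0) -> str:
    """Download `url` into a temporary file and return that file's path.

    The caller is responsible for deleting the file afterwards.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".download") as out:
            shutil.copyfileobj(response, out)  # stream to disk in chunks
            return out.name
```

The returned path can then be uploaded to the Tika endpoint like any local file. A public service doing this for arbitrary user-supplied links would presumably also want size limits and some restriction on which hosts it will fetch from.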
Originally: http://ideas.okfn.org/ideas/106/pdf-tiff-scan-to-text-conversion-service/
Note: for generic PDF to text (including but not necessarily OCR) - see #52 (simple pdf to text service)
Quote from Tim: