EN spelling dictionary indispensable?

eemantsal commented 7 years ago

Hi again.

When I start gImageReader I get this warning, and the language selector is greyed and disabled.

[Screenshot: captura_20170315_185634]

I have Hunspell only in my native language, and Tesseract correctly installed in my language too. Is it mandatory to install the American English version of Hunspell in order to be able to use gImageReader?

manisandro commented 7 years ago

Hi, you need to load an image before you can select the language. The problem is that the OCR control is a button with a menu, and the button needs to be disabled when no image is loaded, so the menu gets disabled as well as a side effect.

eemantsal commented 7 years ago

Oh, well, that's embarrassing. I should have tried that, heheh. Ok, now it works.

Let me ask something more: does the "+" shaped cursor, meant to indicate that one can draw square zones by hand, not work in PDF mode, or is it some error on my side? I get the + cursor, but clicking and dragging does nothing. Also, it seems it's not possible to indicate which zones are text and which ones are images. One has to save the text and then the images one by one, right?

Perhaps you might be interested in having a look at OCRFeeder: http://www.joaquimrocha.com/2014/12/22/ocrfeeder-0-8-1/ It has an ugly GTK interface and doesn't let the user rotate the image by hand; it can do it automatically, but with weird results (in my installation it crops the image, so part of the text disappears). I think gImageReader is better in general, but the other has a few nice features you could take into account for future improvements, if you consider them nice.

OCRFeeder has a nice pair of buttons to tell the program whether a zone is text and must be recognized, or an image and must just be saved as such in the document, which, BTW, can be an ODT document that keeps the layout very exactly and, I think, is very convenient for later editing. Besides, it uses Unpaper to, supposedly, improve the image sharpness a bit so that the recognition works better (or that's what they say, heheh).

Here's its GitHub: https://github.com/GNOME/ocrfeeder

I hope you find it interesting.

manisandro commented 7 years ago

So in hOCR mode recognition is always performed on the entire page. From a UI perspective it is easier to have the entire page recognized and then allow the user to throw away the parts they do not want from the recognized document tree than to have the user select single portions of the image and then attempt to reassemble these portions into a page. Which regions are text and which are images is detected automatically. If an image region is misdetected as text, you can remove the corresponding items in the document tree and add a manually defined image region. ODT is something I'd like to support in the future; however, the lack of a decent C++ library for writing the format is a hindrance.

eemantsal commented 7 years ago

That would be a perfect approach if things were perfect, heheh, but sadly they aren't, and automatic layout detection usually fails miserably in every single OCR program I have ever known, even the famous Omnipage for Windows, when images are B/W and/or graphics with lots of lines (photos and drawings are usually detected correctly, that's right). Try to OCR a page with text and diagrams (especially in black and white, as I said), or music: try to recognize a page with a score and the lyrics below. Automatic detection in cases like these is almost always a disaster. The diagrams or the musical symbols are interpreted as alphabetic characters, or some parts as text and some as images, and you end up losing more time repairing the mess than you would by defining the frames by hand, and how the OCR must treat them, from the beginning.

Anyway, I agree that when it works well an automatic process is handier, and it probably should be the default behavior; but I think there should be an option for the user to define zones by hand and to tell the program which ones are text and which are images. That's more or less the usual behavior in most OCR software, for example the aforementioned Omnipage, or the other open source app I mentioned yesterday, OCRFeeder.

> If an image region is misdetected as text, you can remove the corresponding items in the document tree and add a manually defined image region.

I didn't try that. I will. But anyway, wouldn't it be easier to have a selector to just change a correctly framed area that has been wrongly identified as text or image, instead of having to delete it and define it manually again? Something like those two buttons in OCRFeeder. Please look at the screenshots. Here you can see how the text column at the very left is identified as text, colored with a bluish shade: [Screenshot: captura_20170316_150759]

Notice the two buttons at the right, «Texto» ("Text") and «Imagen» ("Image") (this is the Spanish version of the interface; obviously, if you try it in another language the button labels will vary), and how I clicked on «Imagen» after selecting the first text column mentioned before. If you look at it, you can now see its greenish shade, which indicates it will be treated as an image and won't be "OCRed". The «Propiedades del texto» ("Text properties") area (lower right of the screenshot) becomes inactive too. [Screenshot: captura_20170316_150842]

I don't mean this interface should be copied; in fact, I think it wastes too much space on just a pair of buttons, and its design could be better. What I mean is the idea of some kind of selector that lets the user correct the automatic zone identification.

> ODT is something I'd like to support in the future, however the lack of a decent c++ library for writing the format is a hindrance.

I know nothing about coding, but couldn't you take the same library other similar programs use? I know I sound somewhat obsessive, heheh, but what library does OCRFeeder use? I suppose you could use it, no?

manisandro commented 7 years ago

Ok, so defining the regions beforehand and then recognizing the page should be doable. I'll need to look into it as time allows.

About the library: there is a good Python library, but gImageReader is C++ based, and I'd rather not have to depend on the entire Python interpreter just to be able to use this library.

eemantsal commented 7 years ago

> gImageReader is c++ based, and I'd rather not have to depend on the entire python interpreter just to be able to use this library

There's an open source ODT editor from an office suite for Linux called Calligra Words. I think it's written in C or C++, or something with C, xD. I don't know if it could be of any help. This is the technical part, for which I declare myself totally incompetent, sorry. But just in case, here's the code, if you want to have a look at it: https://community.kde.org/Calligra/Building/2#Latest_Stable_Version

Appendix

I found this on the Internet. Perhaps it could be useful: http://stackoverflow.com/questions/12349604/libraries-for-odt-formatting

Also, here (about the end of the first half of the page) they say: «QTextDocumentWriter class makes it possible to create OpenDocument Format (ODF) files from any Qt text document. This opens the door to automated document creation and distribution in a standards-compliant format that users can open in a wide variety of word processors.» Qt uses C++, no? And gImageReader uses Qt 5. Maybe this sheds some light: http://opendocsociety.org/tools/odf-tools

manisandro commented 7 years ago

Yes, I'm aware of Calligra, thanks. The problem is having to pick the code out of a large application vs. just having a handy library one can use in the program. As always, it's a question of time and effort.

eemantsal commented 7 years ago

I understand. And what about the "QTextDocumentWriter class"? Excuse me if my question is nonsense, it comes from the coding illiterate that I am, heheh: are classes and libraries at all similar in their functionality? I mean, even if they aren't the same thing, could a class like that one from Qt be easily used to provide this export to ODT feature?

manisandro commented 7 years ago

Problem is that QTextDocumentWriter is Qt, which isn't really usable for the Gtk variant of gImageReader (and I really want to keep the Qt and Gtk variants equivalent). Probably the simplest approach is for me to code a minimal ODT exporter which just supports the necessary parts of the ODT format. If you want to help, you could look up the ODT spec and collect the minimal set of XML tags (ODT is just an XML format) which allows for paragraphs, images, font size and family as well as bold/italic. If you can provide a sample XML which includes these elements, I would be able to do the actual coding fairly quickly.
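To make the "minimal set of XML tags" concrete, here is a rough C++ sketch that builds a `content.xml` with one styled paragraph and one image frame. It is only an illustration under stated assumptions, not gImageReader's exporter: `xmlEscape`, `TextRun`, `buildContentXml`, the style name `T1` and the frame name `img1` are hypothetical names made up for this example, and the element and attribute names should be checked against the OpenDocument spec before relying on them.

```cpp
#include <sstream>
#include <string>

// Escape the characters that are special in XML text and attribute values.
std::string xmlEscape(const std::string& in) {
    std::string out;
    for (char c : in) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"': out += "&quot;"; break;
            default:  out += c;
        }
    }
    return out;
}

// Hypothetical description of one formatted run of recognized text.
struct TextRun {
    std::string text;
    std::string fontFamily;
    double fontSizePt;
    bool bold;
    bool italic;
};

// Build a content.xml string with one styled paragraph and one image frame.
// imageHref is the path of the picture inside the ODT package (assumed here).
std::string buildContentXml(const TextRun& run, const std::string& imageHref,
                            double widthCm, double heightCm) {
    const std::string family = xmlEscape(run.fontFamily);
    std::ostringstream xml;
    xml << "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        << "<office:document-content"
        << " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
        << " xmlns:style=\"urn:oasis:names:tc:opendocument:xmlns:style:1.0\""
        << " xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\""
        << " xmlns:draw=\"urn:oasis:names:tc:opendocument:xmlns:drawing:1.0\""
        << " xmlns:fo=\"urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0\""
        << " xmlns:svg=\"urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0\""
        << " xmlns:xlink=\"http://www.w3.org/1999/xlink\""
        << " office:version=\"1.2\">\n"
        // Font declaration: family name.
        << " <office:font-face-decls>\n"
        << "  <style:font-face style:name=\"" << family << "\""
        << " svg:font-family=\"'" << family << "'\"/>\n"
        << " </office:font-face-decls>\n"
        // Character style: font, size, bold/italic.
        << " <office:automatic-styles>\n"
        << "  <style:style style:name=\"T1\" style:family=\"text\">\n"
        << "   <style:text-properties style:font-name=\"" << family << "\""
        << " fo:font-size=\"" << run.fontSizePt << "pt\""
        << " fo:font-weight=\"" << (run.bold ? "bold" : "normal") << "\""
        << " fo:font-style=\"" << (run.italic ? "italic" : "normal") << "\"/>\n"
        << "  </style:style>\n"
        << " </office:automatic-styles>\n"
        << " <office:body>\n  <office:text>\n"
        // Paragraph containing one styled span.
        << "   <text:p><text:span text:style-name=\"T1\">"
        << xmlEscape(run.text) << "</text:span></text:p>\n"
        // Paragraph containing one image frame.
        << "   <text:p><draw:frame draw:name=\"img1\""
        << " text:anchor-type=\"paragraph\""
        << " svg:width=\"" << widthCm << "cm\""
        << " svg:height=\"" << heightCm << "cm\">"
        << "<draw:image xlink:href=\"" << xmlEscape(imageHref) << "\""
        << " xlink:type=\"simple\" xlink:show=\"embed\""
        << " xlink:actuate=\"onLoad\"/></draw:frame></text:p>\n"
        << "  </office:text>\n </office:body>\n"
        << "</office:document-content>\n";
    return xml.str();
}
```

Calling `buildContentXml({"Hello", "DejaVu Sans", 12.0, true, false}, "Pictures/page1.png", 5.0, 3.0)` returns the XML as a string. Note that an actual .odt file is a ZIP package, so besides `content.xml` it also needs at least a `mimetype` entry and a `META-INF/manifest.xml`; the sketch leaves those out.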

eemantsal commented 7 years ago

> QTextDocumentWriter is Qt which isn't really usable for the Gtk

Ah. I didn't know there was a Gtk version of gImageReader, sorry.

> you could look up the ODT spec and collect the minimal set of XML tags (ODT is just an XML format) which allows for paragraphs, images, font size and family as well as bold/italic. If you can provide a sample XML which includes these elements

About making that XML thing... I think that falls far beyond my limited technical skills. I've tried to do some searching, even though I didn't really understand what I was seeing, and I've found this, which seems to be a sort of list of tags; I don't know if it's what you were asking for (the page is in Spanish, but the tag list is in English): https://wiki.dolibarr.org/index.php/Crear_un_modelo_de_documento_ODT#Tags

Another source concerning tags I've read, Wikipedia, says:

> OpenDocument reuses existing open XML standards whenever they are available, and it creates new tags only where no existing standard can provide the needed functionality. Thus OpenDocument uses a subset of DublinCore for metadata, MathML for displayed formulas, SMIL for multimedia, XLink for hyperlinks etc.

https://en.wikipedia.org/wiki/OpenDocument_technical_specification#Reuse_of_existing_formats

Sorry for not being more helpful :-/

Edit: There's an English version of the first page I linked. I didn't see it before. Sorry: https://wiki.dolibarr.org/index.php/Create_an_ODT_document_template#Tags

Hope it helps in some way.

Edit 2: Ok, forget the Dolibarr links I posted. It seems that those tags aren't ODT/ODF tags but tags for that Dolibarr software. The page's first line says: «This page describe how to build an ODT document template to build documents using ODT generation.», so I thought it was referring to ODT tags. Sorry for the confusion.

manisandro commented 7 years ago

I've opened #171 and #172 to track the features we ended up discussing here. Closing this ticket.

manisandro / gImageReader

EN spelling dictionary indispensable? #165