ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

Manually correcting segmentation #323

Open nederhof opened 5 years ago

nederhof commented 5 years ago

I know that the files *.pseg.png store the coordinates of the automatic line segmentation. I have seen mention on the Web of the use of GIMP for manipulating these coordinates, but without further details on how this is done. When I open GIMP on these files, I see nothing that shows the segmentation, nor anything that I can edit manually.

For background: I am trying to do OCR for some documents that are in a poor state, with smudges and faded ink, and no matter how much image preprocessing I do, automatic segmentation fails on at least some parts of the page. I see manual adjustment as the only viable way forward. That is, I would like to manually remove lines, add new lines, change the positions of lines, and ideally also change the order of lines, before the actual OCR is done. Can I insert this manual correction into the usual OCRopus workflow?

Thanks in advance for your time.

Mark-Jan Nederhof

wrznr commented 5 years ago

@nederhof I don't think that this is possible, since ocropy works on images rather than metadata (i.e. files indicating the positions of the segments as coordinates). What you could do, however, is use a third-party tool like Transkribus or Aletheia to do the initial layout/line recognition. Both tools offer options for manual post-correction. When you're done, export the result, extract the lines from your original images, and run ocropy for text recognition.

Come to think of it, you could also use ocropus-hocr to create hOCR files from your initial recognition, import them into Aletheia, correct the segmentation, export, and rerun ocropy.
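The "extract the lines from your original images" step could look roughly like the sketch below. It assumes the corrected segmentation has already been reduced to a list of bounding boxes in reading order, and that the page is held as a list of pixel rows; `crop_lines` and the box format are hypothetical names chosen for illustration, not part of ocropy or of either tool's export format. The file names are also an assumption, modelled on the hexadecimal line names that ocropus-gpageseg appears to write, so adjust them to whatever your recognition step expects.

```python
def crop_lines(page, boxes):
    """Cut line images out of a page.

    page  -- list of rows, each row a list of pixel values
    boxes -- list of (x0, y0, x1, y1) in reading order; x1 and y1 are exclusive
    Returns a list of (file_name, cropped_rows) pairs in the same order.
    """
    height = len(page)
    width = len(page[0]) if page else 0
    lines = []
    for index, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
            raise ValueError("box %d lies outside the page" % index)
        crop = [row[x0:x1] for row in page[y0:y1]]
        # Assumed naming scheme: column 1, then the line number in hex.
        name = "01%04x.bin.png" % index
        lines.append((name, crop))
    return lines


# Tiny 4x3 "page" with two manually specified line boxes.
page = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]
lines = crop_lines(page, [(1, 0, 3, 1), (0, 1, 2, 3)])
```

Because the order of `boxes` determines the numbering, reordering lines manually amounts to reordering that list before cropping.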

zuphilip commented 5 years ago

The *.pseg.png files are normal PNG files in which the pixel colors encode the layout information; see https://github.com/tmbdev/ocropy/wiki/OCRopus-File-Formats#physical-layout for more information.
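To make the color encoding concrete, here is a minimal sketch of how such a file could be decoded. It assumes the layout described on the linked wiki page as I understand it: the red channel holds the column number, green and blue together hold the line number, and white is background. Please verify that against your own files. The function names are hypothetical, and `load_pixels` relies on Pillow, which is an extra dependency and not something ocropy requires for this.

```python
WHITE = (255, 255, 255)  # assumed background color


def decode_pixel(rgb):
    """Return (column, line) for a labelled pixel, or None for background."""
    r, g, b = rgb
    if (r, g, b) == WHITE:
        return None
    return r, (g << 8) | b


def encode_pixel(column, line):
    """Inverse of decode_pixel: pack a (column, line) label into an RGB triple."""
    if not (0 <= column < 256 and 0 <= line < 65536):
        raise ValueError("label out of range")
    return (column, line >> 8, line & 0xFF)


def line_boxes(pixels):
    """Map each (column, line) label to its bounding box (x0, y0, x1, y1), inclusive."""
    boxes = {}
    for y, row in enumerate(pixels):
        for x, rgb in enumerate(row):
            label = decode_pixel(rgb)
            if label is None:
                continue
            x0, y0, x1, y1 = boxes.get(label, (x, y, x, y))
            boxes[label] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes


def load_pixels(path):
    """Read a PNG into a list of rows of RGB tuples (needs Pillow)."""
    from PIL import Image  # imported lazily so the rest works without Pillow
    img = Image.open(path).convert("RGB")
    width, height = img.size
    data = list(img.getdata())
    return [data[y * width:(y + 1) * width] for y in range(height)]


# Synthetic 3x2 example: two lines in column 1.
pixels = [
    [WHITE, (1, 0, 1), (1, 0, 1)],
    [WHITE, WHITE, (1, 0, 2)],
]
boxes = line_boxes(pixels)
```

If this encoding holds, neighbouring lines differ only by one step in the blue channel, so they would all look like the same near-black color in an image editor. That may be why nothing seems visible or editable in GIMP.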

For some time now, you can also use masks to help the layout segmentation. I think this is not yet documented well, but you can have a look at the initial pull request.