qurator-spk / eynollah

Document Layout Analysis
Apache License 2.0

Order of regions #51

Closed AriVesalainen closed 1 year ago

AriVesalainen commented 3 years ago

The order of text regions detected by eynollah is not right. When running eynollah-segment on the attached image, the text regions are presented in the wrong order.

The workflow used is:

"olena-binarize -I OCR-D-IMG -O OCR-D-BIN"
"eynollah-segment -I OCR-D-BIN -O OCR-D-SEG -P models default -P curved_line true"
"tesserocr-recognize -I OCR-D-SEG -O OCR-D-OCR-TESSEROCR -P model ecco"
'fileformat-transform -I OCR-D-OCR-TESSEROCR -O OCR-D-TEXT -P from-to "page text"'

Test19.zip

vahidrezanezhad commented 3 years ago

Dear @AriVesalainen ,

reading_order_false? Based on your result, this is the reading order, but I could not find a mistake in it. Can you explain in detail what is wrong with the reading order?

AriVesalainen commented 3 years ago

I double-checked and you are right: the outputs of segmentation and recognition show the right order, but "fileformat-transform" extracts the paragraphs in the wrong order.

bertsky commented 3 years ago

I believe this is due to #22 – so in essence, the representation in eynollah is consistent with PageViewer, but wrong w.r.t. PAGE-XML (and thus also XSL transformations).

kba commented 3 years ago

> I double-checked and you are right: the outputs of segmentation and recognition show the right order, but "fileformat-transform" extracts the paragraphs in the wrong order.

> I believe this is due to #22 – so in essence, the representation in eynollah is consistent with PageViewer, but wrong w.r.t. PAGE-XML (and thus also XSL transformations).

ocrd_fileformat should now use https://github.com/kba/page-to-alto for the PAGE transformation, not XSLT, and should respect the PAGE-XML reading order. I think we need a new release for ocrd_fileformat and ocrd_all.

bertsky commented 3 years ago

> ocrd_fileformat should now use https://github.com/kba/page-to-alto for the PAGE transformation, not XSLT, and should respect the PAGE-XML reading order. I think we need a new release for ocrd_fileformat and ocrd_all.

Ah, sorry, I was not aware of that. But still: outside of the PRImA core libs, PageViewer, PageConverter and eynollah, we have to stick with the PAGE-XML spec, which requires using @index instead of XML ordering. And that's also what OCR-D, and thus page-to-alto, does.

So IMO this is still a duplicate of #22. (IIRC the actual blocker is that we have no response on https://github.com/PRImA-Research-Lab/prima-core-libs/issues/13 yet.)
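To illustrate the difference between @index ordering and XML ordering discussed above, here is a minimal Python sketch that collects region-level text by following the ReadingOrder references sorted on @index. It is not code from OCR-D or page-to-alto. The namespace version, the sample document, and the function name `region_texts_in_reading_order` are assumptions made for illustration, and only a flat OrderedGroup is handled.

```python
import xml.etree.ElementTree as ET

# Assumption: the 2019-07-15 PAGE namespace; files written against other
# schema versions use a different namespace URI.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}

# Hypothetical sample: XML order is r_b, r_a, but @index says r_a comes first.
SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page>
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="1" regionRef="r_b"/>
        <RegionRefIndexed index="0" regionRef="r_a"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r_b"><TextEquiv><Unicode>second</Unicode></TextEquiv></TextRegion>
    <TextRegion id="r_a"><TextEquiv><Unicode>first</Unicode></TextEquiv></TextRegion>
  </Page>
</PcGts>"""


def region_texts_in_reading_order(page_xml: str) -> list[str]:
    """Return region-level text ordered by RegionRefIndexed/@index."""
    root = ET.fromstring(page_xml)
    # Map region id -> element, regardless of where it sits in the file.
    regions = {r.get("id"): r for r in root.iter(f"{{{PAGE_NS}}}TextRegion")}
    refs = root.findall(
        ".//pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", NS
    )
    # Sort numerically on @index instead of trusting document order.
    refs.sort(key=lambda ref: int(ref.get("index")))
    texts = []
    for ref in refs:
        region = regions.get(ref.get("regionRef"))
        if region is None:
            continue  # dangling reference; skipped in this sketch
        unicode_el = region.find("pc:TextEquiv/pc:Unicode", NS)
        texts.append((unicode_el.text or "") if unicode_el is not None else "")
    return texts


print(region_texts_in_reading_order(SAMPLE))
```

A consumer that simply iterates over the TextRegion elements in document order would emit the two regions of this sample the other way round, which is the kind of mismatch described in this thread.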

bertsky commented 3 years ago

> Ah, sorry, I was not aware of that. But still: outside of the PRImA core libs, PageViewer, PageConverter and eynollah, we have to stick with the PAGE-XML spec, which requires using @index instead of XML ordering. And that's also what OCR-D, and thus page-to-alto, does.

I was not aware that eynollah has already fixed #22 in the meantime by sorting on @index before serialization. (This is enough to make both PageViewer and OCR-D happy.)

Also, @kba, I misread your "should" as meaning that you believed it already worked that way, rather than as a call to action to make ocrd_fileformat start using page-to-alto (which I fully support in light of this, as implementing @index sorting would be hard to do with XSLT).
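The idea of sorting on @index before serialization, mentioned above, could look roughly like the following Python sketch. This is not eynollah's actual implementation. The namespace version, the sample document, and the helper name `sort_regions_by_index` are assumptions, and only TextRegion elements under a flat OrderedGroup are reordered.

```python
import xml.etree.ElementTree as ET

# Assumption: the 2019-07-15 PAGE namespace.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}

# Hypothetical sample whose XML order disagrees with @index.
SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page>
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r_a"/>
        <RegionRefIndexed index="1" regionRef="r_b"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r_b"/>
    <TextRegion id="r_a"/>
  </Page>
</PcGts>"""


def sort_regions_by_index(root: ET.Element) -> None:
    """Reorder the TextRegion children of Page so XML order matches @index."""
    page = root.find("pc:Page", NS)
    # Region id -> position in the reading order.
    rank = {
        ref.get("regionRef"): int(ref.get("index"))
        for ref in page.findall(
            "pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", NS
        )
    }
    tag = f"{{{PAGE_NS}}}TextRegion"
    regions = [el for el in page if el.tag == tag]
    for el in regions:
        page.remove(el)
    # Stable sort: regions missing from the reading order keep their
    # relative order and go last.
    regions.sort(key=lambda el: rank.get(el.get("id"), len(rank)))
    # Re-append the regions at the end of Page, after ReadingOrder.
    page.extend(regions)


root = ET.fromstring(SAMPLE)
sort_regions_by_index(root)
print([el.get("id") for el in root.iter(f"{{{PAGE_NS}}}TextRegion")])
```

After such a pass, a tool that reads regions in document order and a tool that follows @index should agree on this sample, which matches the remark that the fix satisfies both PageViewer and OCR-D.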