maxnth opened this issue 4 years ago (status: Open)
Could you upload the result before the dewarping step, too? My hunch is that the dewarping produces lines that are too thick. Ideally, upload the whole workspace contents for this page.

There is also an issue that the line texts aren't matching the line images, but this could just be an issue with PAGE Viewer: the text box should show the result for the first line, but gives some text from the second line.
> Could you upload the result before the dewarping step, too?

`OCR-D-SEG-LINE_0005.xml` (in sbb-seg.zip) is the PAGE XML output by `sbb-textline-detector`; `OCR-D-OCR2_0005.xml` and `OCR-TXT2_0005.txt` are the OCR output from running `calamari-recognize` directly on `OCR-D-SEG-LINE_0005.xml`.
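For reference, that direct recognition run looks roughly like this (only a sketch: the file group names match the attached files, but the parameter name and checkpoint path are placeholders and depend on the installed ocrd_calamari version):

```sh
# Run recognition directly on the sbb-textline-detector line segmentation,
# i.e. without a dewarping step in between.
# The checkpoint parameter/path below is a placeholder.
ocrd-calamari-recognize \
  -I OCR-D-SEG-LINE \
  -O OCR-D-OCR2 \
  -P checkpoint '/path/to/calamari_models/*.ckpt.json'
```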
> There is also an issue that the line texts aren't matching the line images

That bug (?) appears in LAREX as well, but my first thought was that it is a problem caused by LAREX, since LAREX isn't fully compatible with OCR-D yet.
Here is the result with my (much simpler) my_ocrd_workflow.
2020-10-sbb_textline_detection-issue-42.zip
The result is fine (paragraphs 2+3):
Die gewoͤhnlichſte Schrift iſt die Current⸗ ſchrift, deren Buchſtaben nicht zu gerade herun— ter, ſondern mehr von der Rechten zur Linken herabliegend geſchrieben werden muͤſſen — Es iſt gut, wenn die unter oder uͤber die Linie hervor⸗ ragende Buchſtaben alle gleich weit hervorragen. Es iſt ein großer Fehler, wenn die Buchſtaben zu gedraͤngt ſtehen, oder zu weit gedehnt ſind. Auch muß man ſich huͤten, die Buchſtaben, die zu— ſammen Ein Wort ausmachen, einzeln zu ſchrei— ben, ſondern ſie muͤſſen, ſo viel moͤglich iſt, ſo wohl mit den vorhergehenden, als mit den fol— genden zuſammenhaͤngen. — Man muß den Currentbuchſtaben nicht un— nuͤtze Zierrathen anhaͤngen, oder ihre Schwei⸗ fungen zu ſehr vergroͤßern.
So the problem is somewhere in the cropping/dewarping/deskewing, or in the handling thereof. This is going to take some time to debug. But I wanted to check out the dewarping anyway ;-)
Using your more minimal workflow with `sbb-textline-detector` gave me the same results (which look a bit more like the result I expected :D).
Yeah, superficially I only see problems with the hyphens.
I tried switching off different pre-processing steps before segmenting (since minimal pre-processing seems to work just fine in this case), and it seems that cropping is responsible for the bad results: the above workflow without `anybaseocr-crop` yields good results, while turning off the other pre-processing steps but leaving cropping in the workflow always yields bad results for this page.
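To make the comparison concrete, the two variants look roughly like this (only a sketch: all processor parameters are omitted, the file group names are placeholders, and the exact task list is simplified compared to the full recommended workflow):

```sh
# Variant A -- with cropping (bad OCR results for this page):
ocrd process \
  "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "sbb-textline-detector -I OCR-D-CROP -O OCR-D-SEG-LINE" \
  "cis-ocropy-dewarp -I OCR-D-SEG-LINE -O OCR-D-DEWARP" \
  "calamari-recognize -I OCR-D-DEWARP -O OCR-D-OCR"

# Variant B -- the same chain, but without the anybaseocr-crop step
# (sbb-textline-detector reads OCR-D-BIN directly): good OCR results.
```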
Thanks for the analysis. I'll look into the problem; it could be an API problem in `ocrd-sbb-textline-detector`.
Other @OCR-D users have also reported issues with `anybaseocr-crop`. But if the expected results can in fact be achieved with https://github.com/mikegerber/my_ocrd_workflow/, this rather hints at a problem in the OCR-D workflow or in the way `ocrd-sbb-textline-detector` writes its output PAGE-XML (cc @kba).

Btw, there is also this nice fork https://github.com/sulzbals/gbn, which provides a more granular, @OCR-D-compliant API, in case it may be useful for testing/debugging.
Regarding the way cropping and line-deskewing/dewarping are applied by `sbb-textline-detector`, @vahidrezanezhad can fill in the details much better than me.
I changed the way the code retrieves the image and calculates the coordinates; could you try again with current master / 020ffbc? (I don't have a setup with anybaseocr + cis-ocropy yet, so it would help if you could try it.)
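If it helps, updating to current master for testing could look like this (a sketch, assuming a pip-based installation in the same virtualenv; adjust for ocrd_all or Docker setups):

```sh
# Update the sbb_textline_detection installation to current master
# (which includes the change above), then re-run the workflow.
pip install -U "git+https://github.com/qurator-spk/sbb_textline_detection.git"
```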
Possibly relates to https://github.com/qurator-spk/sbb_textline_detection/pull/48
I'm not sure whether this is the right place to ask, as `sbb-textline-detector` itself worked perfectly in our OCR-D workflows and the produced segmentation results look good as well, but running any recognition (`calamari-recognize` as well as `tesserocr-recognize`) afterwards yields weird text output that seems worse than it should be, given the good segmentation results.

I basically used the (formerly) recommended workflow and substituted everything from the region segmentation up to the line segmentation with `sbb-textline-detector`.
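Roughly, the substitution looks like this (only a sketch: the surrounding binarization/cropping/deskewing steps and all processor parameters are omitted, and the file group names are placeholders, not the ones from our actual workflow):

```sh
# Segmentation steps of the (formerly) recommended workflow, simplified:
#   "tesserocr-segment-region -I OCR-D-DESKEW -O OCR-D-SEG-REGION"
#   "cis-ocropy-segment -I OCR-D-SEG-REGION -O OCR-D-SEG-LINE"
# Replaced here by a single combined region + line segmentation step:
ocrd process \
  "sbb-textline-detector -I OCR-D-DESKEW -O OCR-D-SEG-LINE"
```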
The region segmentation produced by this looks pretty good, and this impression is confirmed by the pixel accuracy evaluation we ran for several segmentation workflows (with `cis-ocropy-segment`, `tesserocr-segment-region`, …). The line segmentation looks pretty good as well and should be a solid basis for running OCR, but as stated above the results are surprisingly bad. I also tried to run the recognition directly on the produced segmentation (`OCR-D-SEG-LINE`) without dewarping first, but the results are even worse that way.

Am I missing something obvious (e.g. adding a certain step after running `sbb-textline-detector`)?

Workflow steps
Region segmentation output
Line segmentation output
Text output
The input image for the example page and the produced PAGE XML can be found here in case it helps.