qurator-spk / eynollah

Document Layout Analysis
Apache License 2.0

Reverse text line order from OCR-D #43

Closed aurichje closed 3 years ago

aurichje commented 3 years ago

Hi, using eynollah in an OCR-D workflow produced a reversed text line order within each region, so that the actual last line is line_001 in the PAGE XML.

I'm new to eynollah and OCR-D, so I might have made a mistake somewhere. Any ideas anyone? Thanks!

I used this workflow:

ocrd process \
  "sbb-binarize -I OCR-D-IMG -O OCR-D-BIN -P model default" \
  "eynollah-segment -I OCR-D-BIN -O OCR-D-SEG -P models default -P curved_line true" \
  "calamari-recognize -I OCR-D-SEG -O OCR-D-OCR -P checkpoint_dir qurator-gt4histocr-1.0"

Used image

![PageView screenshot](https://user-images.githubusercontent.com/56193556/122393610-56d73600-cf75-11eb-829e-bac8b5e373f6.png)

And here's the xml section corresponding to the first news paragraph:

XML

```xml
Kulturkampfes halten.
haben, weil ſie ihn für einen Gegner Bismarcks und des
bereiteten Feierlichkeiten zeigten, mag darin ſeinen Grund
Schwarzen ſich weniger zurückhaltend bei den dem Kronprinzen
allenthalben einen ſympathiſchen Empfang. Daß auch die
Reiches. der in Bahern mehrere Truppenrevüen abhielt, fand
gendſten Truppeninſpektionen vor. Der Kronprinz des Deutſchen
ſich des veſten Wohlſeins und nimmt noch häufig die anſtren—
ſehen und begünſtigen. Se. M. der Deutſche Kaiſer erfreut
höheren geiſtlichen Behörden ſolche Vorpoftengefechte gerne
der Tagesordnung und werden ſolange vorkommen, als die
ſperrungen zelotiſcher Hetzkapläne ſtehen auch jetzt noch auf
d. h. Nachrichten von größerem Belange, denn kleine Ein—
Nachrichten in der letzten Zeit etwas ſparſamer geworden,
Von den deutfchen Cultur-Kampfſtätten ſind die
Roſenheim, den 5. September.

Kulturkampfes halten.
haben, weil ſie ihn für einen Gegner Bismarcks und des
bereiteten Feierlichkeiten zeigten, mag darin ſeinen Grund
Schwarzen ſich weniger zurückhaltend bei den dem Kronprinzen
allenthalben einen ſympathiſchen Empfang. Daß auch die
Reiches. der in Bahern mehrere Truppenrevüen abhielt, fand
gendſten Truppeninſpektionen vor. Der Kronprinz des Deutſchen
ſich des veſten Wohlſeins und nimmt noch häufig die anſtren—
ſehen und begünſtigen. Se. M. der Deutſche Kaiſer erfreut
höheren geiſtlichen Behörden ſolche Vorpoftengefechte gerne
der Tagesordnung und werden ſolange vorkommen, als die
ſperrungen zelotiſcher Hetzkapläne ſtehen auch jetzt noch auf
d. h. Nachrichten von größerem Belange, denn kleine Ein—
Nachrichten in der letzten Zeit etwas ſparſamer geworden,
Von den deutfchen Cultur-Kampfſtätten ſind die
Roſenheim, den 5. September.
```

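
The reversed order described above can be checked mechanically. The following is only a minimal sketch, not part of eynollah or OCR-D: it assumes Python 3.8+ (for the `{*}` namespace wildcard), horizontal single-column text, and integer `Coords/@points` values. The sample document and the function names `top_of` and `regions_with_unsorted_lines` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one region whose lines are stored bottom-to-top,
# mimicking the symptom reported in this issue.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="region_0001">
      <Coords points="0,0 100,0 100,60 0,60"/>
      <TextLine id="line_001"><Coords points="0,40 100,40 100,60 0,60"/></TextLine>
      <TextLine id="line_002"><Coords points="0,20 100,20 100,40 0,40"/></TextLine>
      <TextLine id="line_003"><Coords points="0,0 100,0 100,20 0,20"/></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def top_of(line):
    """Return the smallest y value of the line's own Coords polygon."""
    points = line.find("{*}Coords").get("points").split()
    return min(int(point.split(",")[1]) for point in points)


def regions_with_unsorted_lines(root):
    """Return the ids of TextRegions whose TextLines are not stored top-to-bottom."""
    bad = []
    for region in root.findall(".//{*}TextRegion"):
        tops = [top_of(line) for line in region.findall("{*}TextLine")]
        if tops != sorted(tops):
            bad.append(region.get("id"))
    return bad


root = ET.fromstring(SAMPLE)
print(regions_with_unsorted_lines(root))  # prints ['region_0001']
```

For a real file, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE)`.
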
mikegerber commented 3 years ago

This may be triggered by -P curved_line true. Could you try it without?

aurichje commented 3 years ago

Sure. Without the curved_line parameter set to true I don't seem to get any textlines at all, just regions.

Workflow:

ocrd process \
  "sbb-binarize -I OCR-D-IMG -O OCR-D-BIN -P model default" \
  "eynollah-segment -I OCR-D-BIN -O OCR-D-SEG -P models default" \
  "calamari-recognize -I OCR-D-SEG -O OCR-D-OCR -P checkpoint_dir qurator-gt4histocr-1.0"

Part of the XML (SEG):

```xml
OCR-D/core 2.23.3 2021-06-17T16:11:54.470845 2021-06-17T16:11:54.470845
```

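
To confirm the "regions but no textlines" symptom without opening a viewer, one could count the `TextLine` children of each `TextRegion`. This is a rough sketch under assumptions: Python 3.8+ for the `{*}` wildcard, and a hypothetical sample document and function name (`line_counts`); nothing here comes from the eynollah code base.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one region with two lines, one region with none.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="region_0001">
      <Coords points="0,0 100,0 100,40 0,40"/>
      <TextLine id="line_001"><Coords points="0,0 100,0 100,20 0,20"/></TextLine>
      <TextLine id="line_002"><Coords points="0,20 100,20 100,40 0,40"/></TextLine>
    </TextRegion>
    <TextRegion id="region_0002">
      <Coords points="0,50 100,50 100,90 0,90"/>
    </TextRegion>
  </Page>
</PcGts>"""


def line_counts(root):
    """Map each TextRegion id to the number of TextLine elements directly inside it."""
    return {
        region.get("id"): len(region.findall("{*}TextLine"))
        for region in root.findall(".//{*}TextRegion")
    }


counts = line_counts(ET.fromstring(SAMPLE))
print(counts)  # prints {'region_0001': 2, 'region_0002': 0}
```

A result in which every region maps to 0 would match the behaviour reported here.
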
mikegerber commented 3 years ago

That's even weirder 😄 Especially because that seems to be the exact workflow I use a lot in my work and have personally tested on >250 documents. I'll do some tests myself tomorrow to see if I can spot anything special about this.

vahidrezanezhad commented 3 years ago

> Sure. Without the curved_line parameter set to true I don't seem to get any textlines at all, just regions.
>
> Workflow:
>
> ocrd process \
>   "sbb-binarize -I OCR-D-IMG -O OCR-D-BIN -P model default" \
>   "eynollah-segment -I OCR-D-BIN -O OCR-D-SEG -P models default" \
>   "calamari-recognize -I OCR-D-SEG -O OCR-D-OCR -P checkpoint_dir qurator-gt4histocr-1.0"
>
> part of XML ( SEG)

Could you please install opencv-python==4.2.0.34? This should solve the issue.

aurichje commented 3 years ago

> Could you please install opencv-python == 4.2.0.34 . This should solve the issue.

Thanks for weighing in on this @vahidrezanezhad. I uninstalled opencv-python-headless (4.5.2.54) and installed opencv-python (4.2.0.34) using pip in the venv, but this did not solve the issue.

I also tried the Docker version with:

docker run --rm -u $(id -u):$(id -g) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd-eynollah-segment -I OCR-D-BIN -O OCR-D-SEG -P models models_eynollah

This produced the same results.

I figured this must be a local issue, so this morning I set up OCR-D on a server running Ubuntu 18.04 (I'm running elementary OS on my machine). Again, same issue (with both of the opencv versions).

vahidrezanezhad commented 3 years ago

> > Could you please install opencv-python == 4.2.0.34 . This should solve the issue.
>
> Thanks for weighing in on this @vahidrezanezhad. I uninstalled opencv-python-headless (4.5.2.54) and installed opencv-python (4.2.0.34) using pip in the venv, but this did not solve the issue.
>
> I also tried the docker version with: docker run --rm -u $(id -u):$(id -g) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd-eynollah-segment -I OCR-D-BIN -O OCR-D-SEG -P models models_eynollah
>
> Which produced the same results.
>
> I figured this must be a local issue, so I set up OCR-D on a server running ubuntu 18.04 (I'm running elementary OS on my machine) this morning. Again, same issue (for both of the opencv versions).

I don't think uninstalling opencv-python-headless is a good idea. When you install eynollah with "pip install .", opencv-python-headless is installed but opencv-python is not. In addition, we need opencv-python <= 4.2.0.34. So please try again with both opencv-python-headless==4.5.1.48 and opencv-python==4.2.0.34 installed. This should solve the problem. I have reproduced this error and solved it by changing the requirements.

vahidrezanezhad commented 3 years ago

You can also share the output of "pip list". That will help us resolve your problem, too.

vahidrezanezhad commented 3 years ago

@aurichje please just check that in the end you have both opencv-python-headless==4.5.1.48 and opencv-python==4.2.0.34 installed.
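
One way to run that check from inside the virtual environment is sketched below. It is only an illustration: the pins are copied from this thread and apply to the workaround discussed here, `importlib.metadata` needs Python 3.8+, and the helper names `installed_versions` and `mismatches` are made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from this thread (workaround for the eynollah release discussed here).
EXPECTED = {
    "opencv-python": "4.2.0.34",
    "opencv-python-headless": "4.5.1.48",
}


def installed_versions(names):
    """Return {distribution name: installed version, or None if it is missing}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def mismatches(installed, expected=EXPECTED):
    """Return {name: (installed, wanted)} for every pin that is not satisfied."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in expected.items()
        if installed.get(name) != wanted
    }


problems = mismatches(installed_versions(EXPECTED))
if problems:
    for name, (have, want) in problems.items():
        print(f"{name}: installed {have}, expected {want}")
else:
    print("Both OpenCV packages match the versions from this thread.")
```
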

aurichje commented 3 years ago

> I think uninstalling opencv-python-headless is not a good idea. The point is that when you install eynollah by "pip install .", opencv-python-headless is installed but the opencv-python is not installed. The other point is we need opencv-python <= 4.2.0.34. So again try it with both opencv-python-headless === 4.5.1.48 and opencv-python == 4.2.0.34 installed. This should solve the problem. I have reproduced this error and by changing the requirements I've solved it.

I reinstalled opencv-python-headless==4.5.1.48 (not 4.5.2.54, which I had before) and that fixed it when using curved_line false. For curved_line true I still get the reversed order -- however, that's not a problem for me as I don't need to use the curved_line option.

Thank you so much for your help!

If still relevant, here's the output of `pip list`

```
Package                     Version
--------------------------- -----------
absl-py                     0.13.0
astor                       0.8.1
atomicwrites                1.4.0
attrs                       21.2.0
bagit                       1.8.1
bagit-profile               1.3.1
cached-property             1.5.2
calamari-ocr                0.3.5
certifi                     2021.5.30
chardet                     4.0.0
click                       8.0.1
colorama                    0.4.4
cycler                      0.10.0
decorator                   4.4.2
Deprecated                  1.2.0
dinglehopper                0.0.0
edit-distance               1.0.4
eynollah                    0.0.5
Flask                       2.0.1
gast                        0.2.2
google-pasta                0.2.0
grpcio                      1.38.0
h5py                        2.10.0
idna                        2.10
imageio                     2.9.0
importlib-metadata          4.5.0
imutils                     0.5.4
itsdangerous                2.0.1
Jinja2                      3.0.1
joblib                      1.0.1
jsonschema                  3.2.0
Keras                       2.3.1
Keras-Applications          1.0.8
Keras-Preprocessing         1.1.2
kiwisolver                  1.3.1
lxml                        4.6.3
Markdown                    3.3.4
MarkupSafe                  2.0.1
matplotlib                  3.4.2
multimethod                 1.3
networkx                    2.5.1
numpy                       1.18.5
ocrd                        2.24.0
ocrd-cis                    0.1.5
ocrd-modelfactory           2.24.0
ocrd-models                 2.24.0
ocrd-olahd-client           0.0.1
ocrd-repair-inconsistencies 0.0.0
ocrd-tesserocr              0.12.0
ocrd-utils                  2.24.0
ocrd-validators             2.24.0
ocrd-wrap                   0.1.7
opencv-python               4.2.0.34
opencv-python-headless      4.5.1.48
opt-einsum                  3.3.0
Pillow                      8.2.0
pip                         21.1.2
prettytable                 2.1.0
protobuf                    3.17.3
pyparsing                   2.4.7
pyrsistent                  0.17.3
python-bidi                 0.4.2
python-dateutil             2.8.1
python-Levenshtein          0.12.2
PyWavelets                  1.1.1
PyYAML                      5.4.1
requests                    2.25.1
requests-toolbelt           0.9.1
scikit-image                0.18.1
scikit-learn                0.24.2
scipy                       1.6.3
setuptools                  57.0.0
Shapely                     1.7.1
six                         1.16.0
tensorboard                 1.15.0
tensorflow-estimator        1.15.1
tensorflow-gpu              1.15.5
termcolor                   1.1.0
tesserocr                   2.5.2b0
threadpoolctl               2.1.0
tifffile                    2021.6.14
tqdm                        4.61.1
typing-extensions           3.10.0.0
uniseg                      0.7.1.post2
urllib3                     1.26.5
validators                  0.18.2
wcwidth                     0.2.5
Werkzeug                    2.0.1
wheel                       0.36.2
wrapt                       1.12.1
XlsxWriter                  1.4.3
zipp                        3.4.1
```

vahidrezanezhad commented 3 years ago

> > I think uninstalling opencv-python-headless is not a good idea. The point is that when you install eynollah by "pip install .", opencv-python-headless is installed but the opencv-python is not installed. The other point is we need opencv-python <= 4.2.0.34. So again try it with both opencv-python-headless === 4.5.1.48 and opencv-python == 4.2.0.34 installed. This should solve the problem. I have reproduced this error and by changing the requirements I've solved it.
>
> I reinstalled opencv-python-headless==4.5.1.48 (and not 4.5.2.54, what I had before) and that fixed it when using curved_line false. For curved_line true I still get the reversed order -- however that's not a problem for me as I don't need to use the curved_line option.
>
> Thank you so much for your help!
>
> If still relevant, here's the output of pip list

The order of textlines with curved_line enabled is a separate issue. I will resolve it too.
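
Until that is resolved, anyone who does need curved_line could reorder the lines in a post-processing step. The sketch below is a hypothetical workaround, not the maintainers' fix: it assumes Python 3.8+, horizontal single-column regions, and integer `Coords/@points` values, and it leaves the line ids untouched (so `line_001` may no longer be first). The sample document and the helper names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: lines stored bottom-to-top, as in the reported output.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="region_0001">
      <Coords points="0,0 100,0 100,60 0,60"/>
      <TextLine id="line_001"><Coords points="0,40 100,40 100,60 0,60"/></TextLine>
      <TextLine id="line_002"><Coords points="0,20 100,20 100,40 0,40"/></TextLine>
      <TextLine id="line_003"><Coords points="0,0 100,0 100,20 0,20"/></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def top_of(line):
    """Return the smallest y value of the line's own Coords polygon."""
    points = line.find("{*}Coords").get("points").split()
    return min(int(point.split(",")[1]) for point in points)


def sort_lines_top_to_bottom(root):
    """Reorder TextLine children of every TextRegion in place, top line first."""
    for region in root.findall(".//{*}TextRegion"):
        children = list(region)
        # Remember which child positions hold TextLines so that other
        # children (Coords, TextEquiv, ...) keep their places.
        slots = [i for i, child in enumerate(children)
                 if local_name(child.tag) == "TextLine"]
        ordered = sorted((children[i] for i in slots), key=top_of)
        for i, line in zip(slots, ordered):
            region[i] = line


root = ET.fromstring(SAMPLE)
sort_lines_top_to_bottom(root)
print([line.get("id") for line in root.findall(".//{*}TextLine")])
# prints ['line_003', 'line_002', 'line_001']
```
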

mikegerber commented 3 years ago

So the workaround is having to install an old version of opencv-python and(!) an old version of opencv-python-headless?

vahidrezanezhad commented 3 years ago

> So the workaround is having to install an old version of opencv-python and(!) an old version of opencv-python-headless?

At the moment, yes, but I will try to resolve this in the code to make it compatible with the latest opencv-python-headless version.

kba commented 3 years ago

The problem was a combination of an oversight on our side (https://github.com/vahidrezanezhad/eynollah/commit/d1330ffb805b117fe324e9c0dc90eba06633dbb2) and an API change in OpenCV. Fixed in v0.0.6.