qurator-spk / eynollah

Document Layout Analysis
Apache License 2.0

Memory usage explosion with very narrow images (e.g. book spine) #67

Open mikegerber opened 2 years ago

mikegerber commented 2 years ago

With this document (PPN894261851.zip) we experienced an OOM error. Further investigation revealed this memory usage (measured using procpath):

[Figure: eynollah memory usage over time for the book spine ("Buchrücken") image]

The culprit seems to be this "page" from the document - an image of a book spine:

[Image: FILE_0017_MAX.tif]

Relevant parts from the log output:

18:25:30.757 INFO eynollah - INPUT FILE PHYS_0017 (17/18)
18:25:30.780 INFO eynollah - resize and enhance image
18:25:30.780 INFO eynollah - Detected 25 DPI
18:25:40.756 INFO eynollah - Found 5 columns ([[4.1955504e-01 1.7818451e-13 2.7631987e-21 7.5972243e-22 5.8044493e-01
  0.0000000e+00]])
18:31:39.449 INFO eynollah - Image is enhanced
18:31:40.369 INFO eynollah - Enhancing took 369.5891568660736s
18:31:47.043 INFO eynollah - Image dimensions: 448x672
18:43:35.935 INFO eynollah - Image dimensions: 224x448
18:52:07.638 INFO eynollah - Image dimensions: 448x672
19:01:28.031 INFO eynollah - Textregion detection took 1787.6620445251465s
19:01:36.604 INFO eynollah - Graphics detection took 8.571088552474976s
19:01:36.604 INFO eynollah - cont_page [array([[  519,   445],
       [ 4404,   445],
       [ 4404, 27685],
       [  519, 27685]])]
19:01:41.160 INFO eynollah - Image dimensions: 448x672
19:08:15.645 INFO eynollah - textline detection took 399.04073786735535s
19:26:32.295 INFO eynollah - slope_deskew: -90.0
19:26:32.451 INFO eynollah - deskewing took 1096.8060252666473s
19:26:33.040 INFO eynollah - detection of marginals took 0.5885534286499023s
19:26:55.466 INFO eynollah - Image dimensions: 896x896
19:27:51.663 INFO eynollah - Image dimensions: 896x896
19:34:22.576 INFO eynollah - areas_cnt_text [1.60449940e-05 3.67248936e-05 4.69396395e-05 1.78734430e-05
 6.68446924e-05 1.59316018e-05 2.67794541e-05 3.35782605e-05
 2.04153178e-05 1.02601028e-04 1.49299709e-05 2.50974700e-05
 1.09640792e-04 4.56729543e-04 1.69521315e-05 7.82122588e-05
 9.06334276e-05 2.25603199e-04 1.58796304e-05 4.07455914e-05
 1.44858515e-05 1.97103964e-04 3.92242463e-05 2.14925435e-05
 2.01601854e-05 1.57520642e-05 1.14313495e-04 2.90331237e-05
 1.44291554e-04 2.15615238e-04 3.12064739e-05 4.46585667e-04
 2.03675986e-04 4.18700639e-05 2.75817038e-04 2.86669615e-04
 4.78515016e-05 1.76816212e-04 2.13172581e-04 2.02211337e-04
 3.27372684e-05 1.72403366e-05 1.62434303e-05 3.26522243e-05
 2.49226571e-05 1.41551243e-05 2.55297777e-04 2.39352001e-05
 1.48591008e-05 1.77080794e-05 1.41844173e-04 7.28828262e-05
 1.27079565e-04 1.09125803e-04 5.03886517e-05 1.61253135e-05
 2.59356273e-04 3.43578317e-05 1.49417826e-04 1.00711158e-04
 1.49819423e-05 5.42553252e-04 2.48706857e-05 2.26875554e-03
 4.71257916e-04 8.13966893e-05 7.39080805e-05 4.21195267e-04
 3.22033802e-05 2.35572262e-04 2.46580753e-05 2.20656465e-04
 2.95670119e-05 1.99759231e-05 4.83650737e-04 2.61520173e-04
 1.14686745e-04 5.78111151e-05 1.14729267e-04 1.89081467e-05
 1.68529133e-04 1.66998339e-04 1.72875834e-05 2.23552691e-04
 1.04831546e-03 6.28268293e-04 5.47693697e-04 1.98365452e-04
 2.78094331e-05 6.26397322e-05 5.01098959e-05 1.08133621e-04
 9.64258784e-05 5.27179162e-05 6.81203545e-05 1.25246392e-04
 7.48104933e-04 8.99908719e-05 6.32440181e-04 1.75379911e-05
 9.17437261e-05 3.56807405e-05 3.17781595e-05 2.56077349e-05
 1.14162306e-04 3.40275770e-04 1.91113077e-05 2.73133423e-05
 2.53143326e-04 4.32118714e-05 1.93848663e-04 3.59594963e-05
 1.95918070e-04 1.34687236e-03 1.60180634e-04 2.35761249e-05
 6.63717525e-04 4.14731913e-05 1.89790168e-05 1.82136195e-05
 1.86530142e-05 2.08773909e-04 2.22569958e-04 3.77780235e-04
 4.02589500e-05 5.98474497e-05 1.02081314e-04 3.75233635e-05
 4.72098908e-04 5.47306274e-05 1.23058868e-04 1.49281755e-03
 8.34802707e-05 1.13349662e-04 2.02093220e-04 2.57681376e-03
 2.15686108e-04 5.79150579e-05 4.43079958e-05 2.98197820e-04
 2.61132750e-05 8.44677276e-05 5.68189335e-05 3.62051794e-05
 7.14342410e-04 1.95589233e-03 1.87621542e-04 2.56549816e-05
 1.75568898e-05 1.43630100e-05 9.49763483e-04 5.73769175e-04
 3.36840932e-04 1.75474405e-05 1.04953916e-04 6.89329984e-05
 6.42224981e-05 2.66504705e-04 6.18412623e-05 5.68283828e-05
 2.05906032e-04 1.20568964e-04 2.07554943e-05 2.06421021e-05
 6.66509807e-05 3.16127959e-05 1.37913244e-05 7.39458779e-05
 3.90399840e-05 2.61038257e-05 2.60187815e-05 2.02953110e-04
 4.78609509e-05 1.26876404e-04 8.87908046e-05 4.99917791e-05
 2.68890665e-04 4.74404549e-05 1.45269562e-04 1.67092832e-04]
19:43:02.340 INFO eynollah - Job done in 4651.560835599899s

This log output is not from the run that hit the OOM, but from another run I did on a different machine to investigate the problem. If I interpret the cont_page part correctly, the image is blown up to 4404x27685 pixels, which would certainly explain the OOM error on the other machine.
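To put that size into perspective, here is a rough back-of-the-envelope sketch. It only computes the size of a single dense array with the dimensions taken from the lower-right corner of `cont_page`; the helper name `estimate_array_bytes` is made up for illustration, and how many such arrays (or model inputs derived from them) eynollah actually holds in memory at once cannot be read from the log.

```python
def estimate_array_bytes(width, height, channels=3, bytes_per_sample=1):
    """Bytes needed for one dense height x width x channels array."""
    return width * height * channels * bytes_per_sample


# Lower-right corner (x, y) of cont_page in the log above.
# Assumption: the working image is at least this large.
width, height = 4404, 27685

uint8_rgb = estimate_array_bytes(width, height, bytes_per_sample=1)
float32_rgb = estimate_array_bytes(width, height, bytes_per_sample=4)

print(f"pixels:             {width * height:,}")
print(f"aspect ratio (h/w): {height / width:.2f}")
print(f"uint8 RGB copy:     {uint8_rgb / 2**20:.0f} MiB")
print(f"float32 RGB copy:   {float32_rgb / 2**30:.2f} GiB")
```

Even a handful of float32 copies of an image this size would add up to several GiB, which is at least consistent with the OOM.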

Reproduce with `ocrd-eynollah-segment -I MAX -O TEST-SEGMENT -P models /path/to/models`.

mikegerber commented 2 years ago

While eynollah should handle this gracefully, we should also consider how to handle irrelevant images that are already marked as such in the METS structMap. In this case possibly spine and colour_checker (could also be SBB defined types):

  <mets:structMap TYPE="LOGICAL">
    <mets:div ADMID="AMD" CONTENTIDS="http://resolver.staatsbibliothek-berlin.de/SBB000205BC00000000" DMDID="DMDLOG_0000" ID="LOG_0000" LABEL="Disputationum Medicarum Undecima, De Chirurgia" ORDERLABEL="Disputationum Medicarum Undecima, De Chirurgia" TYPE="monograph">
      <mets:div ID="LOG_0001" TYPE="binding">
        <mets:div ID="LOG_0002" TYPE="cover_front"/>
        <mets:div ID="LOG_0003" TYPE="paste_down"/>
        <mets:div ID="LOG_0004" TYPE="endsheet">
          <mets:div ID="LOG_0005" TYPE="contents"/>
        </mets:div>
      </mets:div>
      <mets:div ID="LOG_0006" TYPE="title_page"/>
      <mets:div DMDID="DMDLOG_0001" ID="LOG_0007" LABEL="Quaestio Prima. [bis] 44." TYPE="section"/>
      <mets:div ID="LOG_0008" TYPE="binding">
        <mets:div ID="LOG_0009" TYPE="endsheet"/>
        <mets:div ID="LOG_0010" TYPE="paste_down"/>
        <mets:div ID="LOG_0011" TYPE="cover_back"/>
        <mets:div ID="LOG_0012" TYPE="spine"/>
      </mets:div>
      <mets:div ID="LOG_0013" TYPE="colour_checker"/>
    </mets:div>
  </mets:structMap>

(Full document: PPN894261851.zip)
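As a minimal sketch of what "already marked as such" could mean in practice, the following uses only the Python standard library to list logical divs whose `TYPE` is in a skip set. The skip set, the function name `logical_divs_to_skip` and the abridged METS snippet are assumptions for illustration; this is not how eynollah or OCR-D core currently behave.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Assumption: illustrative set of page types to skip.
SKIP_TYPES = {"spine", "colour_checker"}

# Abridged from the structMap above, wrapped in a root element
# so that the namespace prefix is declared.
METS_SNIPPET = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0008" TYPE="binding">
        <mets:div ID="LOG_0011" TYPE="cover_back"/>
        <mets:div ID="LOG_0012" TYPE="spine"/>
      </mets:div>
      <mets:div ID="LOG_0013" TYPE="colour_checker"/>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def logical_divs_to_skip(mets_xml, skip_types=SKIP_TYPES):
    """Return (ID, TYPE) of logical divs whose TYPE is in skip_types."""
    root = ET.fromstring(mets_xml)
    found = []
    for struct_map in root.iter(f"{{{METS_NS}}}structMap"):
        if struct_map.get("TYPE") != "LOGICAL":
            continue
        for div in struct_map.iter(f"{{{METS_NS}}}div"):
            if div.get("TYPE") in skip_types:
                found.append((div.get("ID"), div.get("TYPE")))
    return found


print(logical_divs_to_skip(METS_SNIPPET))
```

Mapping these logical IDs to physical pages would additionally require the `mets:structLink` section, which is not shown in the excerpt above.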

@bertsky @kba @cneud What are your thoughts on this?

bertsky commented 2 years ago

Yes, it should be possible to skip pages marked as certain types in the logical structmap – not just in any one processor, but as a general mechanism for workflows in OCR-D.

For the concrete set of supported page types, we should stick to DFG Strukturdatenset, which is strangely missing colour_checker.

This set is also partially supported by ocrd-anybaseocr-layout-analysis:

{'annotation': 0, 'binding': 1, 'chapter': 2, 'colour_checker': 3, 'contained_work': 4, 'contents': 5, 'cover': 6, 'edge': 7, 'endsheet': 8, 'epicedia': 9, 'illustration': 10, 'index': 11, 'musical_notation': 12, 'page': 13, 'paste_down': 14, 'preface': 15, 'provenance': 16, 'section': 17, 'sermon': 18, 'table': 19, 'title_page': 20}

For the general mechanism, I suggest something along the lines of our --page-id CLI option's existing numerical range syntax, but more elaborate. For example, one could define filter operators that can look into the structmap, perhaps XPath expressions with predefined functions?
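A rough sketch of what such a filter could look like, using the XPath subset supported by Python's `xml.etree.ElementTree`. Everything here is hypothetical: the function names `pages_matching` and `filter_pages`, the idea of passing exclusion expressions, and the sample `mets:structLink` entries (the links from `LOG_0012` and `LOG_0013` to `PHYS_0017` and `PHYS_0018` are assumed, not taken from the document). It is not an existing OCR-D option.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_FROM = "{http://www.w3.org/1999/xlink}from"
XLINK_TO = "{http://www.w3.org/1999/xlink}to"

# Hypothetical sample; the smLink targets are assumptions.
SAMPLE_METS = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0012" TYPE="spine"/>
      <mets:div ID="LOG_0013" TYPE="colour_checker"/>
    </mets:div>
  </mets:structMap>
  <mets:structLink>
    <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0016"/>
    <mets:smLink xlink:from="LOG_0012" xlink:to="PHYS_0017"/>
    <mets:smLink xlink:from="LOG_0013" xlink:to="PHYS_0018"/>
  </mets:structLink>
</mets:mets>
"""


def pages_matching(root, xpath):
    """Physical page IDs linked to the logical divs selected by xpath."""
    logical_ids = {div.get("ID") for div in root.findall(xpath, NS)}
    return [
        link.get(XLINK_TO)
        for link in root.findall("mets:structLink/mets:smLink", NS)
        if link.get(XLINK_FROM) in logical_ids
    ]


def filter_pages(root, all_pages, exclude_xpaths):
    """Drop every page that matches one of the exclusion expressions."""
    excluded = set()
    for xpath in exclude_xpaths:
        excluded.update(pages_matching(root, xpath))
    return [page for page in all_pages if page not in excluded]


root = ET.fromstring(SAMPLE_METS)
pages = ["PHYS_0016", "PHYS_0017", "PHYS_0018"]
exclude = [
    ".//mets:div[@TYPE='spine']",
    ".//mets:div[@TYPE='colour_checker']",
]
print(filter_pages(root, pages, exclude))
```

ElementTree covers only a small part of XPath (no `or` inside predicates and no custom functions), so the predefined functions mentioned above would need a fuller XPath implementation.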

mikegerber commented 2 years ago

> Yes, it should be possible to skip pages marked as certain types in the logical structmap – not just in any one processor, but as a general mechanism for workflows in OCR-D.
>
> For the concrete set of supported page types, we should stick to DFG Strukturdatenset, which is strangely missing colour_checker.

100% agree! Should we take this to an OCR-D core or spec issue? I have some additional thoughts to discuss (like: What happens with skipped pages in the output?)

bertsky commented 2 years ago

> Should we take this to an OCR-D core or spec issue?

Yes, we should elevate this to OCR-D/spec.

> I have some additional thoughts to discuss (like: What happens with skipped pages in the output?)

There is already some discussion on skip strategies for API changes in spec...

cneud commented 1 year ago

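With the current version including https://github.com/qurator-spk/eynollah/issues/67 I was able to

* process `FILE_0017_MAX.tif` successfully without memory explosion

* process the whole document PPN894261851 using the `-di` flag without running into memory issues

Is there anything relevant from here that is still needed for https://github.com/OCR-D/spec/issues/172#issuecomment-693593327 or can we close this?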

mikegerber commented 10 months ago

> With the current version including #67 I was able to
>
> * process `FILE_0017_MAX.tif` successfully without memory explosion
>
> * process the whole document PPN894261851 using the `-di` flag without running into memory issues
>
> Is there anything relevant from here that is still needed for OCR-D/spec#172 (comment) or can we close this?

I wouldn't know. The current version is not working for OCR-D, so I can't reproduce until it's fixed. (Yes, there is an elaborate workaround, but I am not willing to invest the time to reproduce while a lengthy changeset (#86) is still missing.)