kba / page-to-alto

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)
Apache License 2.0

PAGE without words #5

Open mikegerber opened 3 years ago

mikegerber commented 3 years ago

@kba asked me to put this comment from a private Gitter conversation into an issue:

Regarding "input PAGE-XML not having words", my input would be that I can live with PAGE without Word elements simply not being convertible. I would even say that word segmentation is not appropriate at this stage and should be done by either the layout segmentation or the OCR. (The OCR only because the CTC positions yield, as a by-product, a glyph segmentation that is usable for some purposes and can be transferred to words relatively easily, as in ocrd_calamari.)

kba commented 3 years ago

cd0bb8dc5099fd5ebdf7a28951b83f4b10a4e267 provides a CLI flag --(no-)skip-empty-lines which allows either skipping empty lines or creating a dummy full-width empty word (the default).

https://github.com/kba/page-to-alto/commit/5a32ea3f3bdb99560dac38ec29fc6ac215a919fb provides a CLI flag --(no-)check-words which, if enabled (the default), aborts before conversion when there are no pc:Word elements in the PAGE-XML. This will however fail on empty pages - should I also check for any pc:TextLine present to catch that special case?
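
To make the two flag pairs and their defaults concrete, here is a minimal sketch using the standard library's argparse (Python 3.9+). It is not the tool's actual CLI code, which may be built differently; the program name and help texts are made up for illustration, and only the flag names and defaults are taken from the comments above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "page-to-alto-sketch" is a hypothetical program name for this example.
    parser = argparse.ArgumentParser(prog="page-to-alto-sketch")
    # BooleanOptionalAction generates both --check-words and --no-check-words.
    parser.add_argument(
        "--check-words",
        action=argparse.BooleanOptionalAction,
        default=True,  # enabled by default, as described above
        help="abort if the PAGE-XML contains no pc:Word",
    )
    parser.add_argument(
        "--skip-empty-lines",
        action=argparse.BooleanOptionalAction,
        default=False,  # default: keep empty lines and emit a dummy empty word
        help="skip empty lines instead of creating a dummy word",
    )
    parser.add_argument("filename")
    return parser
```

Calling `build_parser().parse_args(["--no-check-words", "in.xml"])` would then disable the word check while leaving the empty-line handling at its default.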

mikegerber commented 3 years ago

5a32ea3 provides a CLI flag --(no-)check-words which, if enabled (the default), aborts before conversion when there are no pc:Word elements in the PAGE-XML. This will however fail on empty pages - should I also check for any pc:TextLine present to catch that special case?

Ah, the devil is in the details. Provided that ALTO allows pages without lines (I hope so), I would say: If there are pc:TextLines with non-empty and non-whitespace TextEquiv, check that there are pc:Words in them. (The extra pc:TextLines without text and pc:Words are then to be handled by --(no-)skip-empty-lines behavior.) The warning should be clear enough for users to discover that they need to provide input with pc:Words for the ALTO transformation to work as intended.
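
The rule proposed here can be sketched directly against the PAGE-XML with the standard library. This is only an illustration under assumptions: the function name `lines_missing_words` is hypothetical, the namespace is the 2019-07-15 PAGE namespace, and the converter itself may work on a different object model.

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}


def lines_missing_words(page_xml: str) -> list[str]:
    """Return the IDs of TextLines that carry text but contain no Word."""
    root = ET.fromstring(page_xml)
    missing = []
    for line in root.iterfind(".//pc:TextLine", NS):
        # Only the line's own TextEquiv counts, not those of nested Words.
        texts = [u.text or "" for u in line.iterfind("pc:TextEquiv/pc:Unicode", NS)]
        # Whitespace-only text is treated like no text at all.
        has_text = any(t.strip() for t in texts)
        has_words = line.find("pc:Word", NS) is not None
        if has_text and not has_words:
            missing.append(line.get("id", ""))
    return missing
```

Lines with empty or whitespace-only text are deliberately not reported, so they stay subject to the --(no-)skip-empty-lines behavior.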

kba commented 3 years ago

I've implemented the proposed behavior in 4c6b3bf:

def check_words(self):
    """Abort if any line carries text but has no pc:Word elements."""
    for reg_page in self.page_page.get_AllRegions(classes=['Text']):
        for line_page in reg_page.get_TextLine():
            textequiv = line_page.get_TextEquiv()
            # A line with text but no words cannot be converted to ALTO losslessly.
            if any(x.Unicode for x in textequiv) and not line_page.get_Word():
                raise ValueError("Line %s has TextEquiv but no words, so cannot be converted to ALTO without losing information. Use --no-check-words to override" % line_page.id)

kba commented 3 years ago

In addition, the converter now also handles propagation of the pc:TextEquiv of a pc:TextRegion down to a dummy pc:TextLine and to a dummy pc:Word with the --dummy-textline and --dummy-word flags:

With both those flags:

<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd" pcGtsId="OCR-D-OCR-CALAMARI_00001">
    <pc:Page imageFilename="OCR-D-IMG/044417.jpg" imageWidth="3195" imageHeight="4370" type="content">                                                                       
        <pc:TextRegion id="r0">                                                                                                                                              
            <pc:Coords points="0,0 1,1"/>                                                                                                                                    
            <pc:TextEquiv>                                                                                                                                                   
                <pc:Unicode>CONTENT BUT NO LINES</pc:Unicode>                                                                                                                
            </pc:TextEquiv>                                                                                                                                                  
        </pc:TextRegion>                                                                                                                                                     
    </pc:Page>                                                                                                                                                               
</pc:PcGts>                                                                                                                                                                  

becomes

<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/standards/alto/ns-v4#" xsi:schemaLocation="http://www.loc.gov/standards/alto/v4/alto-4-2.xsd">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
    <sourceImageInformation>
      <fileName>OCR-D-IMG/044417.jpg</fileName>
    </sourceImageInformation>
  </Description>
  <Styles/>
  <Tags/>
  <Layout>
    <Page ID="OCR-D-OCR-CALAMARI_00001" PHYSICAL_IMG_NR="0" WIDTH="4370" HEIGHT="4370">
      <TopMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
      <LeftMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
      <RightMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
      <BottomMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
      <PrintSpace>
        <TextBlock ID="r0" HEIGHT="1" WIDTH="1" HPOS="0" VPOS="0">
          <Shape>
            <Polygon POINTS="0,0 1,1"/>
          </Shape>
          <TextLine ID="r0-dummy-TextLine" HEIGHT="1" WIDTH="1" HPOS="0" VPOS="0">
            <Shape>
              <Polygon POINTS="0,0 1,1"/>
            </Shape>
            <String ID="r0-dummy-TextLine-dummy-Word" HEIGHT="1" WIDTH="1" HPOS="0" VPOS="0" CONTENT="CONTENT BUT NO LINES">
              <Shape>
                <Polygon POINTS="0,0 1,1"/>
              </Shape>
            </String>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
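
The propagation step on the PAGE side can be sketched as follows. This is a minimal illustration, not the converter's actual code: the function name `add_dummy_children` is hypothetical, and only the ID scheme (`-dummy-TextLine`, `-dummy-Word`) is taken from the output above.

```python
import copy
import xml.etree.ElementTree as ET

PC = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PC}


def add_dummy_children(region: ET.Element) -> None:
    """Give a line-less TextRegion a dummy TextLine holding a dummy Word."""
    if region.find("pc:TextLine", NS) is not None:
        return  # the region already has lines, nothing to propagate
    coords = region.find("pc:Coords", NS)
    textequiv = region.find("pc:TextEquiv", NS)
    if coords is None or textequiv is None:
        return  # nothing to copy down
    line = ET.Element(f"{{{PC}}}TextLine", {"id": f"{region.get('id')}-dummy-TextLine"})
    word = ET.Element(f"{{{PC}}}Word", {"id": f"{line.get('id')}-dummy-Word"})
    # The dummy word reuses the region's coordinates and text.
    word.append(copy.deepcopy(coords))
    word.append(copy.deepcopy(textequiv))
    # The dummy line wraps the word; its own TextEquiv comes after the Word.
    line.append(copy.deepcopy(coords))
    line.append(word)
    line.append(copy.deepcopy(textequiv))
    # Insert the line before the region's TextEquiv to keep the element order.
    region.insert(list(region).index(textequiv), line)
```

Because the dummy elements inherit the region's coordinates, the resulting String in ALTO spans the whole block, which is exactly what the example output shows.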

M3ssman commented 3 years ago

I haven't seen any ALTO (from Tesseract) like this in 1.5 years. It uses the SP (spatium) element to signal whitespace between words. There should be no whitespace within the CONTENT. It would be very hard not just to split the line into words but also to give each word token proper coordinates and dimensions without knowing the font type. Although this information is optional, it wouldn't make much sense to present output like this to a client viewer.

I also wonder in which contexts PAGE like this might originate. Surely not from regular OCR-D workflows? Of course, in the wild I've seen weird PAGE files produced by Transkribus, but that is a Transkribus issue: it comes from the data export, where one can choose whether only lines or also words are included.

In the context of transforming OCR-D pipeline output to proper client-viewer ALTO, I guess this makes no sense, so I'd prefer raising an exception or something similar.

kba commented 3 years ago

I haven't seen any ALTO (from Tesseract) like this in 1.5 years. It uses the SP (spatium) element to signal whitespace between words. There should be no whitespace within the CONTENT.

I understand the reasoning that it doesn't make sense to have a word with spaces in it. But from the PAGE XSD, I don't see this restriction. We do use that for the pseudo-words, i.e. a String whose CONTENT is the line-level text and whose coordinates are those of the line. I hope that ALTO consumers are robust enough to handle this.
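
A consumer that wants to detect such pseudo-words before display could look for String elements whose CONTENT contains whitespace. The following is a minimal sketch assuming the ALTO v4 namespace used in the example output above; the function name `strings_with_whitespace` is hypothetical.

```python
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}


def strings_with_whitespace(alto_xml: str) -> list[str]:
    """Return the IDs of String elements whose CONTENT contains whitespace."""
    root = ET.fromstring(alto_xml)
    return [
        string.get("ID", "")
        for string in root.iterfind(".//alto:String", ALTO_NS)
        if any(ch.isspace() for ch in string.get("CONTENT", ""))
    ]
```

For the example output above, this would report `r0-dummy-TextLine-dummy-Word`, which a viewer could then treat as line-level rather than word-level text.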

It would be very hard not just to split the line into words but also to give each word token proper coordinates and dimensions without knowing the font type. Although this information is optional, it wouldn't make much sense to present output like this to a client viewer.

Agreed, trying to implement heuristics to derive words and their coordinates from a line-level TextEquiv is too error-prone to be worth the effort.

I also wonder in which contexts PAGE like this might originate. Surely not from regular OCR-D workflows? Of course, in the wild I've seen weird PAGE files produced by Transkribus, but that is a Transkribus issue: it comes from the data export, where one can choose whether only lines or also words are included.

Calamari also has this issue, cf. https://github.com/Calamari-OCR/calamari/pull/172. @mikegerber mitigates this in ocrd_calamari though, so unless you explicitly parameterize a processor with something like -P textequiv_level line, OCR-D output should include Words (except for empty pages, obviously).

In the context of transforming OCR-D pipeline output to proper client-viewer ALTO, I guess this makes no sense, so I'd prefer raising an exception or something similar.

OK, thanks for the feedback. This should be the default behavior - if there is any line-level TextEquiv with no word-level TextEquiv, a ValueError is raised (unless overridden with --no-check-words).

mikegerber commented 3 years ago

Calamari also has this issue, cf. Calamari-OCR/calamari#172. @mikegerber mitigates this in ocrd_calamari though, so unless you explicitly parameterize a processor with something like -P textequiv_level line, OCR-D output should include Words (except for empty pages, obviously).

Side note: -P textequiv_level line is the default, so it is the other way around: You have to explicitly ask for words (e.g. word or even glyph), otherwise the output will not contain them.