internetarchive / epub

For code related to making ePub files

TOC parser broken #35

Open tfmorris opened 8 years ago

tfmorris commented 8 years ago

It's not clear that the _toc.xml files are useful (the few that I examined were pretty incomplete), but the current parsing code doesn't match the schema of the XML files at all. The TOC parsing code expects a simple list of elements at the top level of _toc.xml, each containing <title> and <pageno> elements, but the actual structure is:

    <ocr_analysis>
      <version>1</version>
      <toc>
        <entry>
          <level>1</level>
          <refpage>
            <page>
              <name>7</name>
              <leaf>17</leaf>
              <index>16</index>
            </page>
          </refpage>
          <title>
            <word>
              <text>\9I</text>
              <box>209 401 347 327</box>
            </word>
            <word>
              <text>?</text>
              <box>400 401 447 331</box>
            </word>