openpreserve / jpylyzer

JP2 (JPEG 2000 Part 1) validator and properties extractor. Jpylyzer was specifically created to check that a JP2 file really conforms to the format's specifications. Additionally jpylyzer is able to extract technical characteristics.
http://jpylyzer.openpreservation.org/

Feature request: User Manual as Markdown Extra, figures as SVG #50

Closed bitsgalore closed 10 years ago

bitsgalore commented 10 years ago

The source document of the User Manual is currently an MS Word document, which is awkward to edit and maintain. It also creates a dependency on proprietary software (LibreOffice / OpenOffice will mess up the layout). Some of the figures were originally created in MS PowerPoint, which has similar problems.

An alternative would be to migrate the User Manual to Markdown Extra (which includes table support), and to provide the figures as SVG. This would lower the barrier to contributing to the User Manual, and it would simplify things as well. In combination with a tool like Pandoc it would also enable us to generate versions of the User Manual in pretty much any desired delivery format (HTML, PDF, EPUB, etc.).
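As a rough illustration of the Pandoc step, the sketch below builds one Pandoc command per delivery format. The file name `userManual.md` and the helpers `build_commands` and `generate` are hypothetical; `-f`, `-s` and `-o` are standard Pandoc options, and Pandoc infers the output format from the extension passed to `-o`. PDF output additionally needs a PDF engine (such as LaTeX) to be installed, which is an assumption about the local setup.

```python
import subprocess


def build_commands(source="userManual.md", formats=("html", "pdf", "epub")):
    """Return one Pandoc command (as an argument list) per delivery format.

    The source name and the list of formats are illustrative assumptions.
    """
    stem = source.rsplit(".", 1)[0]
    return [
        # -f: input is Markdown Extra; -s: standalone document; -o: output file
        ["pandoc", "-f", "markdown_phpextra", "-s", "-o", f"{stem}.{ext}", source]
        for ext in formats
    ]


def generate(source="userManual.md"):
    """Run every command; raises CalledProcessError if Pandoc fails."""
    for command in build_commands(source):
        subprocess.run(command, check=True)
```

Calling `generate()` in the directory that holds the Markdown source would write one output file per format next to it.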

How to do this?

Possible workflow:

  1. Convert Word doc to HTML
  2. Clean up HTML if needed (e.g. HTMLTidy)
  3. Convert HTML to Markdown. In Pandoc:

    pandoc -f html -t markdown_phpextra umanual.html > umanual.md

  4. Manually clean up result
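
Steps 2 and 3 of the workflow above could be scripted roughly as follows. This is only a sketch: the intermediate file name `umanual.clean.html` and the helper names are hypothetical, and it assumes that both `tidy` and `pandoc` are on the PATH. Step 1 (the export from Word) and step 4 (the manual clean-up) remain manual.

```python
import subprocess


def tidy_command(src, dst):
    """Step 2: let HTML Tidy write well-formed XHTML to dst."""
    return ["tidy", "-q", "-asxhtml", "-o", dst, src]


def pandoc_command(src, dst):
    """Step 3: the Pandoc call from the list above, using -o instead of '>'."""
    return ["pandoc", "-f", "html", "-t", "markdown_phpextra", "-o", dst, src]


def html_to_markdown(word_html,
                     clean_html="umanual.clean.html",  # hypothetical name
                     markdown="umanual.md"):
    """Run Tidy, then Pandoc, and return the name of the Markdown file."""
    tidy = subprocess.run(tidy_command(word_html, clean_html))
    # Tidy exits with 1 when it only reported warnings, so accept 0 and 1.
    if tidy.returncode > 1:
        raise RuntimeError(f"tidy failed with exit code {tidy.returncode}")
    subprocess.run(pandoc_command(clean_html, markdown), check=True)
    return markdown
```

For example, `html_to_markdown("umanual.html")` would leave the cleaned HTML and the generated `umanual.md` in the current directory.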

bitsgalore commented 10 years ago

Note: getting tables right is tricky this way. See the HTML below, which is what I got after exporting from MS Word and cleaning up the result with Tidy:

<table class="MsoNormalTable c7" border="1" cellspacing="0" cellpadding="0">
    <tr>
      <td width="229" valign="top" class='c1'>
        <p class="Tablecellheading"><span lang="EN-GB" xml:lang="EN-GB">Test
        name</span></p>
      </td>

      <td width="310" valign="top" class='c2'>
        <p class="Tablecellheading"><span lang="EN-GB" xml:lang="EN-GB">True
        if</span></p>
      </td>
    </tr>

    <tr>
      <td width="229" valign="top" class='c3'>
        <p class="Tablecell"><span lang="EN-GB" xml:lang=
        "EN-GB">boxLengthIsValid</span></p>
      </td>

      <td width="310" valign="top" class='c4'>
        <p class="Tablecell"><span lang="EN-GB" xml:lang="EN-GB">Size of box contents
        equals 4 bytes</span></p>
      </td>
    </tr>

    <tr>
      <td width="229" valign="top" class='c5'>
        <p class="Tablecell"><span lang="EN-GB" xml:lang=
        "EN-GB">signatureIsValid</span></p>
      </td>

      <td width="310" valign="top" class='c6'>
        <p class="Tablecell"><span lang="EN-GB" xml:lang="EN-GB">Signature equals
        0x0d0a870a</span></p>
      </td>
    </tr>
  </table>

Pandoc does not convert it to a nicely formatted Markdown table. After some experimentation I got the above example to work by applying the following steps:

  1. Strip all p and span subelements from each td element, keeping only their text
  2. Wrap the first row in a thead element
  3. Change the td elements in the first row to th
  4. Wrap the remainder of the table in a tbody element

This produces something like this:

  <table class="MsoNormalTable c7" border="1" cellspacing="0" cellpadding="0">
    <thead>
      <tr>
        <th width="229" valign="top" class='c1'>Test name</th>

        <th width="310" valign="top" class='c2'>True if</th>
      </tr>
    </thead>

    <tbody>
      <tr>
        <td width="229" valign="top" class='c3'>boxLengthIsValid</td>

        <td width="310" valign="top" class='c4'>Size of box contents equals 4 bytes</td>
      </tr>

      <tr>
        <td width="229" valign="top" class='c5'>signatureIsValid</td>

        <td width="310" valign="top" class='c6'>Signature equals 0x0d0a870a</td>
      </tr>
    </tbody>
  </table>

Throwing this at Pandoc produces:

|Test name|True if|
|:--------|:------|
|boxLengthIsValid|Size of box contents equals 4 bytes|
|signatureIsValid|Signature equals 0x0d0a870a|

Which will render as:

    Test name         True if
    boxLengthIsValid  Size of box contents equals 4 bytes
    signatureIsValid  Signature equals 0x0d0a870a

So the trick here will be to automate the above changes throughout the document.
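
A minimal sketch of how that automation might look for a single table, using only the Python standard library. It is not the method that was eventually used for the manual. The function name `normalise_table` is hypothetical, and the sketch assumes that the table fragment is well-formed XML (as in the Tidy output above), carries no XML namespace, has its header in the first row, and contains no nested tables.

```python
import xml.etree.ElementTree as ET


def normalise_table(table_html):
    """Apply the four steps listed above to one <table> fragment."""
    table = ET.fromstring(table_html)
    rows = table.findall("tr")  # direct child rows only

    # Step 1: replace the p/span wrappers in every cell by their plain text,
    # collapsing line breaks and repeated spaces into single spaces.
    for row in rows:
        for cell in row.findall("td"):
            text = " ".join("".join(cell.itertext()).split())
            for child in list(cell):
                cell.remove(child)
            cell.text = text

    thead = ET.Element("thead")
    tbody = ET.Element("tbody")
    for index, row in enumerate(rows):
        table.remove(row)
        if index == 0:
            # Steps 2 and 3: first row goes into thead, its cells become th.
            for cell in row.findall("td"):
                cell.tag = "th"
            thead.append(row)
        else:
            # Step 4: all remaining rows go into tbody.
            tbody.append(row)
    table.append(thead)
    table.append(tbody)
    return ET.tostring(table, encoding="unicode")
```

Applying this to every table in the document and feeding the result to Pandoc should then give tables like the one shown above. Real Word exports may contain merged cells or entities such as `&nbsp;` that this sketch does not handle.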

bitsgalore commented 10 years ago

All done, see:

https://github.com/openplanets/jpylyzer/tree/master/doc

This is now used to produce an online version of the documentation:

http://openplanets.github.io/jpylyzer/userManual.html

Export to delivery formats other than HTML needs more work ...