pkp / ots

PKP XML Parsing Service
GNU General Public License v3.0

images don't appear to be making it into epub #21

Closed axfelix closed 9 years ago

axfelix commented 9 years ago

This seems unrelated to https://github.com/MartinPaulEve/meTypeset/issues/75 -- we might just need to figure out what paths are expected and pack them into the zip manually.

crism commented 9 years ago

If you can provide a sample for the test assets, I’ll fix it; jats2epub takes an assets directory as an argument, but I had mistakenly assumed that our WP files were all self-contained.

axfelix commented 9 years ago

Have a look at http://pkp-udev.lib.sfu.ca/manager/details/id/1579 -- that's a freshly run sample. Ping me on IRC if you need admin credentials to see it.

crism commented 9 years ago

OK, those pictures are embedded in the .doc file. I think we are dropping them along the way… I’ll check on that.

crism commented 9 years ago

The .docx file has the image still. The XML, however, has:

<fig position="float" orientation="portrait">
  <graphic xlink:href="media/image1.png"
    id="ID985916b7-0fbc-4e70-8acf-59b300b4d0bd" position="float"
    orientation="portrait" xlink:type="simple">
    <label>Figure 1</label>
    <caption>
      <p>Brain wave samples with dominant frequencies belonging to beta, alpha,
        theta, and delta band</p>
    </caption>
  </graphic>
</fig>

Where did image1.png end up? I think we’re only rescuing the XML file from the result space, and discarding any resulting images. I’ll trace this in code.
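One way to trace this is to list every graphic reference in the XML and check whether the file it points to exists. The following is only a minimal sketch: the function name `missing_graphics` and the idea of checking against a single base directory are assumptions for illustration, not part of the existing code.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Clark-notation name of the xlink:href attribute used on <graphic>.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def missing_graphics(xml_path, base_dir):
    """Return graphic hrefs in the XML that have no file under base_dir.

    Hypothetical helper: base_dir is whatever directory the relative
    hrefs (e.g. media/image1.png) are supposed to resolve against.
    """
    tree = ET.parse(xml_path)
    missing = []
    for element in tree.iter():
        # Compare the local name so the check works with or without
        # a default namespace on the document.
        if element.tag.rsplit("}", 1)[-1] != "graphic":
            continue
        href = element.get(XLINK_HREF)
        if href and not (Path(base_dir) / href).is_file():
            missing.append(href)
    return missing
```

Running it against the result space for a job would show whether media/image1.png was dropped before or after the XML was written.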

axfelix commented 9 years ago

I'm not 100% sure on where the images are stored immediately after the meTypeset XML conversion step, but they do make it into the html.zip output.

We don't use the images directly from the docx because, when a document is converted doc -> docx using LibreOffice, the images are retained in the obscure .wmf format; meTypeset makes its own calls to unoconv to turn these into PNGs as part of the docx -> xml step.

axfelix commented 9 years ago

(n.b. that my lingering uncertainties about temp file storage are basically the entire reason that my merge module isn't done, so any insights are greatly appreciated :) )

axfelix commented 9 years ago

FYI, most of the "temporary" files output by other modules seem to just use https://github.com/pkp/xmlps/blob/73b03e9260ff95cf4ee7d3c243d02eedd13f52ce/module/Manager/src/Manager/Entity/Job.php#L153.

axfelix commented 9 years ago

... And modules that have complex outputs with their own directory structure like meTypeset seem to wind up in their own subdirectories, e.g. var/www/var/documents/367/380/metypeset. That's probably the extent of the outputtmppath stuff that was confusing me earlier. Make sense?

crism commented 9 years ago

The HTML conversion process goes back to the meTypeset output and grabs the images from there. I think we need to do that with jats2epub, too—just point it to the media directory as part of its assets.

crism commented 9 years ago

The problem is that jats2epub takes an “extras” directory as an argument, and includes everything in that directory in the epub. We have relative references to media/foo.png in our XML. The HTML conversion pulls that from metypeset/media, left over from the NLM XML conversion. If we give the metypeset directory as the extras argument to jats2epub, we’ll end up with all of the meTypeset residue: common2, docx, media, nlm, tei. If we specify the media directory itself as the extras argument, the relative paths will break.

The solution is to copy the media directory to a staging area, then include that staging area as the extras. Unfortunately, PHP makes recursive copying a PITA, so I’m putting this off until tomorrow.
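For illustration, the staging idea looks roughly like this. It is sketched in Python rather than PHP, and the function name `stage_media` and the temp-directory prefix are made up for the example; the actual fix may be structured differently.

```python
import shutil
import tempfile
from pathlib import Path


def stage_media(metypeset_dir, staging_root=None):
    """Copy only <metypeset_dir>/media into a fresh staging directory.

    The returned directory contains media/..., so relative references
    such as media/image1.png still resolve, while the other meTypeset
    output (common2, docx, nlm, tei) is left behind.
    """
    source = Path(metypeset_dir) / "media"
    staging = Path(tempfile.mkdtemp(prefix="epub-extras-", dir=staging_root))
    if source.is_dir():
        # copytree copies recursively and creates staging/media itself.
        shutil.copytree(source, staging / "media")
    return staging
```

The staging directory, not the media directory, is then what gets passed as the extras argument, and it can be deleted once the epub has been built.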

axfelix commented 9 years ago

Makes sense.

crism commented 9 years ago

Tomorrow, next week… life intervened. Anyway, I think this works, and would love other eyes on it.

axfelix commented 9 years ago

Attempting to test, but the MySQL on the server died and doesn't seem to want to come back up. Any idea why that might've happened? Will play with it...

axfelix commented 9 years ago

Fixed via apt. Phew. Unfortunately, this doesn't seem to be working, and I also just noticed it seems to be using the non-merged XML (i.e., with the dummy meTypeset front matter) as input to ePub conversion: http://pkp-udev.lib.sfu.ca/manager/details/id/1769

axfelix commented 9 years ago

Hm, wait a second, looks like the .wmf -> .png conversion isn't firing for .doc uploads with embedded media right now. I wonder if libreoffice got upset during/after the apt update. It probably did.

axfelix commented 9 years ago

Just verified that PNG conversion is OK, but the images still don't seem to be making it into the ePub.
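A quick way to confirm this is to list the image entries inside the generated file, since an ePub is a zip archive. This is a rough sketch: the helper name `images_in_epub` and the set of extensions are assumptions, and the internal paths depend on how jats2epub lays out the package.

```python
import zipfile
from pathlib import PurePosixPath

# Extensions treated as images for this check (an assumption).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".svg"}


def images_in_epub(epub_path):
    """Return the sorted names of image entries in the epub archive."""
    with zipfile.ZipFile(epub_path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if PurePosixPath(name).suffix.lower() in IMAGE_SUFFIXES
        )
```

An empty list for a document with embedded figures means the images were lost before packaging, not merely left out of the manifest.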

crism commented 9 years ago

@axfelix, yes, the epub conversion is based on the meTypeset NLM XML. It was put in place before we started the merging… that could be moved. I’ll make sure the images are working, then open a ticket on that.

crism commented 9 years ago

Indeed, meTypeset relies on unoconv to convert the extracted images, and so had the same listener problem. I’ve added HOME=/tmp to global.php for both stages to attempt to address this.
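The effect of that setting can be sketched as follows, in Python rather than the PHP configuration. The helper names are hypothetical, and the reason it helps is presumably that LibreOffice needs a writable home directory for its profile, which the web server user may not have.

```python
import os
import subprocess


def unoconv_env(home="/tmp"):
    """Return a copy of the current environment with HOME overridden."""
    env = dict(os.environ)
    env["HOME"] = home
    return env


def convert_to_png(path, home="/tmp"):
    """Run unoconv on one image with HOME pointed at a writable directory.

    Assumes unoconv is installed and on PATH; raises CalledProcessError
    if the conversion fails.
    """
    return subprocess.run(
        ["unoconv", "-f", "png", str(path)],
        env=unoconv_env(home),
        check=True,
    )
```

Only the child process sees the overridden HOME; the environment of the calling process is left unchanged.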