crism closed this issue 9 years ago
This requires:
- Another LibreOffice conversion (DocX → PDF, as input to CERMINE)
- A CERMINE call
- Stitching the XML outputs together: meTypeset body + back with CERMINE front matter
Looks like we should be using the standalone CERMINE jar via command-line? Or do we want to try to make a crossover call from PHP into Java? (Is that even possible?)
I don't think there's gonna be anything more sensible than exec()
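For what it's worth, a minimal sketch of the exec() route; the jar location and PDF path below are placeholders, not the actual stack layout:

```php
<?php
// Sketch: build the command line for the standalone CERMINE jar, then let
// exec() capture its XML output. The jar path and PDF path are placeholders.
function buildCermineCommand(string $jarPath, string $pdfPath): string
{
    return sprintf(
        'java -cp %s pl.edu.icm.cermine.PdfNLMContentExtractor -path %s',
        escapeshellarg($jarPath),
        escapeshellarg($pdfPath)
    );
}

$cmd = buildCermineCommand('cermine.jar', '/tmp/input.pdf');
// exec($cmd, $outputLines, $status); // $outputLines would hold the NLM XML
```

escapeshellarg() keeps us safe if user-supplied filenames ever reach this path.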
Integrate DocX→PDF into the DocxConversion module? Or make a new one? (Would need one new converter and one new queue job. Do we want to keep one of each in each module, or is it OK to double up?)
Good question. I think I'd probably lean toward making it part of the DocxConversion module so that we aren't potentially duplicating any LibreOffice env scaffolding in two different places in our stack? But I can't immediately think of whether calling one module two different ways at different points in the conversion will pose problems for the way we've built this (it shouldn't).
Let’s do that, then. DocxConversion\Model\Converter\Pdf and …\Queue\PdfJob, with associated test classes?
Perfect.
Dur. The existing unoconv path will do this, if we set the output filter type to pdf.
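Something like this, as a sketch with placeholder paths, assuming the stock unoconv CLI: the only change from the existing DocX invocation is the format argument.

```php
<?php
// Sketch: the same unoconv call used elsewhere in the stack, with the output
// format switched to pdf. Paths are placeholders, not the real module layout.
function buildUnoconvCommand(string $format, string $inputPath): string
{
    return sprintf(
        'unoconv -f %s %s',
        escapeshellarg($format),
        escapeshellarg($inputPath)
    );
}

$cmd = buildUnoconvCommand('pdf', '/tmp/document.docx');
// exec($cmd); // would write /tmp/document.pdf alongside the input
```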
Yup. Sorry I didn't think to say that explicitly when I was talking about making it part of the same "env scaffolding" (needlessly indirect language on my part) earlier.
We do need a new job class in the queue. Which now makes me feel a little guilty about shoehorning this into DocxConversion after all; this module was intended for making DocX, not for converting from it. We’re kind of hijacking it… I think we might be better off cloning it, now that I understand how it works better.
I'm equally leery of creating three different modules to support Cermine, but I think that might be more of a documentation issue than anything else.
It seems the approach so far has been to have very small modules, all tightly focused. CERMINE support will require word processing-to-PDF, but it seems like that would be a possibly generally-useful module, would it not? Or do we want to focus users on the existing PdfConversion, which converts to HTML, thence to PDF?
The existing PdfConversion is definitely further removed in scope from the existing DocxConversion, though, due to the difference between calling unoconv/libreoffice and calling wkhtmltopdf/any-future-xsl-fo-or-cassius-regions implementation. Maybe rename the existing PdfConversion module to finalPDFRender and make the new one librePDFConvert or something, and you can duplicate the existing DocxConversion almost 1:1 with a different parameter passed to uno?
That’s where I was heading… we need a better naming convention, I think, incorporating both input and output. I don’t want to change everything (at least not in the middle of a sprint!), but we can just call the new one as you suggest: WpPdfConversion, maybe?
Deeply sensible :)
I’ll do that, then.
@axfelix: Got a good test input PDF for CERMINE testing?
Current plan is to use CERMINE to extract the article content from PDF, then pass that along to the reintegration queue.
Below is the (abridged) result of running CERMINE on module/XmpConversion/test/assets/document.pdf. Is this what we want?
<article>
<front>
<journal-meta />
<article-meta>
<abstract>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris ultri…</p>
</abstract>
</article-meta>
</front>
<body>
<sec id="1">
<title>-</title>
<p>Mauris ipsum diam, iaculis blandit turpis eget, volutpat adipiscing diam…</p>
</sec>
</body>
<back>
<ref-list />
</back>
</article>
Define "abridged" -- the front matter should be populated (and pretty well)?
and w/r/t running it on text strings rather than PDFs -- that shouldn't be nearly as good, as it's trained to use typesetting cues (layout, word distance, etc.) to parse out different metadata elements.
@axfelix: “Abridged”: I chopped out most of the paragraphs, indicated by ellipses. Otherwise, that code block is exactly what I got.
And yeah, I changed my mind on the strings. I have it installed via Composer, and ready to go, as soon as I figure out what we want to do with it…
So you're saying you didn't get anything meaningful in article-meta, eh? That's not consistent with my results... let me dig up the way I was running it.
Using the example PDF files at the CERMINE home page gives better results for the article metadata. The example in XmpConversion doesn’t have anything to extract, I guess.
Okay, I just built from Cermine's github and it's working for me:
java -cp cermine.jar pl.edu.icm.cermine.PdfNLMContentExtractor -path ~/Desktop/test.pdf > ~/Desktop/cermineout.txt
input and output: http://pkp-udev.lib.sfu.ca/parsingdev/cermine/
Extraction works now, but finfo_file() reports that the MIME type of the result is text/x_c++, inexplicably. I don’t want to nerf the test, in case garbage ends up in there, but I haven’t found a way around this yet.
That's quite bizarre! Are you just catting the output to filename.XML?
Hrm; example1.pdf yields text/x_c++ as its extracted MIME type, but example[23].pdf give text/html. PHP has a messed-up magic number file, methinks. And yes, @axfelix, the output file is called .xml.
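One possible workaround, rather than disabling the test entirely: skip the MIME sniff and check that the output is well-formed XML with the expected root element. A sketch (looksLikeNlmXml is an invented helper name, not part of the stack):

```php
<?php
// Sketch: validate the extractor output directly instead of trusting
// finfo_file()'s magic-number guess. Helper name is invented for illustration.
function looksLikeNlmXml(string $xml): bool
{
    $previous = libxml_use_internal_errors(true); // silence parse warnings
    $doc = simplexml_load_string($xml);
    libxml_use_internal_errors($previous);
    return $doc !== false && $doc->getName() === 'article';
}
```

In the test, looksLikeNlmXml(file_get_contents($outputPath)) could replace the finfo_file() assertion, and it would still catch garbage output.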
I guess I’d better just disable that test. Sad.
Yeah.
We have a converter and a queue for content extraction. @axfelix, is your piece ready to wire up to the end of that pipe? (Hello, mixed metaphor!)
I'm about to give a talk here but I'll finish it afterward. You may want to look at my most recent commit on master if you have more time before then and you'd rather finish it up ... it's pretty half-done. I don't think I want to use the XSL approach after all; given the stitching, it's probably easier just catting a couple of files together.
I'm pretty tempted to chuck everything I wrote yesterday and just put in the equivalent of:
<?php
$cermine_output = file_get_contents('cermine.xml');
$metypeset_output = file_get_contents('metypeset.xml');
// keep everything before <body> from CERMINE (the front matter)
$cermine_front = preg_replace('/<body>.*/s', '', $cermine_output);
// drop everything through </front> from meTypeset (keep body and back)
$metypeset_body_and_back = preg_replace('/.*?<\/front>/s', '', $metypeset_output);
$munged_xml = $cermine_front . $metypeset_body_and_back;
but this is why I have a poor track record for high-level php engineering.
How do we drive that? (See my question about how these bits wire together—where are the prerequisites and dependencies given?)
The most sensible way is to work from the commits I started yesterday, minus the XSL, since (given that we're concatenating two documents anyway) I honestly think it'll be less trouble to permit the sin of parsing XML with regex. But right now that's just the clumsy beginnings of a fork of the existing PDF conversion module. I'll have a crack at finishing it ...
aaaaaaaaaaand just getting back to this.
OK, just looked over your stuff and it seems great (added a few minor comments) -- can you confirm for me that you've at this point added two basically-complete, standalone queues + modules, and all that's needed is the third one that I started, then wiring them all in (as opposed to writing some overarching structure for the three CERMINE-related calls)?
I believe that’s correct: there are two new queues, to which your third can be added.
I didn’t think the stage constant values were meaningful! With this tool, we have a fork in the toolchain, and options to offer users. With my local version (which is now working, sort of), I don’t see any UI for controlling the workflow—is that intended?
At this point, I think we're OK having the two forking paths "join up" (i.e., by merging CERMINE's front matter output with meTypeset's body and back, and using that XML for the HTML/PDF/XMP/ePub transformations afterward). Eventually, we still want to make use of CERMINE's body output in a more meaningful way (runtime comparison), but that should still be a non-interactive process until one of our dev partners steps up with a preliminary WYSIWYG interface.
However, there's a definite rationale for API calls that only utilize part of the workflow, for mid-stream input (e.g. if a user already has their own manually typeset XML and wants HTML/PDF/XMP/ePub) or returning mid-stream output (if a user has a doc or pdf and wants just the BibTeX containing the citations). This came up a few times last week when we mentioned our nascent OJS plugin (https://github.com/pkp/ojs-markup) and some folks expressed wanting to incorporate different workflows. This is of definite interest once we have baseline CERMINE and ePub modules working, probably second only to upstream meTypeset contributions to improve the validity of output (as well as the accuracy, once our test suite is fully operational). It does mean getting up to speed on OJS plugin code, though, and I'm not in too much of a hurry to pull you off of the stack work now that you have a good handle on it.
The stage constant values are addressed in the API documentation at the end of the current Github readme, btw.
I merged master in, so we can make sure #11 and #15 can coexist (and they seem to). All the unit tests pass. What else needs to be done?
My commits are still a little incomplete and I haven't quite found the headspace to polish them off properly yet (not enough PHP brain in the past few days to figure out how to properly take two different files as input to a module). If you want to have a look I certainly won't be offended but otherwise I'll try to finish it myself this week.
As per #11, I added manager logic to start the WP-to-PDF and CERMINE jobs.
The tough part is how to trigger the recombination job—since that depends on two jobs being complete. My best idea:
Pick one predecessor—whichever is likely to be slower?—and kick off the combination job. That job would monitor the database (gently) until both predecessors are confirmed complete, then actually start its work.
It would kind of need to do that, anyway, to find both of its inputs, since only a single input document is normally provided.
Does that make sense?
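As a sketch of that monitoring loop (the callback and job-id parameters here are invented for illustration; the real check would query the jobs table):

```php
<?php
// Sketch of the "kick off the merge job early, then wait" idea. The
// $isJobComplete callback stands in for a (gentle) database status query.
function waitForPredecessors(callable $isJobComplete, array $jobIds, int $timeoutSeconds = 600): bool
{
    $deadline = time() + $timeoutSeconds;
    while (time() < $deadline) {
        $pending = array_filter($jobIds, fn ($id) => !$isJobComplete($id));
        if (empty($pending)) {
            return true; // both predecessor outputs are ready; start merging
        }
        sleep(5); // poll gently rather than hammering the database
    }
    return false; // timed out; mark the merge job failed or requeue it
}
```

The timeout matters: without it, a failed predecessor would leave the merge job spinning forever, which looks a lot like the stalled merge queue described below.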
Thanks! That's about as far as my thinking had gotten, too ... I'll finally have a chance to finish my module tomorrow and I'll see if that way works in practice. I can't immediately see why it shouldn't!
Just took another pass at this, but handling the two different input file paths could use some work -- would you mind taking a look when you get a chance? In https://github.com/pkp/xmlps/commit/78616aa8e095fb153229fe0b795acc2928d48a7d. Appreciate it.
Specifically, I tried to crib some logic around outputTmpPath from the existing PDF conversion module, and I think I'm overloading setInputFile, but I'm not immediately clear on the best way to paste the existing metypeset.xml and cermine.xml outputs to a working directory based on the output of the prior modules (which I haven't touched). If you grep for every instance of those two calls, you should see everything that's currently ambiguous.
I’ll meditate on this. I’ll be out this evening, but may spend more time with it when I get back.
Commented in Slack, but not here. This worked, on the server, for a simple test, but stalled out on a more complicated one (eeg_comicsans). Curiously, the job status went back to “Pending,” instead of “Processing”—something is returning a 0 somewhere that shouldn’t. Still, all the individual pieces are working, and sometimes working when wired together.
I think the status was a canard. Still seeing the same results, locally, though. Trivial lorem ipsum (document.odt) gives:
| 2015-09-23 16:55:08 | 6 | Job 305 is completed |
| 2015-09-23 16:55:07 | 6 | Queued job (305) in queue zip |
| 2015-09-23 16:55:06 | 6 | Queued job (305) in queue merge |
| 2015-09-23 16:54:50 | 6 | Queued job (305) in queue cermine |
while a more complicated one, with references (eeg_comicsans.doc), gives:
| 2015-09-23 16:58:34 | 6 | Queued job (306) in queue merge |
| 2015-09-23 16:57:47 | 6 | Queued job (306) in queue cermine |
… it’s now been in the merge queue for five minutes, and is clearly going nowhere. Why?
Hey, this is definite progress! For the queue issue, though... merge shouldn't be running until after cermine does ... do we have it firing too early?
The database listing there is in reverse time order. The merge starts after cermine, then zip, then done.
Integrate https://github.com/CeON/CERMINE as a module.