Closed asmecher closed 4 years ago
(@mjordan, @michaeljoyce, @defstat, just a heads-up regarding the PLN...)
@asmecher thanks, this is definitely an improvement over what we have now. We'll need to do some work on the PN staging server side to not decode the XML, but I don't think it's much. @michaeljoyce will (also tagging @ubermichael since that's his other GH account).
@NateWr pointed out that I hadn't clarified how exactly the import/export plugin should be pointing to content it exports, when content is not to be embedded.
Here's how I think it should work:
@asmecher there might be a way to work around the choice of what binaries to encode in the XML. The plugin currently creates a Bag with the exported XML and some PN-specific metadata. The staging server then retrieves the Bag (its URL is in the SWORD message sent to the staging server), unpacks it, processes the contents, creates a new Bag, and registers the new Bag the the LOCKSS side of the pipeline.
BagIt includes a feature we have not taken advantage of called fetch.txt. This is a simple file containing the URLs of content that are not packaged in the Bag but that can be fetched by applications consuming the Bag. @ubermichael suggested that the plugin could be modified to not package up any galley content but instead just supply URLs in the Bag's fetch.txt file that the staging server could then retrieve.
This would allow the plugin to not use the export XML at all, other than for issue and article metadata. In the case of non-open content, an API key could be added to the Bag that the staging server could use to access the non-open content. Since the staging server creates its own Bag for long-term storage in the PN, the preserved package need not (and should not) contain any API keys.
By not using Export XML would lose the advantage when it comes to reimport on a trigger event, but that might actually be an advantage since it would remove the issue of future compatibility with the preserved XML. To reimport into OJS after a trigger event, we'd use whatever bulk import method was best at that time, e.g. via REST, regenerated Import XML version X, etc.
@mjordan, interesting, I wasn't aware of that. Once the additional capabilities for the native XML export plugin are added as decsribed here, the PLN plugin would be able to opt for that approach without needing changes outside the plugin code -- so if we decide to use fetch.txt
, then the changes described here will serve us well.
However, I think I'd prefer to just stick the files in the archive for now, as normal zip entries rather than Base64-encoding them into the XML. I'm pretty sure PHP can handle big zip archives without trouble; it's the cumbersome XML that's causing us problems. If the zip size turns out to be an issue, we can delegate to fetch.txt
. Sound OK?
I think either zipping up binaries and retrieving the resulting Bag or fetching each file via fetch.txt
achieves the same goal (not generating massive XML files), so I think either approach is better than what we do now.
But to clarify, do you still want to encode PDFs in the Export XML or write them out as binaries?
Sorry, didn't answer your "OK?" question - yes. Just want to clarify the PDF part.
No, the goal of this is to avoid embedding PDFs in XML -- they would be put in the ZIP, but outside the XML, instead.
@asmecher Here are some changes that add 2 flags to the NativeImportExportPlugin, for example:
php tools/importExport.php NativeImportExportPlugin export --no-embed --use-file-urls testdata/test.xml publicknowledge articles 8
I'm unsure about how things should work on the import side. Currently, the exports without the --use-file-urls flag seem to import correctly, but when the files are referenced by URL, the importing side encounters a not-authorized response (which currently gets saved in lieu of the intended file content).
https://github.com/pkp/pkp-lib/pull/5524 https://github.com/pkp/ojs/pull/2628
cc @defstat ^ we were talking about this the other day.
Thanks, @quoideneuf! I've merged the PRs to OJS stable-3_1_2
and cherry-picked forward to master
. (Heads-up @defstat that you'll need to rebase over this -- I don't think the changes are likely to conflict, though.)
@quoideneuf, could you port to OMP stable-3_1_2
as well? Then I'll cherry-pick forward to OMP master
as well.
@asmecher Here is the PR for OMP.
Thanks, @quoideneuf, all merged and cherry-picked!
Patch for OJS 3.1.0, OJS 3.1.1, and OJS 3.1.2: 5372.diff.txt
To apply:
cd /path/to/ojs
wget -O - -q "https://github.com/pkp/pkp-lib/files/4383251/5372.diff.txt" | patch -p1 --dry-run
If there are no FAILURES (a few offset notices are fine), then run it again without the --dry-run
option:
wget -O - -q "https://github.com/pkp/pkp-lib/files/4383251/5372.diff.txt" | patch -p1
Currently the XML import/export plugin relies on embedding Base64-encoded files directly in the XML (PDFs etc). This quickly causes the XML to become enormous, memory-intensive and hard to work with.
It would be better to support a variety of options, e.g. external hrefs. (I believe these are currently supported, at least in principle, for import -- but not for export.)
This in turn would permit the PLN plugin to work with zipping files into the deposit but outside the XML, which would probably increase reliability considerably.