pkp / pkp-lib

The library used by PKP's applications OJS, OMP and OPS, open source software for scholarly publishing.
https://pkp.sfu.ca
GNU General Public License v3.0
306 stars 445 forks source link

Extend XML native import/export plugin to support alternatives to embedding #5372

Closed asmecher closed 4 years ago

asmecher commented 4 years ago

Currently the XML import/export plugin relies on embedding Base64-encoded files directly in the XML (PDFs etc). This quickly causes the XML to become enormous, memory-intensive and hard to work with.

It would be better to support a variety of options, e.g. external hrefs. (I believe these are currently supported, at least in principle, for import -- but not for export.)

This in turn would permit the PLN plugin to work with zipping files into the deposit but outside the XML, which would probably increase reliability considerably.

asmecher commented 4 years ago

(@mjordan, @michaeljoyce, @defstat, just a heads-up regarding the PLN...)

mjordan commented 4 years ago

@asmecher thanks, this is definitely an improvement over what we have now. We'll need to do some work on the PN staging server side to not decode the XML, but I don't think it's much. @michaeljoyce will (also tagging @ubermichael since that's his other GH account).

asmecher commented 4 years ago

@NateWr pointed out that I hadn't clarified how exactly the import/export plugin should be pointing to content it exports, when content is not to be embedded.

Here's how I think it should work:

mjordan commented 4 years ago

@asmecher there might be a way to work around the choice of what binaries to encode in the XML. The plugin currently creates a Bag with the exported XML and some PN-specific metadata. The staging server then retrieves the Bag (its URL is in the SWORD message sent to the staging server), unpacks it, processes the contents, creates a new Bag, and registers the new Bag the the LOCKSS side of the pipeline.

BagIt includes a feature we have not taken advantage of called fetch.txt. This is a simple file containing the URLs of content that are not packaged in the Bag but that can be fetched by applications consuming the Bag. @ubermichael suggested that the plugin could be modified to not package up any galley content but instead just supply URLs in the Bag's fetch.txt file that the staging server could then retrieve.

This would allow the plugin to not use the export XML at all, other than for issue and article metadata. In the case of non-open content, an API key could be added to the Bag that the staging server could use to access the non-open content. Since the staging server creates its own Bag for long-term storage in the PN, the preserved package need not (and should not) contain any API keys.

By not using Export XML would lose the advantage when it comes to reimport on a trigger event, but that might actually be an advantage since it would remove the issue of future compatibility with the preserved XML. To reimport into OJS after a trigger event, we'd use whatever bulk import method was best at that time, e.g. via REST, regenerated Import XML version X, etc.

asmecher commented 4 years ago

@mjordan, interesting, I wasn't aware of that. Once the additional capabilities for the native XML export plugin are added as decsribed here, the PLN plugin would be able to opt for that approach without needing changes outside the plugin code -- so if we decide to use fetch.txt, then the changes described here will serve us well.

However, I think I'd prefer to just stick the files in the archive for now, as normal zip entries rather than Base64-encoding them into the XML. I'm pretty sure PHP can handle big zip archives without trouble; it's the cumbersome XML that's causing us problems. If the zip size turns out to be an issue, we can delegate to fetch.txt. Sound OK?

mjordan commented 4 years ago

I think either zipping up binaries and retrieving the resulting Bag or fetching each file via fetch.txt achieves the same goal (not generating massive XML files), so I think either approach is better than what we do now.

But to clarify, do you still want to encode PDFs in the Export XML or write them out as binaries?

mjordan commented 4 years ago

Sorry, didn't answer your "OK?" question - yes. Just want to clarify the PDF part.

asmecher commented 4 years ago

No, the goal of this is to avoid embedding PDFs in XML -- they would be put in the ZIP, but outside the XML, instead.

quoideneuf commented 4 years ago

@asmecher Here are some changes that add 2 flags to the NativeImportExportPlugin, for example:

php tools/importExport.php NativeImportExportPlugin export --no-embed --use-file-urls testdata/test.xml publicknowledge articles 8

I'm unsure about how things should work on the import side. Currently, the exports without the --use-file-urls flag seem to import correctly, but when the files are referenced by URL, the importing side encounters a not-authorized response (which currently gets saved in lieu of the intended file content).

https://github.com/pkp/pkp-lib/pull/5524 https://github.com/pkp/ojs/pull/2628

NateWr commented 4 years ago

cc @defstat ^ we were talking about this the other day.

asmecher commented 4 years ago

Thanks, @quoideneuf! I've merged the PRs to OJS stable-3_1_2 and cherry-picked forward to master. (Heads-up @defstat that you'll need to rebase over this -- I don't think the changes are likely to conflict, though.)

@quoideneuf, could you port to OMP stable-3_1_2 as well? Then I'll cherry-pick forward to OMP master as well.

quoideneuf commented 4 years ago

@asmecher Here is the PR for OMP.

https://github.com/pkp/omp/pull/769

asmecher commented 4 years ago

Thanks, @quoideneuf, all merged and cherry-picked!

asmecher commented 4 years ago

Patch for OJS 3.1.0, OJS 3.1.1, and OJS 3.1.2: 5372.diff.txt

To apply:

cd /path/to/ojs
wget -O - -q "https://github.com/pkp/pkp-lib/files/4383251/5372.diff.txt" | patch -p1 --dry-run

If there are no FAILURES (a few offset notices are fine), then run it again without the --dry-run option:

wget -O - -q "https://github.com/pkp/pkp-lib/files/4383251/5372.diff.txt" | patch -p1