inveniosoftware / invenio-previewer

Invenio module for previewing files.
https://invenio-previewer.readthedocs.io
MIT License
5 stars 60 forks source link

Upgrade PDFJS #203

Closed carlinmack closed 1 month ago

carlinmack commented 6 months ago

:heart: Thank you for your contribution!

Close #170 Close https://github.com/zenodo/ops/issues/438 Close https://github.com/zenodo/ops/issues/278

Description

Please describe briefly your pull request.

Upgrade PDF JS from 1.x to 3.x. This corrects the issue with displaying characters incorrectly and self printing PDFs

This is a large PR so here is how the files are structured:

The changes I have made are:

The files to be reviewed are:

image

Before and after (demonstrating that the character rendering issues have been fixed)

image

To do:

Shelve:

Checklist

Ticks in all boxes and 🟢 on all GitHub actions status checks are required to merge:

Frontend

Reminder

By using GitHub, you have already agreed to the GitHub’s Terms of Service including that:

  1. You license your contribution under the same terms as the current repository’s license.
  2. You agree that you have the right to license your contribution under the current repository’s license.
ntarocco commented 6 months ago

Thanks a lot for the work, this is not an easy one :)

About your comments:

Would be a lot better if this used code from the npm module directly: I have tried but getting the same problems as before

I think we should really explore this option, to avoid adding all the built file in the repo. I guess you have already tried it, but what was the issue when doing the following?

  1. in the webpack.py, include all the JS/CSS that you need, example:
    "pdfjs_worker_min_js": "./node_modules/pdfjs-dist/build/pdf.worker.min.mjs",
    "pdfjs_min_js": "./node_modules/pdfjs-dist/build/pdf.min.mjs",
  2. include them when rendering:
    def preview(file):
    ...
    return render_template(
        "invenio_previewer/pdfjs.html",
        file=file,
        js_bundles=current_previewer.js_bundles + ["pdfjs_min_js.js", "pdfjs_worker_min_js.js"],
        css_bundles=...,
    )

    This should expose a pdfjsLib JS object globally, in window.

  3. In your JS, use it:
    const document = pdfjsLib.getDocument(pdfUrl);
    ...

This should avoid building PDF.js with webpack.

And about:

Try upgrade to v4: I haven't tried but I think it's not worth delaying the PR over

Why did you start with v3, and not directly with v4?

ntarocco commented 4 months ago

Recap of our discussion on how to proceed. The main point is that we would like to remove all hardcoded static assets from the GH repo for the various obvious reasons. To achieve that, we should explore adding a new hook in the Webpack config that allows an Invenio module to define a set of files/dirs that will be copied from the assets folder to the static folder when building. This has been already done with TinyMCE here, but it is hardcoded.

In invenio-assets, we can inject Python configuration via the config.js file, which is generated on the file just before invoking the webpack build here.

This is an example of a config.json:

{
  "aliases": {
    ...
    "@scss/invenio_previewer": "scss/invenio_previewer"
  },
  "build": {
    "assetsPath": "/Users/nico/.pyenv/versions/3.9.16/envs/cdsvideos39/var/instance/static/dist",
    "assetsURL": "/static/dist/",
    "context": "/Users/nico/.pyenv/versions/3.9.16/envs/cdsvideos39/var/instance/assets",
    "debug": false,
    "staticPath": "/Users/nico/.pyenv/versions/3.9.16/envs/cdsvideos39/var/instance/static",
    "staticURL": "/static/"
  },
  "entry": {
    "adminlte": "./js/invenio_theme/admin.js",
    "base": "./js/invenio_theme/base.js",    
    ....
    "theme": "./scss/invenio_theme/theme.scss",
    "theme-admin": "./scss/invenio_theme/admin.scss"
  }
}

You can see an example how the various config are used in the Webpack config, for example here. If we need to add a new section, we will probably have to enhance the WebpackBundle class, and the various subclasses, and add the new params. The params are injected in the webpack.py in each module, for example here.

What we want to achieve is to make the CopyWebpackPlugin parametrizable via the config in the webpack.py, probably introducing a new dict object similarly to:

...
    copy: [{
       from: <path>,
       to: <path>
    }]

(naming and final object structure maybe different) And then that new config is injected and used in the Webpack config. Take into account that multiple modules might declare a copy section, so they should be appended to have a final global list. Note: the <path should be relative to <virtualenv>/var/instance. The user should not be allowed to copy from/to other locations.

Possible tasks:

  1. Implement the new hook and test it
  2. Move the hardcoded TinyMCE stuff probably to invenio-theme (or where it makes sense)
  3. Add the copy of PDF.js assets
max-moser commented 1 month ago

I've started an implementation of copy.{to,from} instructions based on (my understanding of) the comment by @ntarocco