inveniosoftware / rfcs

RFCs for Invenio.
https://rfcs.readthedocs.io
4 stars 15 forks source link

[Proposal] percent-encode filenames in URIs sent to front-end #88

Open martinobersteiner opened 2 months ago

martinobersteiner commented 2 months ago

Motivation

At TU Graz, one of our users attempted to upload a filename containing a # character to our invenio-instance. Upload of such filenames-containing-# currently fails. Other invenio-instances have the same issue: zenodo-rdm issue #981

Issue Investigation

To summarize the the code involved in the issue:

  1. in frontend, this javascript-function handles upload of files

    _startNewUpload = async (initializeUploadURL, file) => {
    let initializedFileMetadata;
    try {
      initializedFileMetadata = await this._initializeUpload(initializeUploadURL, file);
    } catch (error) {
      this._onError(file);
      return;
    }
    
    const startUploadURL = initializedFileMetadata.links.content;
    const commitFileURL = initializedFileMetadata.links.commit;
    try {
      await this._doUpload(startUploadURL, file);
      ...
    } catch(...) { ... }
    ...
    }
  2. URIs for file-uploads are built according to these URI-templates and then sent to frontend (note that key is the filename)
    class RDMFileDraftServiceConfig(FileServiceConfig, ConfiguratorMixin):
      ...
      file_links_item = {
          "self": FileLink("{+api}/records/{id}/draft/files/{+key}"),
          "content": FileLink("{+api}/records/{id}/draft/files/{+key}/content"),
          "commit": FileLink("{+api}/records/{id}/draft/files/{+key}/commit"),
          ...,
      }
  3. in the javascript-function, startUploadURL is set to the link built as per RDMFileDraftServiceConfig.file_links_item["content"]'s URI-template
  4. the then-called this._doUpload eventually calls axiosWithConfig.put(URLPassedToDoUpload)

To see the issue with filenames-containing-# let's walk through steps 1.-4. with a file named foo#spam.txt that a user attempts to upload to record-id asdfg-hjk42:

  1. this._initializeUpload works and the backend correctly initializes the file foo#spam.txt for record-id asdfg-hjk42
  2. backend builds the "content"-FileLink to "/api/records/asdfg-hjk42/draft/files/foo#spam.txt/content"
  3. this link is received by frontend and made available under initializedMetadata.links.content
  4. _doUpload(...) calls axiosWithConfig.put("/api/records/asdfg-hjk42/draft/files/foo#spam.txt/content"), but the outgoing request has everything after # cut off, causing the upload to fail at this point

Background Information

URI-templates

URI-templates (RFC 6570) are used by invenio to build URIs. In a URI-template, {key} signifies to replace itself with the value of key, percent-encoding some characters when doing so. The + in {+key} turns off percent-encoding (for some characters) when expanding key. e.g. for key = "foo#spam.text", {key} expands to foo%25spam.txt whereas {+key} expands to foo#spam.txt.

When to percent-encode?

URI-syntax is defined in RFC3986. In section 2.2, the opening paragraph dictates when to percent-encode characters: "If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed."

flask's percent-decoding behavior

The above dictate on when to percent-encode means that compliantly built URIs will usually have their delimiting characters percent-encoded. Hence, servers will often receive percent-encoded URIs and it would be sensible for frameworks to percent-decode them early on. flask does indeed do so: requests to basic flask app with/without percent-encoding of URIs Note that in the above image the outgoing GET request maintains percent-encoding, but the mirror-function receives the decoded URI either way. The above uses the versions of flask and werkzeug that invenio is currently on. However I did also test it with most recent versions of flask and werkzeug, the behavior is the same.

Further, stepping through flask-internals with a debugger shows that percent-decoding of requested URI:

Reason {+key} contains the + to begin with

invenio used to use {key} (without the +) in URI-templates earlier on. This was changed in invenio-records-resources' PR #535, which allows uploading filenames-with-/ by making two kinds of changes:

  1. change routes from "/files/<key>" to "/files/<path:key>"
  2. change URI-templates from "{+api}/records/{id}/files/{key}" to "{+api}/records/{id}/files/{+key}" (adding + in front of key)

(The + was also added to corresponding classes in invenio-rdm-records in another PR.)

Putting things together

As the filename is arguably one component of the URI, not percent-encoding it is non-compliant with RFC3986. This suggests we ought to percent-encode the filename (by dropping the + in front of the URI-template's key). In local testing this fixed uploading of filenames-containing-#.

Further, as URIs get percent-decoded by flask before any invenio-code is even executed, this difference isn't really observable to invenio backend code. Percent-encoding the filename-part of the URI thus shouldn't break anything.

As for the PR that introduced the + to the URI-template, I think that this addition was unnecessary to achieve its stated goal of fixing upload of filenames-containing-/. As stated above, that PR did two kinds of changes and I think that:

  1. changing routes from "/files/<key>" to "/files/<path:key>" was sufficient to fix uploading filenames-containing-/
  2. changing URI-templates from "{+api}/records/{id}/files/{key}" to "{+api}/records/{id}/files/{+key}" breaks uploading of filenames-containing-# for seemingly no gain

Sadly, I have no easy way of testing uploading of filenames-containing-/, so this would need to tested by someone with the setup for it...

Suggested Changes

Remove + in {+key} for all file-upload related URI-templates:

Related Issues (outside of this RFCs scope)

+ seems inconsistently used across invenio's URI-templates

For example, media-files already don't include + for their key in file-upload URI-templates (see here). Other issues caused by incorrect choice of {thing} vs {+thing} in URI-templates might be lurking in the codebase...

path: in routes makes some filenames match more than one route

Here's two routes after path: was added to them:

  1. "item": "/files/<path:key>"
  2. "item-content": "/files/<path:key>/content"

Now a file named something/content would match both:

  1. the "item" route with key = "something/content"
  2. the "item-content" route with key = "something"

content and commit (from the "item-commit" route) seem like somewhat generic names that may eventually pop up Seems lower priority to me, so this might be more of an FYI should you ever run into this...

tmorrell commented 2 months ago

Thanks for this great investigation. I'm not sure this is really an RFC, I consider this more of a very detailed bug report.

I think this is the root cause of https://github.com/inveniosoftware/invenio-app-rdm/issues/2679.

I agree with the conclusion that we should be percent encoding in the URI....but others should review since it's a significant change

ptamarit commented 1 month ago

I am really sorry, I did not have time yet to check if this proposal causes a regression on: https://github.com/zenodo/zenodo-rdm/issues/982