Open martinobersteiner opened 2 months ago
Thanks for this great investigation. I'm not sure this is really an RFC, I consider this more of a very detailed bug report.
I think this is the root cause of https://github.com/inveniosoftware/invenio-app-rdm/issues/2679.
I agree with the conclusion that we should be percent encoding in the URI....but others should review since it's a significant change
I am really sorry, I did not have time yet to check if this proposal causes a regression on: https://github.com/zenodo/zenodo-rdm/issues/982
Motivation
At TU Graz, one of our users attempted to upload a filename containing a
#
character to our invenio-instance. Upload of such filenames-containing-#
currently fails. Other invenio-instances have the same issue: zenodo-rdm issue #981Issue Investigation
To summarize the the code involved in the issue:
in frontend, this javascript-function handles upload of files
key
is the filename)startUploadURL
is set to the link built as perRDMFileDraftServiceConfig.file_links_item["content"]
's URI-templatethis._doUpload
eventually callsaxiosWithConfig.put(URLPassedToDoUpload)
To see the issue with filenames-containing-
#
let's walk through steps 1.-4. with a file namedfoo#spam.txt
that a user attempts to upload to record-idasdfg-hjk42
:this._initializeUpload
works and the backend correctly initializes the filefoo#spam.txt
for record-idasdfg-hjk42
"content"
-FileLink to"/api/records/asdfg-hjk42/draft/files/foo#spam.txt/content"
initializedMetadata.links.content
_doUpload(...)
callsaxiosWithConfig.put("/api/records/asdfg-hjk42/draft/files/foo#spam.txt/content")
, but the outgoing request has everything after#
cut off, causing the upload to fail at this pointBackground Information
URI-templates
URI-templates (RFC 6570) are used by invenio to build URIs. In a URI-template,
{key}
signifies to replace itself with the value ofkey
, percent-encoding some characters when doing so. The+
in{+key}
turns off percent-encoding (for some characters) when expandingkey
. e.g. forkey = "foo#spam.text"
,{key}
expands tofoo%25spam.txt
whereas{+key}
expands tofoo#spam.txt
.When to percent-encode?
URI-syntax is defined in RFC3986. In section 2.2, the opening paragraph dictates when to percent-encode characters: "If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed."
flask
's percent-decoding behaviorThe above dictate on when to percent-encode means that compliantly built URIs will usually have their delimiting characters percent-encoded. Hence, servers will often receive percent-encoded URIs and it would be sensible for frameworks to percent-decode them early on.
flask
does indeed do so: Note that in the above image the outgoingGET
request maintains percent-encoding, but themirror
-function receives the decoded URI either way. The above uses the versions offlask
andwerkzeug
that invenio is currently on. However I did also test it with most recent versions offlask
andwerkzeug
, the behavior is the same.Further, stepping through
flask
-internals with a debugger shows that percent-decoding of requested URI:@flask_app.route("/route/<key>")
applies to the requested URI)Reason
{+key}
contains the+
to begin withinvenio used to use
{key}
(without the+
) in URI-templates earlier on. This was changed ininvenio-records-resources
' PR #535, which allows uploading filenames-with-/
by making two kinds of changes:"/files/<key>"
to"/files/<path:key>"
"{+api}/records/{id}/files/{key}"
to"{+api}/records/{id}/files/{+key}"
(adding+
in front ofkey
)(The
+
was also added to corresponding classes ininvenio-rdm-records
in another PR.)Putting things together
As the filename is arguably one component of the URI, not percent-encoding it is non-compliant with RFC3986. This suggests we ought to percent-encode the filename (by dropping the
+
in front of the URI-template'skey
). In local testing this fixed uploading of filenames-containing-#
.Further, as URIs get percent-decoded by
flask
before any invenio-code is even executed, this difference isn't really observable to invenio backend code. Percent-encoding the filename-part of the URI thus shouldn't break anything.As for the PR that introduced the
+
to the URI-template, I think that this addition was unnecessary to achieve its stated goal of fixing upload of filenames-containing-/
. As stated above, that PR did two kinds of changes and I think that:"/files/<key>"
to"/files/<path:key>"
was sufficient to fix uploading filenames-containing-/
"{+api}/records/{id}/files/{key}"
to"{+api}/records/{id}/files/{+key}"
breaks uploading of filenames-containing-#
for seemingly no gainSadly, I have no easy way of testing uploading of filenames-containing-
/
, so this would need to tested by someone with the setup for it...Suggested Changes
Remove
+
in{+key}
for all file-upload related URI-templates:invenio-records-resources
it's these linesinvenio-rdm-records
it's these linesRelated Issues (outside of this RFCs scope)
+
seems inconsistently used across invenio's URI-templatesFor example,
media-files
already don't include+
for theirkey
in file-upload URI-templates (see here). Other issues caused by incorrect choice of{thing}
vs{+thing}
in URI-templates might be lurking in the codebase...path:
in routes makes some filenames match more than one routeHere's two routes after
path:
was added to them:"item": "/files/<path:key>"
"item-content": "/files/<path:key>/content"
Now a file named
something/content
would match both:"item"
route withkey = "something/content"
"item-content"
route withkey = "something"
content
andcommit
(from the"item-commit"
route) seem like somewhat generic names that may eventually pop up Seems lower priority to me, so this might be more of an FYI should you ever run into this...