Closed NateWr closed 3 years ago
I'm not finding a clear-cut way to do this that seems to be ideal across all languages. I've experimented with the Stringy library using edição-*&£$L<->/4/ch 丹尼爾 a دانيال1d line \n break.docx
as a test filename.
So far, the best result is using dasherize()
. That's what we're doing now, without the alphanumeric regex that was stripping out diacritics. Take a look at the Response Header.
\Stringy\Stringy::create($filename)->toLowerCase()->dasherize();
// Response Header
Content-Disposition | attachment; filename="edição-*&£$l<->/4/ch-丹尼�-a-دا��ا�1d-line-\n-break.docx"
// Filename used in Firefox save dialog
edição- &£$l - _4_ch-丹尼爾-a-دانيال1d-line-n-break.docx
Is this going to cause a problem for some browsers? What about filenames with /
in them? We can strip that out with a regex, but are there other characters we should be wary of? cc @asmecher
I spent some time today writing a test to prevent regressions with this, but having trouble getting Cypress to read the response header correctly. It seems to think that the two identical strings are not equal, possibly because the string when rendered drops some characters or something. Work at https://github.com/NateWr/pkp-lib/tree/i6898_filenames
@NateWr, the transcoding to ASCII was a simple and extremely blunt way to solve the problem of risky filenames... I worry that cobbling together a less blunt solution without a good piece of widely-used code that already solves that problem will open us up to bad behaviour. Is the solution currently worse than the problem? Does Flysystem have filename escaping/filtering tools?
@asmecher the approach we're using now with Stringy has the unfortunate side-effect of dropping letters with diacritics. That can change the meaning of a filename from one thing to another, so pretty bad.
Stringy's toAscii()
method transcodes the diacritics to their non-diacritic form, but it completely drops the Chinese and Arabic characters in my test string, so I think it's not usable for us.
@janiosarmento shared a mapping that he's used before. I found that this transcoded the diacritics correctly and left the Chinese and Arabic characters in place. But we'd still need to strip the unsafe characters ourselves and of course maintain this mapping over time.
I've just noticed that Stringy is unmaintained. It looks like there is an active fork we could try. Also, Symphony has introduced a String component. Not sure if that can be used on its own.
@NateWr, sorry, I misunderstood -- I had thought just the accents were stripped, not the entire character. And yes, that doesn't consider Chinese/Arabic.
I'm 100% sure we're going to want a reputable third-party implementation for this. In terms of robustness, we probably couldn't do better than Wordpress's implementation; however, that's not available outside the codebase, unless someone has packaged it (I haven't looked). Alternately, something like indiehd/filename-sanitizer could be promising. And I haven't looked, but it's entirely possible Flysystem will have something too.
I didn't find anything quite right in Flysystem. They have a normalizePath
method which looks like it doesn't really strip bad characters. This comes in the response header:
Content-Disposition | attachment; filename="edição-*&£$L<->/4/ch 丹尼� a دا��ا�1d line /n break.xml"
And my browser (firefox) converts it to:
edição- &£$L - _4_ch 丹尼爾 a دانيال1d line _n break.xml
I tried using the slugger that is part of Symfony's String
component, I get the following result:
edicao-GBP-L-4-ch-dan-ni-er-a-danyal1d-line-n-break-xml
That seems to transpose the Chinese/Arabic characters to Latin without dropping them. (I can't attest to the accuracy of the Chinese latin-ization, but the Arabic دانيال
does sound roughly equivalent to danyal
.)
So far that is the closest thing that I've found. But I can't say that I'd be comfortable normalizing on Latin characters like this, unless there is a solid technical reason to do so (eg - characters caused problems in browsers or after download or something).
Taking a step back
It's worth noting that we're not talking about file names on the filesystem. We store files under safe, random names. The only problem we're dealing with now is what the file should be named when it is downloaded in the browser.
Maybe we can/should take a more relaxed approach to this? I tried leaving the test filename as-is and my browser (Firefox) seemed to do some sanitization itself in the save dialog:
edição- & £$L - _4_ch 丹尼爾 a دانيال1d line n break.xml
When I tried to do the same in Chromium it replaced unsafe characters with _
:
edição-_&_£$L_-__4_ch 丹尼爾 a دانيال1d line n break.xml
Maybe we should just leave the filenames alone? Not even dasherize them (dashes-between-words)?
Maybe we should just leave the filenames alone? Not even dasherize them (dashes-between-words)?
This is how Google Drive offers me to download a file named as фю/?:_.js
content-disposition: attachment;filename="__/?:_.js";filename*=UTF-8''%D1%84%D1%8E%2F%3F%3A_.js
Notice the extra filename*
and the codification difference against filename
.
I think it's safer to assume we don't know the user's OS and forward path validations to the browser =]
The filename*
looks relevant. From https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition:
The parameters filename and filename differ only in that filename uses the encoding defined in RFC 5987
RFC 5987 is extremely relevant:
By default, message header field parameters in Hypertext Transfer Protocol (HTTP) messages cannot carry characters outside the ISO-8859-1 character set. RFC 2231 defines an encoding mechanism for use in Multipurpose Internet Mail Extensions (MIME) headers. This document specifies an encoding suitable for use in HTTP header fields that is compatible with a profile of the encoding defined in RFC 2231.
So I think as long as we're encoding filenames per RFC5987, and using filename*
rather than filename
in the header, we're good.
I don't know how to interpret that RFC. What are the conditions that the filename must comply with?
@NateWr, try e.g.:
diff --git a/classes/services/PKPFileService.inc.php b/classes/services/PKPFileService.inc.php
index 9337924ae..f7988991b 100644
--- a/classes/services/PKPFileService.inc.php
+++ b/classes/services/PKPFileService.inc.php
@@ -150,7 +150,7 @@ class PKPFileService {
header("Content-Type: $mimetype");
header("Content-Length: $filesize");
header('Accept-Ranges: none');
- header('Content-Disposition: ' . ($inline ? 'inline' : 'attachment') . "; filename=\"$filename\"");
+ header('Content-Disposition: ' . ($inline ? 'inline' : 'attachment') . "; filename*=UTF-8''" . urlencode($fileName));
header('Cache-Control: private'); // Workarounds for IE weirdness
header('Pragma: public');
(The relevant part of the RFC is...
Extended parameters are those where the left-hand side of the assignment ends with an asterisk character.
The value part of an extended parameter (ext-value) is a token that consists of three parts: the REQUIRED character set name (charset), the OPTIONAL language information (language), and a character sequence representing the actual value (value-chars), separated by single quote characters. Note that both character set names and language tags are restricted to the US-ASCII character set, and are matched case-insensitively (see [RFC2978], Section 2.3 and [RFC5646], Section 2.1.1).
Inside the value part, characters not contained in attr-char are encoded into an octet sequence using the specified character set. That octet sequence is then percent-encoded as specified in Section 2.1 of [RFC3986].
Producers MUST use either the "UTF-8" ([RFC3629]) or the "ISO-8859-1" ([ISO-8859-1]) character set. Extension character sets (mime- charset) are reserved for future use.
Note: recipients should be prepared to handle encoding errors, such as malformed or incomplete percent escape sequences, or non- decodable octet sequences, in a robust manner. This specification does not mandate any specific behavior, for instance, the following strategies are all acceptable:
...)
Yes, that use of single-quote characters as separators is some weeeeeeeird syntax, but it's in the RFC.
Thanks @asmecher! Ok, so I've left filenames as they are in the UI, and used filename*=
and urlencode()
in the response headers.
PRs: https://github.com/pkp/pkp-lib/pull/6930 https://github.com/pkp/citationStyleLanguage/pull/77
Tests: https://github.com/pkp/ojs/pull/3100
@amandastevens a heads-up in case you get questions about this from users. This is going to change how filenames appear in the next maintenance release (3.3.0-6). So for people that were on 3.x, you may see questions about how the names of submission files that are downloaded have changed.
In 3.x, filenames like this:
edição-*&£$L<->/4/ch 丹尼爾 a دانيال1d line \n break.docx
Nathan Wright Submission.docx
Would have been downloaded as:
edio-l-4ch-a-1d-line-n-break.docx
nathan-wright-submission.docx
Now they will be downloaded as:
edição-£$L - _4_ch+丹尼爾+a+دانيال1d+line+_n+break.docx
Nathan+Wright+Submission.docx
This change supports a wider range of character sets (eg - non-Latin languages, and Latin languages with diacritics).
Merged to stable-3_3_0
and cherry-picked to stable-3_3_1
and main
.
The following PR tries to reduce test failure due to race conditions.
@asmecher hopefully this reduces test failures but if it continues to be a problem feel free to yank the test and I can work on it next week.
Arg! The new syntax, filename*=UTF-8''"
, doesn't seem to be working correctly in most browsers. I have confirmed the bug in Chrome and another user has confirmed it in a number of browsers.
In the following screenshot, you can see that the response headers look correct, but Chromium is downloading the file using the last path fragment of the URL (eg - /download/mwandenga/13
is downloaded as 13
). It also doesn't give me an option to "Save As", but maybe that is typical Chromium behaviour.
@asmecher does anything obvious jump out at you based on the response headers in this screenshot? If not, I'll investigate shortly. I've reopened this and scheduled it for 3.3.0-7.
Ok, I noticed that the Google Docs example that @jonasraoni provided uses both filename=
and filename*=UTF-8
in the Content-Disposition
header. We were just using filename*=UTF-8
. I changed this to use both and it seems to fix the issue in Chrome.
@asmecher does this seem correct based on your reading of the RFC? Or by using filename=
are we just making the filename*=UTF-8
part useless?
From RFC5987 section 4.2:
This specification suggests that a parameter using the extended syntax takes precedence. This would allow producers to use both formats without breaking recipients that do not understand the extended syntax yet.
So we should be OK to use both filename*
and filename
, with filename*
taking precedence in clients that support both.
Thanks, merged to stable-3_3_0
and cherry-picked to stable-3_3_1
and main
.
Describe the bug When downloading a submission file with a filename that includes diacritics, the filename removes the letters completely. A file named
example-edição.xml
will be downloaded asexample-edio.xml
.To Reproduce Steps to reproduce the behavior:
example-edição.xml
to submission files.What application are you using? 3.3.0-4
Additional information See original report.