Diacritics are stripped from filenames when downloading

NateWr commented 3 years ago

Describe the bug When downloading a submission file with a filename that includes diacritics, the filename removes the letters completely. A file named example-edição.xml will be downloaded as example-edio.xml.

To Reproduce Steps to reproduce the behavior:

Go to submission workflow and upload a file named example-edição.xml to submission files.
Download that file.
See the filename

What application are you using? 3.3.0-4

Additional information See original report.

NateWr commented 3 years ago

I'm not finding a clear-cut way to do this that seems to be ideal across all languages. I've experimented with the Stringy library using edição-*&£$L<->/4/ch 丹尼爾 a دانيال1d line \n break.docx as a test filename.

So far, the best result is using dasherize(). That's what we're doing now, without the alphanumeric regex that was stripping out diacritics. Take a look at the Response Header.

\Stringy\Stringy::create($filename)->toLowerCase()->dasherize();

// Response Header
Content-Disposition | attachment; filename="ediÃ§Ã£o-*&Â£$l<->/4/ch-ä¸¹å°¼ç�¾-a-Ø¯Ø§Ù�Ù�Ø§Ù�1d-line-\n-break.docx"

// Filename used in Firefox save dialog
edição- &£$l - _4_ch-丹尼爾-a-دانيال1d-line-n-break.docx

Is this going to cause a problem for some browsers? What about filenames with / in them? We can strip that out with a regex, but are there other characters we should be wary of? cc @asmecher

NateWr commented 3 years ago

I spent some time today writing a test to prevent regressions with this, but having trouble getting Cypress to read the response header correctly. It seems to think that the two identical strings are not equal, possibly because the string when rendered drops some characters or something. Work at https://github.com/NateWr/pkp-lib/tree/i6898_filenames

asmecher commented 3 years ago

@NateWr, the transcoding to ASCII was a simple and extremely blunt way to solve the problem of risky filenames... I worry that cobbling together a less blunt solution without a good piece of widely-used code that already solves that problem will open us up to bad behaviour. Is the solution currently worse than the problem? Does Flysystem have filename escaping/filtering tools?

NateWr commented 3 years ago

@asmecher the approach we're using now with Stringy has the unfortunate side-effect of dropping letters with diacritics. That can change the meaning of a filename from one thing to another, so pretty bad.

Stringy's toAscii() method transcodes the diacritics to their non-diacritic form, but it completely drops the Chinese and Arabic characters in my test string, so I think it's not usable for us.

@janiosarmento shared a mapping that he's used before. I found that this transcoded the diacritics correctly and left the Chinese and Arabic characters in place. But we'd still need to strip the unsafe characters ourselves and of course maintain this mapping over time.

I've just noticed that Stringy is unmaintained. It looks like there is an active fork we could try. Also, Symphony has introduced a String component. Not sure if that can be used on its own.

asmecher commented 3 years ago

@NateWr, sorry, I misunderstood -- I had thought just the accents were stripped, not the entire character. And yes, that doesn't consider Chinese/Arabic.

I'm 100% sure we're going to want a reputable third-party implementation for this. In terms of robustness, we probably couldn't do better than Wordpress's implementation; however, that's not available outside the codebase, unless someone has packaged it (I haven't looked). Alternately, something like indiehd/filename-sanitizer could be promising. And I haven't looked, but it's entirely possible Flysystem will have something too.

NateWr commented 3 years ago

I didn't find anything quite right in Flysystem. They have a normalizePath method which looks like it doesn't really strip bad characters. This comes in the response header:

Content-Disposition | attachment; filename="ediÃ§Ã£o-*&Â£$L<->/4/ch ä¸¹å°¼ç�¾ a Ø¯Ø§Ù�Ù�Ø§Ù�1d line /n break.xml"

And my browser (firefox) converts it to:

edição- &£$L - _4_ch 丹尼爾 a دانيال1d line _n break.xml

I tried using the slugger that is part of Symfony's String component, I get the following result:

edicao-GBP-L-4-ch-dan-ni-er-a-danyal1d-line-n-break-xml

That seems to transpose the Chinese/Arabic characters to Latin without dropping them. (I can't attest to the accuracy of the Chinese latin-ization, but the Arabic دانيال does sound roughly equivalent to danyal.)

So far that is the closest thing that I've found. But I can't say that I'd be comfortable normalizing on Latin characters like this, unless there is a solid technical reason to do so (eg - characters caused problems in browsers or after download or something).

Taking a step back

It's worth noting that we're not talking about file names on the filesystem. We store files under safe, random names. The only problem we're dealing with now is what the file should be named when it is downloaded in the browser.

Maybe we can/should take a more relaxed approach to this? I tried leaving the test filename as-is and my browser (Firefox) seemed to do some sanitization itself in the save dialog:

edição- & £$L - _4_ch 丹尼爾 a دانيال1d line n break.xml

When I tried to do the same in Chromium it replaced unsafe characters with _:

edição-_&_£$L_-__4_ch 丹尼爾 a دانيال1d line n break.xml

Maybe we should just leave the filenames alone? Not even dasherize them (dashes-between-words)?

jonasraoni commented 3 years ago

Maybe we should just leave the filenames alone? Not even dasherize them (dashes-between-words)?

This is how Google Drive offers me to download a file named as фю/?:_.js

content-disposition: attachment;filename="__/?:_.js";filename*=UTF-8''%D1%84%D1%8E%2F%3F%3A_.js

Notice the extra filename* and the codification difference against filename.

I think it's safer to assume we don't know the user's OS and forward path validations to the browser =]

asmecher commented 3 years ago

The filename* looks relevant. From https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition:

The parameters filename and filename differ only in that filename uses the encoding defined in RFC 5987

RFC 5987 is extremely relevant:

By default, message header field parameters in Hypertext Transfer Protocol (HTTP) messages cannot carry characters outside the ISO-8859-1 character set. RFC 2231 defines an encoding mechanism for use in Multipurpose Internet Mail Extensions (MIME) headers. This document specifies an encoding suitable for use in HTTP header fields that is compatible with a profile of the encoding defined in RFC 2231.

So I think as long as we're encoding filenames per RFC5987, and using filename* rather than filename in the header, we're good.

NateWr commented 3 years ago

I don't know how to interpret that RFC. What are the conditions that the filename must comply with?

asmecher commented 3 years ago

@NateWr, try e.g.:

diff --git a/classes/services/PKPFileService.inc.php b/classes/services/PKPFileService.inc.php
index 9337924ae..f7988991b 100644
--- a/classes/services/PKPFileService.inc.php
+++ b/classes/services/PKPFileService.inc.php
@@ -150,7 +150,7 @@ class PKPFileService {
        header("Content-Type: $mimetype");
        header("Content-Length: $filesize");
        header('Accept-Ranges: none');
-       header('Content-Disposition: ' . ($inline ? 'inline' : 'attachment') . "; filename=\"$filename\"");
+       header('Content-Disposition: ' . ($inline ? 'inline' : 'attachment') . "; filename*=UTF-8''" . urlencode($fileName));
        header('Cache-Control: private'); // Workarounds for IE weirdness
        header('Pragma: public');

(The relevant part of the RFC is...

Extended parameters are those where the left-hand side of the assignment ends with an asterisk character.

The value part of an extended parameter (ext-value) is a token that consists of three parts: the REQUIRED character set name (charset), the OPTIONAL language information (language), and a character sequence representing the actual value (value-chars), separated by single quote characters. Note that both character set names and language tags are restricted to the US-ASCII character set, and are matched case-insensitively (see [RFC2978], Section 2.3 and [RFC5646], Section 2.1.1).

Inside the value part, characters not contained in attr-char are encoded into an octet sequence using the specified character set. That octet sequence is then percent-encoded as specified in Section 2.1 of [RFC3986].

Producers MUST use either the "UTF-8" ([RFC3629]) or the "ISO-8859-1" ([ISO-8859-1]) character set. Extension character sets (mime- charset) are reserved for future use.

Note: recipients should be prepared to handle encoding errors, such as malformed or incomplete percent escape sequences, or non- decodable octet sequences, in a robust manner. This specification does not mandate any specific behavior, for instance, the following strategies are all acceptable:

...)

Yes, that use of single-quote characters as separators is some weeeeeeeird syntax, but it's in the RFC.

NateWr commented 3 years ago

Thanks @asmecher! Ok, so I've left filenames as they are in the UI, and used filename*= and urlencode() in the response headers.

PRs: https://github.com/pkp/pkp-lib/pull/6930 https://github.com/pkp/citationStyleLanguage/pull/77

Tests: https://github.com/pkp/ojs/pull/3100

@amandastevens a heads-up in case you get questions about this from users. This is going to change how filenames appear in the next maintenance release (3.3.0-6). So for people that were on 3.x, you may see questions about how the names of submission files that are downloaded have changed.

In 3.x, filenames like this:

edição-*&£$L<->/4/ch 丹尼爾 a دانيال1d line \n break.docx
Nathan Wright Submission.docx

Would have been downloaded as:

edio-l-4ch-a-1d-line-n-break.docx
nathan-wright-submission.docx

Now they will be downloaded as:

edição-£$L - _4_ch+丹尼爾+a+دانيال1d+line+_n+break.docx
Nathan+Wright+Submission.docx

This change supports a wider range of character sets (eg - non-Latin languages, and Latin languages with diacritics).

NateWr commented 3 years ago

Merged to stable-3_3_0 and cherry-picked to stable-3_3_1 and main.

NateWr commented 3 years ago

The following PR tries to reduce test failure due to race conditions.

PR: https://github.com/pkp/pkp-lib/pull/6946

Tests: https://github.com/pkp/ojs/pull/3105

NateWr commented 3 years ago

@asmecher hopefully this reduces test failures but if it continues to be a problem feel free to yank the test and I can work on it next week.

NateWr commented 3 years ago

Arg! The new syntax, filename*=UTF-8''", doesn't seem to be working correctly in most browsers. I have confirmed the bug in Chrome and another user has confirmed it in a number of browsers.

In the following screenshot, you can see that the response headers look correct, but Chromium is downloading the file using the last path fragment of the URL (eg - /download/mwandenga/13 is downloaded as 13). It also doesn't give me an option to "Save As", but maybe that is typical Chromium behaviour.

Screenshot 2021-05-20 09:54:59

@asmecher does anything obvious jump out at you based on the response headers in this screenshot? If not, I'll investigate shortly. I've reopened this and scheduled it for 3.3.0-7.

NateWr commented 3 years ago

Ok, I noticed that the Google Docs example that @jonasraoni provided uses both filename= and filename*=UTF-8 in the Content-Disposition header. We were just using filename*=UTF-8. I changed this to use both and it seems to fix the issue in Chrome.

@asmecher does this seem correct based on your reading of the RFC? Or by using filename= are we just making the filename*=UTF-8 part useless?

PR: https://github.com/pkp/pkp-lib/pull/7070

Tests: https://github.com/pkp/ojs/pull/3133

asmecher commented 3 years ago

From RFC5987 section 4.2:

This specification suggests that a parameter using the extended syntax takes precedence. This would allow producers to use both formats without breaking recipients that do not understand the extended syntax yet.

So we should be OK to use both filename* and filename, with filename* taking precedence in clients that support both.

NateWr commented 3 years ago

Thanks, merged to stable-3_3_0 and cherry-picked to stable-3_3_1 and main.

pkp / pkp-lib

Diacritics are stripped from filenames when downloading #6898