matomo-org / matomo


CSV reports can fail because HTTP Content-Disposition header has invalid characters in the filename field #17209

Closed Geal closed 3 years ago

Geal commented 3 years ago

Hello, one of our clients uses matomo (I do not know which version exactly), and some HTTP responses fail when downloading reports, because of charset issues in the Content-Disposition header. Here's a hex dump of one of those responses:

00000000      43 6f 6e 74 65 6e 74 2d 44 69 73 70 6f 73 69 74         Content-Disposit
00000010      69 6f 6e 3a 20 61 74 74 61 63 68 6d 65 6e 74 3b         ion: attachment;
00000020      20 66 69 6c 65 6e 61 6d 65 3d 22 45 78 70 6f 72          filename="Expor
00000030      74 20 5f 20 4d 61 69 6e 20 6d 65 74 72 69 63 73         t _ Main metrics
00000040      20 5f 20 44 65 63 65 6d 62 65 72 20 31 33 2c 20          _ December 13,
00000050      32 30 32 30 20 e2 80 93 20 4a 61 6e 75 61 72 79         2020 – January
00000060      20 31 31 2c 20 32 30 32 31 2e 63 73 76 22 0d 0a          11, 2021.csv"..
00000070      54 72 61 6e 73 66 65 72 2d 45 6e 63 6f 64 69 6e         Transfer-Encodin
00000080      67 3a 20 63 68 75 6e 6b 65 64 0d 0a 43 6f 6e 74         g: chunked..Cont

Right after "2020", there is the – character, an en dash (U+2013) encoded as e2 80 93 in UTF-8.
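To make the mismatch concrete, here is a small Python sketch (Python is used for illustration only; Matomo itself is PHP) that decodes the three bytes from the dump under both charsets:

```python
# The three bytes that follow "2020 " in the hex dump above.
raw = bytes.fromhex("e28093")

# Interpreted as UTF-8, they form a single en dash (U+2013).
as_utf8 = raw.decode("utf-8")

# Interpreted as ISO-8859-1, the same bytes become three unrelated
# characters, two of which are C1 control characters.
as_latin1 = raw.decode("iso-8859-1")

print(repr(as_utf8), len(as_utf8))
print(repr(as_latin1), len(as_latin1))
```

A recipient that assumes ISO-8859-1 for the quoted filename would therefore not see an en dash at all.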

According to https://tools.ietf.org/html/rfc6266#section-4, when using the filename="" format, the name between double quotes should be in the ISO-8859-1 charset (https://tools.ietf.org/html/rfc2616#section-2.2), or in RFC 2047 format, like this: =?iso-8859-1?q?this=20is=20some=20text?= (for what it's worth, I have not seen anything in that format lately).

If the filename must include UTF-8 characters, it should use the filename* parameter instead, like this: filename*=UTF-8''%c2%a3%20and%20%e2%82%ac%20rates (see https://tools.ietf.org/html/rfc5987#section-3.2.2).
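A minimal sketch of what such a header could look like, written in Python for illustration (Matomo itself is PHP). The helper name content_disposition and the underscore fallback policy are assumptions made for this example, not Matomo code:

```python
from urllib.parse import quote


def content_disposition(filename: str) -> str:
    """Build an attachment header value with an ASCII fallback and filename*.

    Hypothetical helper for illustration; not part of Matomo.
    """
    # ASCII-only fallback for the quoted form: replace anything outside
    # printable ASCII, plus the double quote and backslash, with "_".
    fallback = "".join(
        ch if 32 <= ord(ch) < 127 and ch not in '"\\' else "_"
        for ch in filename
    )
    # RFC 5987 ext-value: charset, empty language tag, percent-encoded UTF-8.
    # safe="" leaves only letters, digits and "_.-~" unescaped.
    encoded = quote(filename, safe="", encoding="utf-8")
    return f'attachment; filename="{fallback}"; ' + f"filename*=UTF-8''{encoded}"


header = content_disposition("2020 \u2013 2021.csv")
print(header)
```

The resulting value contains only ASCII bytes, so it can be sent as-is; clients that understand filename* can use the UTF-8 name, and others can fall back to the plain filename.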

Unfortunately, I do not control this deployment of matomo, so my ability to test patches is limited, but I can request further information.

maybe related to #9580

Geal commented 3 years ago

the query was generated with a call to a URL with this format: https://domain/index.php?date=2020-12-13,2021-01-11&expanded=1&filter_limit=-1&format=CSV&format_metrics=1&idSite=1&language=en&method=API.get&module=API&period=day&token_auth=<token>&translateColumnNames=1

Findus23 commented 3 years ago

Hi,

I think the code responsible for this is the following: https://github.com/matomo-org/matomo/blob/c870770157a3e9c893308967dc274c8feac5d4be/core/DataTable/Renderer/Csv.php#L311-L321

It just generates a nice filename and then puts it into the header without caring about the right encoding.

If you have an idea how this could be fixed, it would be great if you could create a PR.

tsteur commented 3 years ago

> one of our clients uses matomo and some HTTP responses fail when downloading reports,

Hi @Geal does the download not work at all in this case? Do you know what browser is being used there or is it maybe some server that fetches the file?

Geal commented 3 years ago

Downloads failed because the HTTP response went through our reverse proxy, which rejects invalid headers. It is independent of the browser that is used; it can even be reproduced with a curl command.
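The exact rule the reverse proxy applies is not stated here. As a rough sketch, assuming it only accepts tab, space and visible ASCII in header values, the check could look like this in Python (the function name is hypothetical):

```python
def is_strict_ascii_header_value(value: bytes) -> bool:
    """Return True if every byte is a tab, a space, or visible ASCII."""
    return all(b == 0x09 or 0x20 <= b <= 0x7E for b in value)


# Header value from the hex dump, with the en dash sent as raw UTF-8 bytes.
reported = (
    'attachment; filename="Export _ Main metrics _ '
    'December 13, 2020 \u2013 January 11, 2021.csv"'
).encode("utf-8")

print(is_strict_ascii_header_value(reported))
print(is_strict_ascii_header_value(b'attachment; filename="report.csv"'))
```

Under that assumption, the reported header fails the check because of the bytes e2 80 93, while a plain ASCII filename passes.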

The UTF-8 data can be transformed properly with rawurlencode: https://stackoverflow.com/a/25704866

The code should know the actual encoding of the filename (ASCII, ISO-8859-1, UTF-8 or others) and specify it in the header. Are there any guarantees on the encoding used in Matomo?

tsteur commented 3 years ago

AFAIK the encoding should be UTF-8 or UTF-16 (there should be some parameter to request data in UTF-16). Since it seems like an easy fix, we will schedule this issue. Cheers @Geal

flamisz commented 3 years ago

fixed by #17276