matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0

Matomo JS download-tracking sends entire data URL in payload, when using a `download` attribute on a link with blob `href` #20510

Open sjdemartini opened 1 year ago

sjdemartini commented 1 year ago

If you have an anchor element with a download attribute, and outbound link-tracking / download-tracking is enabled, Matomo will always track clicks on those as downloads (see https://github.com/matomo-org/matomo/issues/19708). That is fine in general. However, if the href is a data URL, used to download a file encoded directly into the HTML, Matomo sends the entire data URL in the payload to the Matomo server. This can be very expensive for the client to send, and it is not useful for tracking.

Expected Behavior

When clicking on an <a /> element that has a download attribute and a data:... URL as its href, Matomo should not track that download by passing the entire data URL to the Matomo server. Either the download should be ignored by Matomo's JS and not tracked, or the download tracking should omit the contents of the data URL.

You should not need to manually add a "config ignore" class to these `<a>` elements whenever you have a data URL. In some cases, this is not possible, such as when external libraries create the anchor elements. (For instance, Highcharts uses a download anchor with a data URL to let users download images of their charts, and developers using Highcharts do not control that anchor.)

Current Behavior

Matomo sends the entire contents of the href data URL to the Matomo server. The download doesn't seem to show up on the Downloads page in Matomo either, presumably because the Matomo server cannot parse it. So the request is entirely a waste of bandwidth and resources.

Possible Solution

Ignore tracking links if the href begins with data:. (Presumably that would go somewhere in the getLinkType function.)
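As a rough illustration of that check, here is a minimal sketch. The helper names `isDataUrl` and `shouldTrackLink` are hypothetical and are not part of Matomo's tracker; where exactly such a check would be wired into `getLinkType` is an assumption left open here.

```javascript
// Hypothetical helper, NOT part of Matomo's tracker code:
// reports whether an href uses the data: scheme.
function isDataUrl(href) {
  if (typeof href !== 'string') {
    return false;
  }
  // URL schemes are case-insensitive, and an attribute value may carry
  // leading whitespace, so allow for both.
  return /^\s*data:/i.test(href);
}

// Hypothetical wrapper showing the intended decision: skip automatic
// download/outlink tracking for data URLs, keep it for everything else.
function shouldTrackLink(href) {
  return !isDataUrl(href);
}
```

With this sketch, `shouldTrackLink('data:image/gif;base64,R0lG...')` is false, while an ordinary `https://` download link is still tracked.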

Steps to Reproduce (for Bugs)

  1. Enable outbound link-tracking and download-tracking with the client using Matomo (docs): _paq.push(['enableLinkTracking']);
  2. Add a link within your client page which uses a Data URL as its href and sets the download attribute:

    <a
      href="data:image/gif;base64,R0lGODlhEAAQAMQAAORHHOVSKudfOulrSOp3WOyDZu6QdvCchPGolfO0o/XBs/fNwfjZ0frl3/zy7////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACH5BAkAABAALAAAAAAQABAAAAVVICSOZGlCQAosJ6mu7fiyZeKqNKToQGDsM8hBADgUXoGAiqhSvp5QAnQKGIgUhwFUYLCVDFCrKUE1lBavAViFIDlTImbKC5Gm2hB0SlBCBMQiB0UjIQA7"
      download="my_image.gif"
    >
      Download image
    </a>
  3. Click the link
  4. Notice that Matomo sends the entire data URL in the payload to Matomo, e.g. as seen in the Chrome DevTools Network tab: [image: request to Matomo containing the full data URL]

With the above example the href is relatively small so it's not as consequential, but for very large files/data, you can see how this gets very expensive.
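To put a number on "very expensive", the following sketch estimates the length of a base64 data URL from the size of the underlying file. Base64 encodes every 3 bytes as 4 characters (with padding), so the URL is roughly a third larger than the file itself. The function names are hypothetical, and the estimate ignores any additional growth from URL-encoding the value in the tracking request.

```javascript
// Length of the padded base64 encoding of `byteCount` bytes:
// every group of 3 bytes (or part of one) becomes 4 characters.
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

// Approximate length of a full data URL of the form
// "data:<mediaType>;base64,<payload>".
function dataUrlLength(byteCount, mediaType) {
  var prefix = 'data:' + mediaType + ';base64,';
  return prefix.length + base64Length(byteCount);
}

// Example: a 5 MiB PNG exported by a charting library.
console.log(dataUrlLength(5 * 1024 * 1024, 'image/png')); // 6990530
```

So a single click on such a link would add close to 7 million characters to the tracking payload.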


sgiehl commented 1 year ago

Hi @sjdemartini, thanks for creating this detailed report. It totally makes sense to exclude any data URLs from tracking. Tracking them makes sense neither for downloads nor for outlinks.

Actually, I guess this could also be a privacy issue. If the generated data URL contains personal account information, that information would be tracked in Matomo, which could be problematic under privacy laws.

@tsteur @mattab This might be something we should change in Matomo 5, similar to https://github.com/matomo-org/matomo/issues/17017

heurteph-ei commented 1 year ago

Another possibility in this case could be to track just `data` as the URL value instead of the full content of the href. Then there is no personal data leak and no high network consumption, but download tracking still works as expected (all downloads are tracked)...
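A minimal sketch of that idea follows. The helper `redactDataUrl` is hypothetical and not part of Matomo. Keeping the media type in the placeholder is an extra assumption beyond the suggestion above, which only mentions `data`.

```javascript
// Hypothetical helper, NOT part of Matomo's tracker code:
// replaces a data URL with a short placeholder so the encoded file
// contents never end up in the tracking payload.
function redactDataUrl(href) {
  // Capture the optional media type that follows "data:" and ends at
  // the first ";" (parameters) or "," (start of the payload).
  var match = /^\s*data:([^;,]*)/i.exec(href);
  if (!match) {
    return href; // not a data URL: leave it untouched
  }
  // Assumption: keeping the media type (e.g. "image/gif") is harmless
  // and makes the tracked record easier to interpret.
  return match[1] ? 'data:' + match[1] : 'data:';
}
```

For the GIF link from the reproduction steps, this would yield `data:image/gif` as the tracked URL value, while regular URLs pass through unchanged.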

sgiehl commented 1 year ago

@heurteph-ei That also sounds like a useful idea, but in this case we might need to mark or explain such data records in the UI, so it's clear to the user what it means.