tweaselORG / TrackHAR

Library for detecting tracking data transmissions from traffic in HAR format.
Creative Commons Zero v1.0 Universal

`unhar`: Content extraction insufficient #58

Open baltpeter opened 6 months ago

baltpeter commented 6 months ago

In unhar(), we are currently assuming that a HAR can only hold a request body in request.postData.text:

https://github.com/tweaselORG/TrackHAR/blob/b2191417c38634dcc0124e07b14f282bdc80f404/src/common/request.ts#L45

That is not true. It can also have POST params in request.postData.params (which we are currently just ignoring):

http://www.softwareishard.com/blog/har-12-spec/#postData
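
To make the gap concrete, here is a minimal sketch of reading a body from either field. It is not TrackHAR's actual code: the `HarPostParam` and `HarPostData` types and the `postDataToBody` helper are hypothetical names, with the field names taken from the HAR 1.2 spec. `text` and `params` are typed as optional on the assumption that implementations don't reliably populate both.

```typescript
// Shape of `request.postData`, with field names from the HAR 1.2 spec.
// `text` and `params` are optional here (an assumption), because
// implementations may populate either one, both, or neither.
interface HarPostParam {
    name: string;
    value?: string;
    fileName?: string;
    contentType?: string;
}

interface HarPostData {
    mimeType: string;
    text?: string;
    params?: HarPostParam[];
}

// Hypothetical helper: prefer the raw `text`, and fall back to re-encoding
// `params` as a urlencoded string when `text` is absent or empty.
function postDataToBody(postData: HarPostData | undefined): string | undefined {
    if (!postData) return undefined;
    if (postData.text) return postData.text;
    if (postData.params && postData.params.length > 0) {
        return postData.params
            .map((p) => `${encodeURIComponent(p.name)}=${encodeURIComponent(p.value ?? '')}`)
            .join('&');
    }
    return undefined;
}
```

Re-encoding `params` as urlencoded is only an approximation. For a multipart/form-data request it does not reproduce the original body, but it at least keeps the field names and values visible instead of silently dropping them.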

baltpeter commented 6 months ago

It's really unfortunate that there are so many differences between HAR implementations. :(

The spec claims: "Note that text and params fields are mutually exclusive."

Yeah, that's not true at all in practice. Let's go through a few examples.

File upload

I've used https://cgi-lib.berkeley.edu/ex/fup.cgi to capture a HAR of a simple file upload in Firefox (file-upload-firefox.json) and Chrome (file-upload-chrome.har).

The site uses multipart/form-data as the encoding:

image

In Firefox, the raw multipart encoded data ends up as a string in text in the HAR:

image

In Chrome, meanwhile, both text and params are populated in the HAR:

image

In params, the file I uploaded is "helpfully" replaced with (binary); in text, it appears to be missing entirely. In fact, as far as I can tell, the uploaded file isn't included anywhere in the HAR. o.o

And indeed, I can't seem to find a way to get to it in the Chrome dev tools, either:

image

image

So, don't use Chrome to generate HAR files if you want them to actually contain everything you've uploaded, I guess? Phenomenal stuff.

HTML form

I also tried the simpler case of this basic HTML form in Chrome (post-chrome.json) and Firefox (post-firefox.har):

<!DOCTYPE html>
<html>
<body>
<form action="https://example.org" method="post">
    <input type="text" name="test">
    <input type="submit">
</form>
</body>
</html>

As I didn't set an enctype, the data is transmitted as application/x-www-form-urlencoded (the default).

In Firefox, both text and params are populated, with the raw and parsed data, respectively:

image

The same is the case in Chrome:

image
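
For application/x-www-form-urlencoded bodies, the relationship between the two fields can be sketched with the standard `URLSearchParams` class: `text` holds the raw string, and `params` is what you get when you parse it. The `parseUrlencoded` helper name and the sample value are made up for illustration.

```typescript
// Parse a raw urlencoded body (as found in `postData.text`) into
// HAR-style name/value pairs (as found in `postData.params`).
function parseUrlencoded(text: string): { name: string; value: string }[] {
    const params: { name: string; value: string }[] = [];
    // URLSearchParams percent-decodes and turns `+` into a space.
    new URLSearchParams(text).forEach((value, name) => {
        params.push({ name, value });
    });
    return params;
}

// A hypothetical submission of the form above, with "hello world" typed into `test`:
console.log(parseUrlencoded('test=hello+world'));
```

So when only `text` is available for this encoding, the parsed parameters can be recovered from it. The reverse is not guaranteed to reproduce the raw bytes exactly.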

Other implementations

I also tried two other HAR implementations. First, Insomnia (post-insomnia.har, multipart-insomnia.har), which only populates params in both cases:

image

image

And, more importantly for us, the mitmproxy HAR dump script. I only had an example for application/x-www-form-urlencoded lying around. In that case, it populates both params and text:

image
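
To repeat this comparison on other HAR files without screenshots, a small classifier along these lines could report which fields an entry actually populates. This is a sketch: `classifyPostData` and the `PostDataFields` type are hypothetical names, and an empty string or empty array is treated as "not populated", which is an assumption.

```typescript
type PostDataFields = 'text-only' | 'params-only' | 'both' | 'neither';

// Report which of `postData.text` and `postData.params` carry content.
// Empty strings and empty arrays count as absent (an assumption).
function classifyPostData(postData?: { text?: string; params?: unknown[] }): PostDataFields {
    const text = postData?.text ?? '';
    const params = postData?.params ?? [];
    const hasText = text.length > 0;
    const hasParams = params.length > 0;

    if (hasText && hasParams) return 'both';
    if (hasText) return 'text-only';
    if (hasParams) return 'params-only';
    return 'neither';
}
```

Going by the observations above, the Firefox file upload would come out as `text-only`, the Insomnia exports as `params-only`, and the Chrome, Firefox and mitmproxy urlencoded captures as `both`.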