kubeshop / testkube

☸️ Kubernetes-native testing framework for test execution and orchestration
https://testkube.io

Unable to inspect gzip data sent to app.posthog.com: binary data sent with Content-Type: text/plain #3859

Closed gberche-orange closed 1 year ago

gberche-orange commented 1 year ago

Describe the bug

This is a follow-up of https://github.com/kubeshop/testkube/issues/3609

While further trying to inspect the content of the posthog.com requests issued by version 1.11.220, I'm unable to decode the gzip compression of the requests sent: gunzip complains about an unexpected end of file or an invalid format.

This reproduces both with requests saved as a cURL command from Firefox and with a HAR saved directly from the browser to the filesystem, on Ubuntu 20.04:

cat test.har | jq -r '[ .[].entries[].request | select(.url | contains("posthog.com") and contains("gzip")) | .postData.text ][0]' | gunzip --verbose

gzip: stdin: not in gzip format

Both the HAR and the cURL command strings captured by the browser contain binary characters.
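A quick way to confirm this (a sketch reusing the same test.har capture as above): pipe the extracted body through file, which classifies the bytes on stdin; for a sane capture it should detect gzip compressed data rather than arbitrary binary data.

cat test.har | jq -r '[ .[].entries[].request | select(.url | contains("posthog.com") and contains("gzip")) | .postData.text ][0]' | file -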

I suspect this comes from the HTTP request to app.posthog.com now having a Content-Type: text/plain header, which tells the browser to treat the posted data as UTF-8 text, whereas it should rather be application/gzip.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types specifies that the plain text content type must not contain binary data:

text/plain is the default value for textual files. A textual file should be human-readable and must not contain binary data.
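To double-check that suspicion, the Content-Type header actually recorded by the browser can be listed straight from the HAR (a sketch, assuming the capture is in test.har as above):

cat test.har | jq -r '.[].entries[].request | select(.url | contains("posthog.com")) | .headers[] | select(.name | ascii_downcase == "content-type") | .value'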

Not being able to inspect data sent to a 3rd party is likely preventing users from accepting anonymized telemetry.

If the proper application/gzip or application/octet-stream content type were set on the request to app.posthog.com, the cURL and HAR exports would likely use a proper binary encoding and therefore allow gzip decompression and data inspection.

https://w3c.github.io/web-performance/specs/HAR/Overview.html

A HAR file is REQUIRED to be saved in UTF-8 encoding. Other encodings are forbidden. A reader MUST ignore a byte-order mark if it exists in the file, and a writer MAY emit a byte-order mark in the file. Before setting the text field, the HTTP request/response is decoded (decompressed & unchunked), than trans-coded from its original character set into UTF-8. Additionally, it can be encoded using e.g. base64. Ideally, the application should be able to unencode a base64 blob and get a byte-for-byte identical resource to what the browser operated on.

The HAR would then include base64-encoded data that could simply be decoded before piping into gunzip.
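A sketch of what inspection could then look like, assuming the browser marks the body with postData.encoding == "base64" as the HAR spec allows:

cat test.har | jq -r '[ .[].entries[].request | select(.url | contains("posthog.com")) | select(.postData.encoding == "base64") | .postData.text ][0]' | base64 -d | gunzip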

Note: gztool can be used to further debug/diagnose the gzip data.

https://unix.stackexchange.com/a/543086/381792

  sudo add-apt-repository ppa:roberto.s.galende/gztool
  sudo apt-get update
  sudo apt-get install gztool

cat test.har | jq -r '[ .[].entries[].request | select(.url | contains("posthog.com") and contains("gzip")) | .postData.text ][0]' > posthog.gz
gztool -d -v 5 posthog.gz 
ACTION: Decompress file

  -a: 4,    -A: 0,  -b: 0,  -c: 0
  -C: 0,    -d: 1,  -D: 0,  -e: 0
  -E: 0,    -f: 0,  -F: 0
  -i: 0,    -I: (null)
  -l: 0,    -L: 0,  -n: 1
  -p: 0,    -P: 0
  -r: 0,    -R: 0
  -s: 10485760,     -S: 0,  -t: 0
  -T: 0,    -u:  ,  -v: 5,  -w: 0
  -W: 0,    -x: [1],    -X: 0
  -z: 0,    -Z: 0,  -[0-9]: -1

Processing 'posthog.gz' ...
Decompressing to 'posthog'
inflateInit2 = 0
implicit `-x`: 1 (decompress_and_build_index).
[read 926 B](936<0?)output_data_counter=0,totin=0,totout=0,totlines=1,ftello=926,avail_in=926
ERR -3: totin=2, totout=0, ftello=926
ERROR: compressed data error @2.
ERROR: Compressed data error in 'posthog.gz'.
ERROR: decompressing 'posthog.gz' file.

ERROR code = 1
Aborted.
1 files processed with errors!

For comparison, the gztool output for a valid gzip archive:

echo text > text.txt
gzip text.txt
gztool -d text.txt.gz -v 5

ACTION: Decompress file

  -a: 4,    -A: 0,  -b: 0,  -c: 0
  -C: 0,    -d: 1,  -D: 0,  -e: 0
  -E: 0,    -f: 0,  -F: 0
  -i: 0,    -I: (null)
  -l: 0,    -L: 0,  -n: 1
  -p: 0,    -P: 0
  -r: 0,    -R: 0
  -s: 10485760,     -S: 0,  -t: 0
  -T: 0,    -u:  ,  -v: 5,  -w: 0
  -W: 0,    -x: [1],    -X: 0
  -z: 0,    -Z: 0,  -[0-9]: -1

Processing 'text.txt.gz' ...
Decompressing to 'text.txt'
inflateInit2 = 0
implicit `-x`: 1 (decompress_and_build_index).
[read 34 B](44<0?)output_data_counter=0,totin=0,totout=0,totlines=1,ftello=34,avail_in=34
[>1>0,0,32768,15][>1>0]actual_index_point = 1,(0, 0)
[>1>0,5,32763,8][>1>5]Z_STREAM_END: totin=34, totout=5, ftello=34
Correct END OF GZIP file detected at EOF.
5.00 Bytes (5 Bytes) of data extracted.
1 lines extracted.
Deleting file 'text.txt.gz'

ERROR code = 0
1 files processed

To Reproduce

Capture app.posthog.com requests as a cURL command and try to decode them with gunzip.
Capture app.posthog.com requests as HAR and try to decode them with gunzip.

Expected behavior

As a testkube user, in order to trust the anonymized data sent during telemetry, I need to be able to inspect this data in clear text.


rangoo94 commented 1 year ago

Thanks @gberche-orange for the report! I'll check it tomorrow and will either fix it or get back to you with a solution.

rangoo94 commented 1 year ago

Small note: it looks like the jq query was catching items that didn't have postData. Adding | select(.postData != null) will filter out the empty values.
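For example, the earlier query then becomes:

cat test.har | jq -r '[ .[].entries[].request | select(.url | contains("posthog.com") and contains("gzip")) | select(.postData != null) | .postData.text ][0]'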

The actual issue is most likely related to a Chromium bug, but I haven't yet identified the exact issue in their bug tracker. To verify that the bug exists in Chrome, though:

I took some sample calls that finished in the browser with {"status": 1}, but the generated cURL command ended with:

HTTP/2 400
content-type: application/json
date: Tue, 23 May 2023 14:17:44 GMT
access-control-allow-origin: https://cloud.testkube.io
access-control-allow-credentials: true
access-control-allow-methods: GET, POST, OPTIONS
access-control-allow-headers: X-Requested-With,Content-Type
x-content-type-options: nosniff
referrer-policy: same-origin
x-cache: Error from cloudfront
via: 1.1 604f8ac78ed3ba5235c1a14794f2ac64.cloudfront.net (CloudFront)
x-amz-cf-pop: FRA56-P5
x-amz-cf-id: SvBW9Cu_iU1BnpqdR7FfGixrqFbAIlKlPSBUUD1_7eB4FztDhzzHTg==

{"type": "validation_error", "code": "invalid_payload", "detail": "Malformed request data: Failed to decompress data. Not a gzipped file (b'\\x1f\\xc2')", "attr": null}

Looks like the bug in Chromium appeared somewhere in April/May, as I was able to successfully gunzip the payload on May 4th.

gberche-orange commented 1 year ago

The actual issue is most likely related to a Chromium bug

I also reproduce the problem with Firefox 102.11 ESR.

gberche-orange commented 1 year ago

Note that gzip produces binary output whose bytes may or may not happen to form valid Unicode characters, which makes the problem look as if it were reproducing randomly. I saw it work once on May 19th, and then failed to get it to work over countless attempts.

@rangoo94 what's your analysis w.r.t. the suggested root cause of an invalid content type?

rangoo94 commented 1 year ago

Thanks @gberche-orange! I finally found the problem based on your comment.

The problem is that jq outputs a UTF-8 string: each raw byte of the gzip stream is stored in the HAR as a Unicode code point, so every byte above 0x7f gets expanded into a multi-byte UTF-8 sequence. To make it work, you have to use e.g. iconv to map the code points back to single bytes:

cat test.har | jq -r '[ .[].entries[].request | select(.url | contains("posthog.com") and contains("gzip")) | select(.postData.text != null) | .postData.text ][0]' | iconv -f utf-8 -t iso8859-1 | gunzip 
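A minimal illustration of the mangling, using just the two-byte gzip magic number: the byte 0x8b is not a valid single-byte UTF-8 sequence, so transcoding turns it into 0xc2 0x8b. That is exactly the b'\x1f\xc2' prefix PostHog rejected above.

printf '\x1f\x8b' | iconv -f iso8859-1 -t utf-8 | xxd
# expected output is roughly: 00000000: 1fc2 8b    => the magic 1f 8b gained a spurious 0xc2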

There are two kinds of PostHog calls though - gzipped with compression=gzip-js, and raw base64-encoded.

If you want to read them all at once, you may run this command:

cat test.har \
  | jq -r '[
      .[].entries[].request | select(.url | contains("posthog.com"))
      | select(.postData.text != null)
      | if .url | contains("gzip-js")
          then "echo " + (.postData.text | @base64) + " | base64 -d | iconv -f utf-8 -t iso8859-1 | gunzip"
          else "echo " + [.postData.params | select(.[].name == "data") | .[].value][0] + " | base64 -d"
        end
    ] | join("\n")' \
  | bash

It will print all of the payloads in plain text, line by line.

Alternatively, to manipulate them back as a JSON array of payloads:

cat test.har \
  | jq -r '[
      .[].entries[].request | select(.url | contains("posthog.com"))
      | select(.postData.text != null)
      | if .url | contains("gzip-js")
          then "echo " + (.postData.text | @base64) + " | base64 -d | iconv -f utf-8 -t iso8859-1 | gunzip"
          else "echo " + [.postData.params | select(.[].name == "data") | .[].value][0] + " | base64 -d"
        end
    ] | join("\necho ,\n") | "echo [\n" + . + "\necho ]"' \
  | bash \
  | jq # or e.g. > payloads.json

rangoo94 commented 1 year ago

@gberche-orange, does it help with your issue? 🙂

gberche-orange commented 1 year ago

Thanks a lot @rangoo94 for your hard work on this issue and crafting this query ! I hope to get some availability to test it early next week.

Regarding the character encoding of the HAR, did you get the chance to test changing the Content-Type header on the request, and to confirm that browsers then encode the request body in the HAR directly as base64? This would make the dashboard more compliant with the specifications and would make inspection of the HAR much simpler: just decode the base64 before uncompressing the gzip binary bytes.

gberche-orange commented 1 year ago

Thanks @rangoo94! Your current script indeed helps me inspect the data posted to PostHog; here is a copy below:

[
  {
    "token": "phc_[...]",
    "distinct_id": "1886bb621384be-037023c622a1698-c575422-1fa400-1886bb62139731",
    "groups": {}
  },
  {
    "event": "$opt_in",
    "properties": {
      "$os": "Windows",
      "$os_version": "10.0",
      "$browser": "Firefox",
      "$device_type": "Desktop",
      "$pathname": "/tests",
      "$browser_version": 102,
      "$browser_language": "en-US",
      "$screen_height": 1080,
      "$screen_width": 1920,
      "$viewport_height": 303,
      "$viewport_width": 1908,
      "$lib": "web",
      "$lib_version": "1.57.1",
      "$insert_id": "cmz28jhbbl50yxyc",
      "$time": 1685434278.218,
      "distinct_id": "1886bb621384be-037023c622a1698-c575422-1fa400-1886bb62139731",
      "$device_id": "1886bb621384be-037023c622a1698-c575422-1fa400-1886bb62139731",
      "token": "phc_[...]",
      "$session_id": "1886bb62140d17-0d4d31d06e464c-c575422-1fa400-1886bb62141969",
      "$window_id": "1886bb621423b7-0b8cdcea1724158-c575422-1fa400-1886bb621471b1",
      "$pageview_id": "1886bb62148250-062b91ee55a1c98-c575422-1fa400-1886bb621497aa"
    },
    "timestamp": "2023-05-30T08:11:18.218Z"
  }
]

I no longer see gzip-js and Content-Type: text/plain content being sent. Should I assume the posthog-js lib was updated in between, to a new version loaded from the internet? I'm still using Testkube Helm chart version 1.11.220. Did you report an upstream issue to PostHog?

Thanks again for your help !

rangoo94 commented 1 year ago

Hi @gberche-orange, I'm happy that it helped! I didn't report it to PostHog yet - for now, I've only checked their existing issues, and nobody has reported any problems with that.

We didn't update posthog-js lately, but posthog-js has some logic to decide on the compression strategy.

There is a disable_compression option too, but it may lead to unnecessarily bigger transfers for users, so it would be better to avoid it.

rangoo94 commented 1 year ago

I'm closing this ticket, as I think we can't do anything more about it unless PostHog decides to change its implementation.