microsoft / playwright-python

Python version of the Playwright testing and automation library.
https://playwright.dev/python/
Apache License 2.0

[Bug]: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte #2555

Closed jolly-xw closed 2 months ago

jolly-xw commented 2 months ago

### Version

1.46.0

### Steps to reproduce

[Problem Description]

When I visit the URL https://trutechtools.com/ac-refrigeration-tools.html?utm_campaign=browse-abandoner-email1&utm_medium=email&utm_source=attentive&externalId=deHNX, I use page.on("request") to listen for requests and print both request.post_data and request.post_data_json. This raises an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. Does Playwright only support UTF-8 decoding, or am I misusing the API?

[Code]

import time
from playwright.sync_api import sync_playwright

if __name__ == '__main__':
    playwright_ins = sync_playwright().start()
    browser = playwright_ins.chromium.launch(headless=True)
    context = browser.new_context(
        accept_downloads=False, viewport={
            'width': 1920, 'height': 1080})
    page = context.new_page()

    def handle(request):
        print(request.post_data)
        print(request.post_data_json)

    page.on("request", lambda request: handle(request))

    url = "https://trutechtools.com/ac-refrigeration-tools.html?utm_campaign=browse-abandoner-email1&utm_medium=email&utm_source=attentive&externalId=deHNX"
    try:
        page.goto(url, timeout=5000)
    except Exception as err:
        print(err)
    try:
        page.wait_for_load_state("networkidle", timeout=10000)
    except Exception as err:
        print(err)
    time.sleep(2)

### Expected behavior

The default decoding seems to support only UTF-8. I expect both request.post_data and request.post_data_json to be decoded successfully.

### Actual behavior

An error occurs, apparently because the data in the POST request is not UTF-8, so decoding fails.

### Additional context

_No response_

### Environment

```Text
- Operating System: [Windows 10]
- CPU: [intel core]
- Browser: [Chromium]
- Python Version: [3.8.5]
- Other info:
```

mxschmitt commented 2 months ago

This looks expected to me: the request payload sent to https://o.clarity.ms/collect can't be decoded as UTF-8. You need to use post_data_buffer instead.

jolly-xw commented 2 months ago

So the default decoding for post_data is UTF-8, and I cannot change the encoding? If I hit a decoding error, my only option is to switch to post_data_buffer?

mxschmitt commented 2 months ago

Yes. Closing on that basis.
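
For readers who land here with the same error, the following is a minimal sketch of the post_data_buffer approach suggested above, not an official Playwright recipe. The helper name `decode_post_body` is hypothetical. It also rests on an assumption the thread does not confirm: the failing byte 0x8b at position 1 matches the second byte of the gzip magic number (0x1f 0x8b), so the sketch tries gzip decompression when a body starts with those two bytes, and falls back to the raw bytes whenever decoding fails.

```python
import gzip
import json
import zlib
from typing import Any, Optional

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def decode_post_body(buf: Optional[bytes]) -> Any:
    """Best-effort decoding of a raw POST body (hypothetical helper).

    Returns None for no body, parsed JSON when possible, a str when the
    body is UTF-8 text, and the original bytes otherwise.
    """
    if buf is None:
        return None
    # Assumption: a body that starts with the gzip magic number is
    # gzip-compressed. If decompression fails, return the bytes untouched.
    if buf[:2] == GZIP_MAGIC:
        try:
            buf = gzip.decompress(buf)
        except (OSError, EOFError, zlib.error):
            return buf
    try:
        text = buf.decode("utf-8")
    except UnicodeDecodeError:
        return buf  # not UTF-8 text; hand back the raw bytes
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def handle(request):
    # request.post_data_buffer yields bytes (or None) without UTF-8 decoding,
    # so it does not raise UnicodeDecodeError the way post_data can.
    print(decode_post_body(request.post_data_buffer))
```

The `handle` function can be registered exactly as in the original report, with `page.on("request", handle)`. Returning the raw bytes on failure keeps the handler from raising inside the event callback, at the cost of a return type the caller has to check.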