microsoft / playwright-python

Python version of the Playwright testing and automation library.
https://playwright.dev/python/
Apache License 2.0

[Question]: How to get the content of an api request #1603

Closed gamorav closed 1 year ago

gamorav commented 1 year ago

Your question

Hi, I have this code:

from playwright.sync_api import sync_playwright

def intercept_response(response):
    # we can extract details from background requests
    if response.request.resource_type == "xhr":
        if "https://api.investing.com/api/financialdata/historical/" in response.url:
            print(response.url)

url = "https://www.investing.com/commodities/crude-oil-historical-data"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.on("response", intercept_response)
    page.goto(url)
    browser.close()

My problem is that I only get the URL, method, and headers of this API request:

https://api.investing.com/api/financialdata/historical/...

But I want to get this content:

{"data":[{"direction_color":"redFont","rowDate":....

How can I do that with Playwright? Is it possible?

Thanks!

gamorav commented 1 year ago

Hi, I have been trying to work around this. I am now recording a HAR file.

The code:

from playwright.sync_api import sync_playwright
from pprintpp import pprint as pp

url = 'https://www.investing.com/commodities/crude-oil-historical-data'

def intercept_response(response):
    # Log the URL of successful XHR responses from the historical-data API
    if response.request.resource_type == "xhr":
        if "/api/financialdata/historical/" in response.url and response.status == 200:
            pp(response.url)

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    # Record only the matching API traffic into example.har
    context = browser.new_context(
        record_har_path="example.har",
        record_har_url_filter="**/api/financialdata/historical/**",
    )
    page = context.new_page()
    # page.on("response", intercept_response)
    page.goto(url)
    page.wait_for_timeout(2000)
    context.close()  # the HAR file is written when the context closes
    browser.close()

When I use headless=False, the response content is recorded correctly:

"content": {
            "size": 9835,
            "mimeType": "application/json",
            "compression": 7312,
            "text": "{\"data\":[{\"direction_color\":\"re......

But with headless=True it is not recorded:

 "response": {
          "status": -1,
          "statusText": "",
          "httpVersion": "HTTP/1.1",
          "cookies": [],
          "headers": [],
          "content": {
            "size": -1,
            "mimeType": "x-unknown"

How can I fix that?

Thanks!

mxschmitt commented 1 year ago

The reason it is recorded in headed mode but not in headless mode is most likely that the site you are automating uses bot protection against scrapers.

Coming back to your original question, how to get the request body, you can do it like this:

route.request.post_data
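
To make that one-liner concrete, here is a minimal sketch of a route handler that reads the request body. The helper name `log_post_data` and the `seen` list are illustrative assumptions, not something from the Playwright API itself.

```python
def log_post_data(route, seen):
    """Record the URL and body of an intercepted request, then let it through."""
    request = route.request
    # post_data is the request body as a string, or None when there is no body
    seen.append((request.url, request.post_data))
    # Without continue_() (or fulfill/abort) the request would stay pending
    route.continue_()
```

Assuming a `page` object like the one in the question, this could be registered with `page.route("**/api/financialdata/historical/**", lambda route: log_post_data(route, seen))` before calling `page.goto(url)`.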

If you just want to sniff the traffic, see here: https://playwright.dev/python/docs/network#network-events

and here for the methods: https://playwright.dev/python/docs/api/class-response
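
For the original question (the response content rather than the request body), the Response class linked above exposes `text()`, `json()` and `body()`. Below is a minimal sketch, assuming the API call fires during page load and is not blocked by the site's bot protection. `URL_FRAGMENT`, `is_target` and `fetch_api_json` are hypothetical names chosen for illustration.

```python
URL_FRAGMENT = "/api/financialdata/historical/"  # taken from the question above


def is_target(response, url_fragment=URL_FRAGMENT):
    """Return True for a successful response from the historical-data API."""
    return url_fragment in response.url and response.status == 200


def fetch_api_json(page_url):
    """Load page_url and return the parsed JSON body of the first matching response."""
    # Imported here so is_target can be used without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # expect_response waits until the predicate matches a response and
        # raises a timeout error if none arrives in time
        with page.expect_response(is_target) as response_info:
            page.goto(page_url)
        data = response_info.value.json()  # parsed response body
        browser.close()
    return data
```

Calling `fetch_api_json("https://www.investing.com/commodities/crude-oil-historical-data")` would then return the decoded payload, provided the site serves it to the automated browser. Unlike a `page.on("response", ...)` listener with a fixed `wait_for_timeout`, this waits only as long as the matching response takes to arrive.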

mxschmitt commented 1 year ago

Closing as part of the triage process since it seemed stale. Please create a new issue with a detailed reproduction or feature request if you still face issues.

gamorav commented 1 year ago

Thanks for the help and explanation.