NoamGaash commented 1 year ago

Background / Use cases

The routeFromHAR functionality encapsulates two behaviors. When update is true, it behaves as a custom network recorder, while when update is false (default) it serves as a network communication mock. I would like to suggest several improvements:

saving minimal data

Today, each HAR entry contains a lot of unnecessary data: timing, HTTP version, request headers, and more. While this data might be useful for analysis, most of it is not used for network traffic replay. Although omitting the unnecessary fields would deviate from the formal HAR specification, I believe it is worth it in terms of clarity, bundle size, concise git diffs, output predictability, and maintainability.

example

before:

      {
        "startedDateTime": "2023-03-05T05:09:17.557Z",
        "time": 8.897,
        "request": {
          "method": "GET",
          "url": "http://localhost:10102/api/pages?accountId=<my_id>",
          "httpVersion": "HTTP/1.1",
          "cookies": [],
          "headers": [
            { "name": "Accept", "value": "application/json, text/plain, */*" },
            { "name": "Accept-Language", "value": "en-US" },
            { "name": "Cookie", "value": "cookies....." },
            { "name": "Expect", "value": "202+location" },
            { "name": "Referer", "value": "http://localhost:port/path?accountId=id" },
            { "name": "User-Agent", "value": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/110.0.5481.177 Safari/537.36" },
            { "name": "sec-ch-ua", "value": "" },
            { "name": "sec-ch-ua-mobile", "value": "?0" },
            { "name": "sec-ch-ua-platform", "value": "" },
            { "name": "x-client-request-id", "value": "8a887b6a-7638-4e0a-bf04-b4c97a78199e--4" }
          ],
          "queryString": [
            {
              "name": "accountId",
              "value": "<my id>"
            }
          ],
          "headersSize": -1,
          "bodySize": -1
        },
        "response": {
          "status": 200,
          "statusText": "OK",
          "httpVersion": "HTTP/1.1",
          "cookies": [],
          "headers": [
            { "name": "X-Powered-By", "value": "Express" },
            { "name": "cache-control", "value": "no-store" },
            { "name": "connection", "value": "close" },
            { "name": "content-encoding", "value": "gzip" },
            { "name": "content-type", "value": "application/json; charset=utf-8" },
            { "name": "date", "value": "Sun, 05 Mar 2023 05:09:17 GMT" },
            { "name": "strict-transport-security", "value": "max-age=31536000" },
            { "name": "transfer-encoding", "value": "chunked" },
            { "name": "vary", "value": "Accept-Encoding" },
            { "name": "x-content-type-options", "value": "nosniff" },
            { "name": "x-permitted-cross-domain-policies", "value": "None" },
            { "name": "x-xss-protection", "value": "1; mode=block" }
          ],
          "content": {
            "size": -1,
            "mimeType": "application/json; charset=utf-8",
            "text": "{\"pages\":[{\"id\":\"1234\",\"name\":\"a new created test project\",\"createdAt\":\"2023-03-05T05:09:18.0612396+00:00\",\"updatedAt\":\"2023-03-05T05:09:18.0612396+00:00\",\"source\":\"filesystem\",\"status\":\"New\",\"metadata\":{}},{\"id\":\"00000251727294129115\",\"name\":\"with implementation\",\"createdAt\":\"2023-01-29T15:37:50.8846963+00:00\",\"updatedAt\":\"2023-01-29T15:38:07.5206989+00:00\",\"source\":\"filesystem\",\"status\":\"None\",\"metadata\":{}},{\"id\":\"00000251727302447488\",\"name\":\"no implementation\",\"createdAt\":\"2023-01-29T13:19:12.5111176+00:00\",\"updatedAt\":\"2023-01-29T15:37:23.8938755+00:00\",\"source\":\"filesystem\",\"status\":\"None\",\"metadata\":{}}]}"
          },
          "headersSize": -1,
          "bodySize": -1,
          "redirectURL": ""
        },
        "cache": {},
        "timings": { "send": -1, "wait": -1, "receive": 8.897 }
      }

after:

      {
        "request": {
          "method": "GET",
          "url": "http://localhost:10102/api/pages?accountId=<my_id>"
        },
        "response": {
          "status": 200,
          "statusText": "OK",
          "httpVersion": "HTTP/1.1",
          "cookies": [],
          "headers": [
            { "name": "X-Powered-By", "value": "Express" },
            { "name": "cache-control", "value": "no-store" },
            { "name": "connection", "value": "close" },
            { "name": "content-encoding", "value": "gzip" },
            { "name": "content-type", "value": "application/json; charset=utf-8" },
            { "name": "date", "value": "Sun, 05 Mar 2023 05:09:17 GMT" },
            { "name": "strict-transport-security", "value": "max-age=31536000" },
            { "name": "transfer-encoding", "value": "chunked" },
            { "name": "vary", "value": "Accept-Encoding" },
            { "name": "x-content-type-options", "value": "nosniff" },
            { "name": "x-permitted-cross-domain-policies", "value": "None" },
            { "name": "x-xss-protection", "value": "1; mode=block" }
          ],
          "content": {
            "mimeType": "application/json; charset=utf-8",
            "text": "{\"pages\":[{\"id\":\"1234\",\"name\":\"a new created test project\",\"createdAt\":\"2023-03-05T05:09:18.0612396+00:00\",\"updatedAt\":\"2023-03-05T05:09:18.0612396+00:00\",\"source\":\"filesystem\",\"status\":\"New\",\"metadata\":{}},{\"id\":\"00000251727294129115\",\"name\":\"with implementation\",\"createdAt\":\"2023-01-29T15:37:50.8846963+00:00\",\"updatedAt\":\"2023-01-29T15:38:07.5206989+00:00\",\"source\":\"filesystem\",\"status\":\"None\",\"metadata\":{}},{\"id\":\"00000251727302447488\",\"name\":\"no implementation\",\"createdAt\":\"2023-01-29T13:19:12.5111176+00:00\",\"updatedAt\":\"2023-01-29T15:37:23.8938755+00:00\",\"source\":\"filesystem\",\"status\":\"None\",\"metadata\":{}}]}"
          }
        }
      }
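
Until something like this exists in Playwright itself, the trimming could be approximated as a post-processing step over a recorded HAR file. The sketch below is only an illustration of the idea: `minimizeEntry` and `pick` are hypothetical helpers, not Playwright APIs, and the list of kept fields simply mirrors the "after" example above. `postData` is kept on the assumption that POST payloads are still needed for matching.

```typescript
type Json = Record<string, unknown>;

// Copy only the listed keys that are actually present on the source object.
function pick(source: Json, keys: string[]): Json {
  const out: Json = {};
  for (const key of keys) {
    if (key in source) out[key] = source[key];
  }
  return out;
}

// Hypothetical helper: reduce one HAR entry to the fields kept in the "after" example.
function minimizeEntry(entry: Json): Json {
  const request = (entry.request ?? {}) as Json;
  const response = (entry.response ?? {}) as Json;
  const content = (response.content ?? {}) as Json;
  return {
    // Assumption: method, URL and POST payload are all that matching needs.
    request: pick(request, ["method", "url", "postData"]),
    response: {
      ...pick(response, ["status", "statusText", "httpVersion", "cookies", "headers"]),
      // `_file` and `encoding` are kept so attached or base64 bodies still resolve.
      content: pick(content, ["mimeType", "text", "encoding", "_file"]),
    },
  };
}
```

Applied to every element of `har.log.entries`, this drops timings, request headers and size fields while leaving the response untouched.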

excludeUrl

Today, the page.routeFromHAR method accepts a url option that defines which URLs should be included in the resulting HAR file. I suggest adding an excludeUrl property, which would allow users to exclude specific path(s) from recording.

example

before:

page.routeFromHAR(
        `__tests__/networks_cache/${step}.har`,
        {
            url: /api|css|png|ttf/, // user tries to list all URLs that should be recorded in the HAR file
        }
    );

after:

page.routeFromHAR(
        `__tests__/networks_cache/${step}.har`,
        {
            url: /.*/, // user can use a wildcard
            excludeUrl: "http://localhost/bundle.js" // user excludes a specific path
        }
    );
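
As a stopgap, the exclusion can probably be folded into the existing url option, since that option accepts a regular expression. The sketch below builds such an expression with a negative lookahead. `urlExcluding` and `escapeRegExp` are hypothetical helper names, and the result only excludes exact URL matches; whether this behaves well in a given setup is an assumption worth checking.

```typescript
// Escape a literal string so it can be embedded in a regular expression.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: matches every URL that `include` matches,
// except URLs that are exactly equal to one of `excluded`.
function urlExcluding(include: RegExp, excluded: string[]): RegExp {
  const flags = include.ignoreCase ? "i" : "";
  if (excluded.length === 0) return new RegExp(include.source, flags);
  const blocked = excluded.map(escapeRegExp).join("|");
  // (?!...) rejects the excluded URLs; (?=...) requires the original pattern to match somewhere.
  return new RegExp(`^(?!(?:${blocked})$)(?=[\\s\\S]*?(?:${include.source}))`, flags);
}
```

The result could then be passed as `url: urlExcluding(/.*/, ["http://localhost/bundle.js"])`.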

smart matching algorithm

Playwright docs state that:

HAR replay matches URL and HTTP method strictly. For POST requests, it also matches POST payloads strictly. If multiple recordings match a request, the one with the most matching headers is picked

Would it be possible to grant users more control over the matching algorithm?

example

page.routeFromHAR(
        `__tests__/networks_cache/${step}.har`,
        {
            url: /api/,
            urlMatching: {
                "GET": {
                    matchLocation: "ignoreOrigin",
                    matchHeaders: false,

                },
                "POST": {
                    matchPayload: false
                },
                "delete": {
                    match: (harEntry, currRequest) => {
                        return userDefinedMatchScoringFunction(harEntry, currRequest);
                    }
                }
            }
        }
    );
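
To make the proposal more concrete, here is a rough sketch of how such rules could translate into a scoring function. It follows the documented behaviour quoted above (strict URL and method, strict payload, most matching headers wins) and relaxes it according to the rules. All names (`MatchRules`, `scoreEntry`, `pickBest`) are hypothetical, and this is not how Playwright implements matching internally.

```typescript
interface Header { name: string; value: string }
interface RequestLike { method: string; url: string; headers: Header[]; postData?: string }
interface MatchRules {
  matchLocation?: "strict" | "ignoreOrigin";
  matchHeaders?: boolean;
  matchPayload?: boolean;
}

// Drops "scheme://host[:port]" and keeps path, query and fragment.
function stripOrigin(url: string): string {
  return url.replace(/^[a-z][a-z0-9+.-]*:\/\/[^/?#]*/i, "");
}

// Returns -1 for "no match", otherwise the number of matching headers.
function scoreEntry(recorded: RequestLike, live: RequestLike, rules: MatchRules = {}): number {
  if (recorded.method.toUpperCase() !== live.method.toUpperCase()) return -1;
  const normalise = rules.matchLocation === "ignoreOrigin" ? stripOrigin : (url: string) => url;
  if (normalise(recorded.url) !== normalise(live.url)) return -1;
  if (rules.matchPayload !== false && (recorded.postData ?? "") !== (live.postData ?? "")) return -1;
  if (rules.matchHeaders === false) return 0;
  return recorded.headers.filter((h) =>
    live.headers.some((l) => l.name.toLowerCase() === h.name.toLowerCase() && l.value === h.value)
  ).length;
}

// Picks the highest-scoring candidate; the earliest one wins ties.
function pickBest(candidates: RequestLike[], live: RequestLike, rules?: MatchRules): RequestLike | undefined {
  let best: RequestLike | undefined;
  let bestScore = -1;
  for (const candidate of candidates) {
    const s = scoreEntry(candidate, live, rules);
    if (s > bestScore) {
      best = candidate;
      bestScore = s;
    }
  }
  return best;
}
```

A user-supplied match callback, as in the "delete" rule above, would simply replace `scoreEntry` for that method.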

summary

Thank you for considering my suggestions! Please note that my first suggestion is a breaking change - it would omit data from the HAR file, and there is a chance that some users rely on the timing/headers data in there. I would love to hear your opinion and get some feedback before implementing any changes or submitting PRs.

Thank you for maintaining Playwright! I love this tool. Noam

fs-projects commented 1 year ago

Hi All,

I think this is the most appropriate place to post one of my queries.

I recently set up HAR recording in my tests: all network requests are captured in a HAR file when the update flag is true, and that HAR file is then used to serve mock API responses when the update flag is false. I observed a few things that are blocking me from actually leveraging the HAR functionality. Please see below.

1. There is an API call that takes some time to return its results. Its output is quite large, and when I look at the Preview tab in the Network section of Chrome, I see this message: Failed to load response data: Request content was evicted from inspector cache. The UI was able to get all the data and display the results, but I cannot see any data in the Preview tab. I suspect this is also what causes Playwright NOT to record the response of this API in the HAR file: there is no _file key for this particular network call.

Workaround: I manually added a _file key and assigned it the value test.json. I created test.json in my HAR directory and added the response of this API to it. How did I get the response? I called the API in Postman to get the full response.

2. After applying the above workaround, I ran everything again and encountered another problem. Some API URLs pass a query parameter, start, to the backend. It is an epoch time value. When I record the HAR with the update flag set to true, start holds the epoch time at which the API was called. When I use the same HAR file with the update flag set to false, the request URL no longer matches the one saved in the HAR file, because start now has a different value.

www.example.com/api/v2/collect/?start=1695901908803 -- saved in the HAR file when recorded (update flag true)
www.example.com/api/v2/collect/?start=1695901912345 -- requested when the test is run from the HAR (update flag false): the application calls the same API but with a different start value

Because of this, when Playwright tries to mock this API call it does not find an exact match in the HAR file, and the mocking fails by default.

For point 2, I wanted to know if there is anything I can do to ensure that Playwright mocks the API even though the start parameter has changed, because for testing purposes I don't care about the difference in results between the two timestamps. I just want Playwright to use the same mock again.

One lengthy approach is to mock such APIs separately with a glob pattern and fulfill them with whatever result I want. But doing so, I won't be able to leverage the power of HAR, and this might become cumbersome for other pages where such APIs are called.

Any thoughts/guidance highly appreciated. Thanks!!

NoamGaash commented 1 year ago

@fs-projects in my projects I use things like:

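  await page.routeFromHAR( /* ...........  */)
  // Registered after routeFromHAR, so this handler sees the request first.
  await page.route(/v2\/collect/, async (route) => {
    // Pin the volatile `start` value to the one recorded in the HAR file.
    // (A bare /\d+/ would match the "2" in "v2" first.)
    const url = route.request().url().replace(/start=\d+/, "start=1695901908803")
    await route.fallback({ url })
  })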

But I feel terrible doing it. It's not maintainable at all. I'm trying to implement the smart matching algorithm in Playwright, but I'm facing some technical difficulties regarding the protocol.yml file. If anyone wants to join my efforts, I would love to have an online meeting and get some feedback. I'm available at noam.gaash@applitools.com

fs-projects commented 1 year ago

@NoamGaash I was on leave for some days. Thanks for sharing this workaround. I tweaked it for my app and it worked. I know it's not maintainable, but it's a great help. Could you please let me know if my understanding is correct?

await page.routeFromHAR( /* ........... */) -- Will serve the requests from the HAR file instead of making actual network calls.

  await page.route(/v2\/collect/, (route) => {
    const url = route.request().url().replace(/start=\d+/, "start=1695901908803")
    route.fallback({url})
  })

-- Ensures all matching URLs are rewritten as specified and then looked up in, and served from, the HAR file? Please explain in detail if possible.

Also I am happy to connect with you to join your efforts. I will shoot out an email to you and you can let me know if we can proceed.

NoamGaash commented 1 year ago

sure! thanks.

Playwright interceptors (page.route, page.routeFromHAR) are implemented with a stack mechanism: the first route handler you register will be the last to handle the request. For example, consider the following URL: www.example.com/api/v2/collect/?start=1695901912345 and the following code:

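  await page.routeFromHAR( /* ...........  */)
  await page.route(/v2\/collect/, async (route) => {
    // Target the `start` parameter explicitly; a bare /\d+/ would hit the "2" in "v2".
    const url = route.request().url().replace(/start=\d+/, "start=1234")
    await route.fallback({ url })
  })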

First, the request meets the last registered handler, page.route.
route.fallback then passes the request on to the next interceptor, with its URL rewritten to www.example.com/api/v2/collect/?start=1234.
The new (constant) URL meets page.routeFromHAR.
Everyone is happy (except us, the programmers).
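
The stack behaviour described above can be illustrated with a toy model that involves no browser at all. This is not Playwright's implementation: `ToyRouter`, `recorded` and the string results are made up for the illustration. It also assumes, as I understand the route.fallback documentation, that route patterns are tested against the original URL while later handlers see the rewritten one.

```typescript
// Toy model only: this is NOT Playwright's implementation.
type Fallback = (overrideUrl?: string) => string;
type Handler = (url: string, fallback: Fallback) => string;

class ToyRouter {
  private handlers: { pattern: RegExp; handler: Handler }[] = [];

  // Like page.route / page.routeFromHAR: later registrations sit on top of the stack.
  route(pattern: RegExp, handler: Handler): void {
    this.handlers.push({ pattern, handler });
  }

  dispatch(url: string): string {
    return this.run(this.handlers.length - 1, url, url);
  }

  private run(index: number, originalUrl: string, currentUrl: string): string {
    for (let i = index; i >= 0; i--) {
      const entry = this.handlers[i];
      // Patterns are tested against the original URL; handlers see the rewritten one.
      if (entry && entry.pattern.test(originalUrl)) {
        return entry.handler(currentUrl, (overrideUrl) =>
          this.run(i - 1, originalUrl, overrideUrl ?? currentUrl)
        );
      }
    }
    return "network:" + currentUrl; // no handler answered, so the request would hit the network
  }
}

// Stand-in for the HAR file: one recorded URL with a constant `start` value.
const recorded: Record<string, string | undefined> = {
  "https://www.example.com/api/v2/collect/?start=1234": "recorded response",
};

const router = new ToyRouter();
// Registered first, consulted last: stands in for page.routeFromHAR.
router.route(/.*/, (url, fallback) => recorded[url] ?? fallback());
// Registered last, consulted first: rewrites the volatile parameter.
router.route(/v2\/collect/, (url, fallback) =>
  fallback(url.replace(/start=\d+/, "start=1234"))
);
```

Dispatching a URL with any `start` value now ends at the recorded entry, while unrelated URLs fall through both handlers.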

NoamGaash commented 1 year ago

Hi everyone! I've made a small npm package that solves that issue. With the playwright-advanced-har package, you'll be able to:

Ignore port numbers
Ignore search params
Shuffle response order

and much more

Please don't hesitate to open an issue with any questions, and I'll do my best to help with any use case.

@fs-projects your use case can be solved using:

import { test, defaultMatcher } from "playwright-advanced-har";

test("ignore search params", async ({ page, advancedRouteFromHAR }) => {
    await advancedRouteFromHAR("tests/har/different-search-params.har", {
        matcher: (request, entry) => {
            const reqUrl = new URL(request.url());
            const entryUrl = new URL(entry.request.url);
            reqUrl.search = "";
            entryUrl.search = "";
            if (
                reqUrl.toString() === entryUrl.toString() &&
                request.method() === entry.request.method &&
                request.postData() == entry.request.postData?.text
            ) {
                return 1;
            }
            return -1;
        },
    });
    // page.goto needs an absolute URL (or a configured baseURL), hence the scheme.
    await page.goto("https://www.example.com/api/v2/collect/?start=" + Date.now());
});

bcowgill commented 11 months ago

I've just had a difficulty with routeFromHAR: we have a cache-busting parameter in our API URL, so the HAR file does not match.

/api?a=3648574675856 for example, where a changes on every API call.

I have basically made my own version of routeFromHAR that matches the URL and method, but takes a sanitiseUrl function which is called on the route.request().url() and the har.log.entries[].request.url values before matching them.

const sanitiseUrl = (url) => url.replace(/([?&]a=)\d+/, "$1NNNNNNNNN")

You might consider adding such a parameter here to advanced options if you are planning improvements.
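
A minimal sketch of the lookup described above, assuming the HAR file has already been parsed into an object. `findEntry` and the interface names are hypothetical; only `sanitiseUrl`, `har.log.entries` and `request.url` come from the description.

```typescript
interface HarEntryLike { request: { method: string; url: string }; response?: unknown }
interface HarLike { log: { entries: HarEntryLike[] } }

// Replace the cache-busting value of `a` with a constant placeholder.
const sanitiseUrl = (url: string): string => url.replace(/([?&]a=)\d+/, "$1NNNNNNNNN");

// Hypothetical lookup: compare method strictly and URLs after sanitising both sides.
function findEntry(
  har: HarLike,
  method: string,
  url: string,
  sanitise: (url: string) => string = sanitiseUrl
): HarEntryLike | undefined {
  const wanted = sanitise(url);
  return har.log.entries.filter(
    (entry) => entry.request.method === method && sanitise(entry.request.url) === wanted
  )[0];
}
```

Inside a page.route handler, the returned entry's response could then be used to fulfill the request, falling back to the network when nothing is found.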

NoamGaash commented 11 months ago

@bcowgill I believe this snippet will solve your use case:

import { test, customMatcher } from "playwright-advanced-har";

const fixUrl = url => url.replace(/([?&]a=)\d+/, "$1NNNNNNNNN")
test("ignore `a` get argument", async ({ page, advancedRouteFromHAR }) => {
    await advancedRouteFromHAR("tests/har/my-file.har", {
        matcher: customMatcher({
            urlComparator(a, b) {
                return fixUrl(a) === fixUrl(b);
            },
        }),
    });
    await page.goto("/api?a=3648574675856");
});

bcowgill commented 11 months ago

Hey, thanks, how about this other issue, will it solve that?

routeFromHAR header Access-Control-Allow-Origin should be configurable need to replace test env domain with localhost domain to play back captures locally https://github.com/microsoft/playwright/issues/28447

NoamGaash commented 11 months ago

@bcowgill actually, I'm considering adding that exact feature: https://github.com/NoamGaash/playwright-advanced-har/pull/6/files It will let you alter the entry found in the HAR. WDYT? Please open an issue for that. I have several ideas, but I want to be as backward compatible as possible, so I release new features very cautiously.

fs-projects commented 11 months ago

Thank you very much @NoamGaash

giladgd commented 11 months ago

Is there any update about this? When will it be part of Playwright?

NoamGaash commented 8 months ago

@bcowgill regarding your question:

Hey, thanks, how about this other issue, will it solve that?

routeFromHAR header Access-Control-Allow-Origin should be configurable need to replace test env domain with localhost domain to play back captures locally #28447

Hi, version 1.3.1 now supports intercepting the responses from the HAR file, so it solves your use case as well :)

tschoartschi commented 6 months ago

I'm also struggling to integrate the routeFromHAR functionality nicely into our project. Basically, everything works but update: true starts to clutter our git repository. Let me explain why:

What we do is the following:

  await page.routeFromHAR(harFile, {
    update: options.update,
    updateContent: 'attach',
    notFound: 'fallback',
    url: NETWORK_URL_REGEX_FOR_MOCK,
    updateMode: 'minimal',
  });

Now this creates all the HAR files and creates files for the responses. For example you will find the following in a HAR file:

"content": {
  "size": -1,
  "mimeType": "application/json",
  "_file": "0eeb62f9e778c07885a5323f4938ccb30969bdb0.json"
},

The filename is based on the sha1 of the content of the file (and the content is the response of our backend). Now my problem is that every response from our backend contains a meta field in the JSON like the following:

"meta": { "total": 1, "serverTime": "2024-04-26T11:11:46.789Z" }

Now the hash is always different because of the serverTime entry. Therefore every run with update: true creates dozens of new files that are only different because of serverTime.

I tried to work around this problem with the following:

  await page.routeFromHAR(harFile, {
    update: options.update,
    updateContent: 'attach',
    notFound: 'fallback',
    url: NETWORK_URL_REGEX_FOR_MOCK,
    updateMode: 'minimal',
  });

  await this._page.route(API_URL_REGEX, async (route) => {
    const response = await route.fetch();
    const json = await response.json();
    if (json.meta?.serverTime) {
      json.meta.serverTime = '2024-01-01T00:00:00.000Z';
    }
    await route.fulfill({ response, json });
  });

Now it seems like the changed response does not end up in the HAR file. Am I doing something wrong? I'm a little bit lost because I'm not sure if I try to do something completely crazy or if it's a valid case that I try to solve. Maybe someone has an idea. Maybe I can solve my problem with playwright-advanced-har but I'm not sure how. Basically, I want the normal routeFromHAR functionality but it should not create unnecessary files.

Maybe we could have some possibility to change the response before it's handed over to the HAR generation?
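
For illustration, such a hook could be as small as the function below. It is purely hypothetical: nothing in this thread suggests that Playwright exposes a place to plug it in, so today it would have to live in a custom recorder or in a post-processing step over the attached files. The name `stabiliseBody` and the fixed timestamp are assumptions.

```typescript
// Hypothetical normaliser: pins meta.serverTime so identical payloads serialise identically.
function stabiliseBody(body: string, fixedTime: string = "2024-01-01T00:00:00.000Z"): string {
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    return body; // not JSON: leave the body untouched
  }
  const doc = parsed as { meta?: { serverTime?: unknown } } | null;
  if (doc && typeof doc === "object" && doc.meta && typeof doc.meta === "object" && "serverTime" in doc.meta) {
    doc.meta.serverTime = fixedTime;
    return JSON.stringify(doc);
  }
  return body; // no serverTime field: nothing to stabilise
}
```

Two responses that differ only in serverTime then produce the same string, and therefore the same content hash and file name.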

NoamGaash commented 6 months ago

@tschoartschi I'm trying to understand the use case - why would you update your har file often?

Regarding intercepting the request using page.route: it won't change the content of the HAR file (see #29190). Maybe it's by design: it can be convenient to rely on the fact that the HAR file reflects the real network traffic that occurred.

Have you considered using updateContent: "embed"?
Also, consider using a teardown phase to clean outdated artifacts

tschoartschi commented 6 months ago

@NoamGaash thanks for the fast response 🙂

Our app is pretty data-intensive and makes lots of requests. One of our most used examples makes 89 requests to our backend. Mocking each of these 89 requests manually is tedious, which is why I thought about using HAR files.

We want to commit the HAR files to our git repo so that every dev has the same mocking data. Also, CI and QA-checks should use those HAR files.

If I have two tests for example:

const update = true;
test('override a network call', async ({ page, context }) => {
  await page.routeFromHAR('har1.har', {
    update: update,
    updateContent: 'attach',
    notFound: 'fallback',
    url: NETWORK_URL_REGEX_FOR_MOCK,
    updateMode: 'minimal',
  });
  // run the test
});

test('do some other test', async ({ page, context }) => {
  await page.routeFromHAR('har2.har', {
    update: update,
    updateContent: 'attach',
    notFound: 'fallback',
    url: NETWORK_URL_REGEX_FOR_MOCK,
    updateMode: 'minimal',
  });
  // run the other test
});

I end up with 89 * 2 = 178 files. Although 89 would be enough. This adds up the more tests I have 🤔 and quickly I have thousands of files...

I think using updateContent: "embed" only hides the problem because then every HAR file becomes unnecessarily big.

I also thought about cleaning up in teardown, but essentially every file is referenced in some HAR file. Sure, I could create a complicated clean-up logic that tries to find files with the same content, changes the references in the HAR files and then deletes the unused files. But that sounds like a lot of hassle, given that the default solution almost does what we need 🙂

Meanwhile, I think it might be better if I just wrote my own capture and mock logic for our backend. Based on page.route.

Let me know if I explained my problem properly now 🙂 if anything is unclear I can try to explain it even in more detail

tschoartschi commented 6 months ago

@NoamGaash we have now implemented our own logic. It's based on page.route, similar to what they show in the docs here: https://playwright.dev/docs/network#modify-requests

The idea is to create JSON files if we want to update (similar to the HAR file generation), and when we do not want to update we read those JSONs.

This gives us much more flexibility and eases writing tests a lot for us 🙂

NoamGaash commented 6 months ago

@tschoartschi interesting! If you change your mind, I'm always open to contributions to the advancedRouteFromHAR fixture. I think that making the postProcess function change the actual saved file could be a nice feature.

vitalets commented 4 months ago

After struggling with HAR in a similar way to @tschoartschi, we also ended up with our own solution based on pure page.route. I think HAR is not the best format when you need fine-grained control of the network in e2e tests. We shared the solution as an open-source package: playwright-network-cache.

microsoft / playwright

[Feature] advanced configurations for routeFromHAR #21405
