microsoft / playwright-testing-service

MIT License
88 stars 13 forks source link

[BUG]:[Tests running on Microsoft Playwright service are inconsistent and failing] #103

Open RF2023 opened 7 months ago

RF2023 commented 7 months ago

Describe the bug Tests running on Microsoft Playwright service are inconsistent. Same tests have better pass rate via localhost and our non prod env vs running on playwright service.

To Reproduce Steps to reproduce the behavior: N/A

Expected behavior Less errors and pass rate should match our local env runs with less time

Screenshots

Setup information (please complete the following information):

Additional context We have implemented playwright service for few tests, especially when a Pull request is open it runs on 10 workers for 25 tests and once the Pull request is merged it runs on 10 workers for 40 tests and the pass rate is not 100% and throws errors as shown above. We have observed some discrepancies on the results running on the service vs the tests without the service and only on 4 workers (100% pass rate). Couple of tests we have identified that backend was sending data but it was not able to read the data being sent and the UI verification fails.

We tried running 25 tests across 10 workers across 5 instances in parallel and only 3 of them passed with 40-0 and the other two never finished.

Another thing to note is we have few tests which need to run in serial mode, due to the nature of file upload and other scenarios and one test fails in that all of them get skipped.

Call log:
  - <ws connecting> wss://eastus.api.playwright.microsoft.com/api/authorize/connectSession
  - <ws error> wss://eastus.api.playwright.microsoft.com/api/authorize/connectSession error WebSocket was closed before the connection was established
  - <ws connect error> wss://eastus.api.playwright.microsoft.com/api/authorize/connectSession WebSocket was closed before the connection was established
  - <ws disconnected> wss://eastus.api.playwright.microsoft.com/api/authorize/connectSession code=1006 reason=
  - <ws connecting> wss://browser.playwright.microsoft.com/ws

AB#2017175

puagarwa commented 7 months ago

Hi @RF2023 thanks for reaching out to us and explaining your issue in detail. I looked into your issue in our telemetry and i found that the failure and latency issue you are facing is due to browser quota getting exhausted for your account.

We provide 50 parallel worker as default quota which means at any point of time your 50 request would immediately get browser but beyond that they will get queued at our side until your timeout happen or any existing browser closes.

I could see multiple instances in last few days where your account has gone beyond the quota limit and hence either request got delayed or they got timedout and failed.

Two such instances At: 01 April 11:20 GMT 109 request -> out of those 50 request ran success, 23 failed

At: 01 April 11:33 IST 78 request -> 55 ran success, 13 failed

There are two things we can do about it

The probable reason for your quota going beyond 50 is you have multiple jobs running in parallel. Even though individually your job uses only 10 worker but it get multiplied with number of jobs in parallel.

RF2023 commented 7 months ago

Thanks for responding back to us, so if i understood correctly and lets take this example. When we open a PR we have 25 tests that will run on playwright service on 10 workers. So the consumption is only 10 correct? Similarly when we merge the PR its 10 workers against 40 tests . The data you shared with us is true for those instances because we did 5 parallel runs and that was total of 50 workers for 125 tests. Did i understand that correctly?

RF2023 commented 7 months ago

Timeout is set at default 30000

puagarwa commented 7 months ago

HI @RF2023 Yes, PR with 10 worker would consume 10 parallel workers. Yes, build with 5*10 worker would consume 50 workers

I think whats probably happening here is sometime your multiple worklow might have ran in parallel either multiple PR build or multiple post PR test workflow. which would increase your total parallel browser count to more than 50.

we could increase it to 2-3mins for the case when your workflow consume more than quota and browser latency would increase.

RF2023 commented 7 months ago

so the 50 workers was a test, we can ignore that we want to go in depth for the one with 10 workers with 30 tests and 40 tests and why there is a discrepancy with our local runs.

puagarwa commented 7 months ago

Sure, can you please do a fresh run. please make sure you dont run other workspace/jobs in parallel to avoid quota issue and share us with time stamp and result from your side.

RF2023 commented 7 months ago

sounds good.

RF2023 commented 7 months ago

Triggered one just now.

RF2023 commented 7 months ago

8541740441-2-b4bed07479969177b44f6cf6b5f7227c1d19e986 The test run history doesnt allow us to expand a particular run, is there any other place we can see the logs or results?

puagarwa commented 7 months ago

8541740441-2-b4bed07479969177b44f6cf6b5f7227c1d19e986 The test run history doesnt allow us to expand a particular run, is there any other place we can see the logs or results?

The test run history is not clickable as it only contain few metadata information around run duration and max worker. You would already have all logs at your client side itself. We dont store any logs at server side.

puagarwa commented 7 months ago

HI @RF2023 I checked our telemetry and i can confirm that in last 4 runs, i dont see any failures as load is within quota. Let me know if you see any issue at your side.

miguelofbc commented 7 months ago

Hello @puagarwa !

Thank you for your previous assistance. We're following up on our discussion about worker quotas.

We recently ran a test suite with the following configuration:

We ensured nothing else was running concurrently. Based on our understanding, this setup would require a quota of 50 workers, is that correct?

For reference, here are the run IDs:

  1. 8550393480-1-32a7cf53994942134c02b0dfa95005756dc13110
  2. 8550393116-1-32a7cf53994942134c02b0dfa95005756dc13110
  3. 8550394120-1-32a7cf53994942134c02b0dfa95005756dc13110
  4. 8550393820-1-32a7cf53994942134c02b0dfa95005756dc13110
  5. 8550394373-1-32a7cf53994942134c02b0dfa95005756dc13110

However, we encountered failures in two of the runs:

  1. Error: locator.fill: Target page, context or browser has been closed (SessionID: 3f3601c0-6aa9-4ccd-aa57-b8f3afc07ebd)
  2. Error: locator.fill: Test timeout of 180000ms exceeded (SessionID: 9110dfc8-f36a-4532-9344-d0513a2daef9)

Interestingly, when we run the same tests with this configuration on our own infrastructure, everything works as expected. This suggests the tests themselves might not be flaky.

Would you be able to share your insights on these failures and confirm the required worker quota for our scenario?

puagarwa commented 7 months ago

Hi @miguelofbc , First thing to confirm that the above failures are not due to quota issue, as you stated you were well under quota limit hence you didnt face any issue during this run.

Coming to the above two failures, both of these happened during the test steps.

Error: locator.fill: Test timeout of 180000ms exceeded (SessionID: 9110dfc8-f36a-4532-9344-d0513a2daef9)

This one seems like test timeout, please check trace or video to understand why this happened

Error: locator.fill: Target page, context or browser has been closed (SessionID: 3f3601c0-6aa9-4ccd-aa57-b8f3afc07ebd)

This one does seems like browser crash at our end, it should be one off, let me know if you see it again.

puagarwa commented 7 months ago

Also, one thing you mentioned above around local vs service testing. You mentioned during local testing, you use local endpoint whereas with service you are using public endpoint, it would be better to compare local with public endpoint.

miguelofbc commented 7 months ago

Hello @puagarwa.

I apologize if my previous message caused any confusion.

To clarify, when I mentioned "our own infrastructure," I meant our CI environment, not local machines. We are running tests in parallel on our CI system, not locally.

W'd appreciate it if you could confirm which failures in our previous message indicate exceeding the worker quota. Are these failures communicated to users in any way?

The specific error we encountered was a TimeoutError: browserType.connect. Is this the expected error when the quota limit is reached?

Timeout Error Details:

Call log:
  - <ws connecting> wss://eastus.api.playwright.microsoft.com/api/authorize/connectSession
  - <ws error> wss://eastus.api.playwright.microsoft.com/api/authorize/connectSession error WebSocket was closed before the connection was established
  - <ws connect error> wss://eastus.api.playwright.microsoft.com/api/authorize/connectSession WebSocket was closed before the connection was established
  - <ws disconnected> wss://eastus.api.playwright.microsoft.com/api/authorize/connectSession code=1006 reason=
  - <ws connecting> wss://browser.playwright.microsoft.com/ws

And we did look into the trace file.

Here's the key point: We experience flakiness with Playwright service during parallel runs (5 runs) that doesn't occur in our existing CI environment. Even though these failures might be happening during the test phase, due to quota, etc., these tests have been running under similar load for months without issues. Our goal is to adopt Playwright service, but we need to ensure reliability at least matches our current infrastructure.

Would it be helpful if I provided additional logs from the failing runs?

puagarwa commented 7 months ago

Hi @miguelofbc , If you hit quota error, you would get timeout error like below or 409 error immediately. so its easy to identify if its quota issue or any other issue during the test phase.

TimeoutError: browserType.connect

I understood that you are running it from CI, but i was referring to this line

Same tests have better pass rate via localhost and our non prod env vs running on playwright service.

I like to confirm here when you are comparing with service vs without service runs, is your testing endpoint same ? My assumption from above line was that you were earlier running with localhost endpoint but with service you have shifted to prod endpoint. So in order to have fair comparison , i advise to use same test endpoint with non-service run and check.

But irrespective of it, We can work on failures we are getting with service and try to understand it. Can you run few times and share failure details. We can understand if there is pattern like particular set of tests prone to failure or particular type of failure.

Few suggestion, i can provide in general

miguelofbc commented 7 months ago

Hello @puagarwa,

Thanks for the feedback. Please find inline my answer for your questions.

I like to confirm here when you are comparing with service vs without service runs, is your testing endpoint same ?

The testing endpoint is the same, yes. We are running against localhost under CI. We tried to replicate the exact same configuration/architecture, before coming to you.

But irrespective of it, We can work on failures we are getting with service and try to understand it. Can you run few times and share failure details. We can understand if there is pattern like particular set of tests prone to failure or particular type of failure.

Appreciate the offer for more testing, but have you had a chance to analyze the previously mentioned failures in more depth? It would be helpful to understand if you're also looking into the root cause of these issues. For example, the browser crashed with the error: "locator.fill: Test timeout of 180000ms exceeded (SessionID: 9110dfc8-f36a-4532-9344-d0513a2daef9)". Have you been able to investigate what caused this crash?

With service since it works on client-service model, individual tests time could increase due to network hop but the idea is with more parallel workers, it get compensated to keep overall time low.

Sorry, I need a bit more context to understand your point fully. Could you elaborate on what kind of service tests you're referring to? Knowing the specific test types (functional, performance, etc.) would help me explain the impact of the client-server model better.

puagarwa commented 7 months ago

HI @miguelofbc , Regarding test failure analysis: The reason for asking to perform more test run is to find out pattern of failure. in last run you shared two failure, out of that one seems to service related which we trying to understand more. I want to understand if its happening regularly or one off instance. and we are also looking into more detail at our end as well. You can as well look into already ran instances to check for failed test. You can leave out browser.connect timeout failures but we can look into others if any to find pattern.

Regarding client-service model: Your tests script creates a websocket connection with our service browser and then send out command to perform browser operations and receive response to execute tests. Unlike when you run without service, there is no websocket connection but test script is passing command to local browser in same machine, hence due to network getting involved in service mode, its expected to see higher test time if you compare for individual tests for with service vs without service. But the benefit of service is it allows you to use higher parallelism which eventually reduce overall test run time.

RF2023 commented 7 months ago

Here are more runs, please let me know if you need anything else. Thank you !!

8565671485-1-0d2bec4b4e4fdd8af2ea1acb0f977a87e9b743ce 8565671572-1-0d2bec4b4e4fdd8af2ea1acb0f977a87e9b743ce 8565671675-1-0d2bec4b4e4fdd8af2ea1acb0f977a87e9b743ce 8565671794-1-0d2bec4b4e4fdd8af2ea1acb0f977a87e9b743ce 8565671900-1-0d2bec4b4e4fdd8af2ea1acb0f977a87e9b743ce

miguelofbc commented 7 months ago

Hello @puagarwa,

We've tried again and here is the report.

Configuration:

Run IDs:

RF2023 commented 7 months ago

8573073995-1-7dbb6139d78d934aa71a0b9ecc2c9762e61a6f8a 8573050563-1-420f4361667d5fcac73d5882406109e294d09d3a

Just to add back to back runs have failed with the following error: Time stamp 12:53 PM EST Error: browserType.connect: Target page, context or browser has been closed

two tests in each run

puagarwa commented 7 months ago

Hi @RF2023, @miguelofbc , Thanks for the detail, Let me check and get back. Meanwhile, if you have handy please share sessionId, that would be helpful in debugging.

miguelofbc commented 7 months ago

Hi @RF2023, @miguelofbc ,

Thanks for the detail, Let me check and get back. Meanwhile, if you have handy please share sessionId, that would be helpful in debugging.

Sorry, but this time couldn't find that in the report. Is there any other way to get the session ID on our end?

puagarwa commented 7 months ago

HI @miguelofbc No issue, i found the issue.

This was related one issue which happen today in our infra, We have fixed it now. It shouldnt happen again. Sorry for the inconvenience.

RF2023 commented 7 months ago

Our last few runs have been stable, thanks for making the fix. Over the course of last few runs we observed only one failure.

sessionID 291b6586-6aa2-43b3-bd43-c87a68ac6882

error: Test timeout of 180000ms exceeded.

miguelofbc commented 7 months ago

Hello @puagarwa ! Another failure: Error: locator.fill: Target page, context or browser has been closed

SessionId = 7d374527-ba2f-4c59-87d6-26f52c294db0

puagarwa commented 7 months ago

SessionId = 7d374527-ba2f-4c59-87d6-26f52c294db0

This is due to our infra machine getting restarted. Usually, this failure should be very rare but we will check to make sure they shouldn't be frequent.

One thing, is it possible for you to add retry in your test suite to counter for these occasional failures and keep your build green.

miguelofbc commented 7 months ago

Thank you ! Unfortunately, we cannot. We moved away from retry to make sure that we are looking into every failure. We were using retry, and it was really useful for us, but only until we got stable enough to remove that. From the previous suggestions the only thing we can consider is increasing the timeout.

puagarwa commented 7 months ago

Let me bucketize the issues we faced so far and possible solution.

RF2023 commented 7 months ago

Let me bucketize the issues we faced so far and possible solution.

  • error: Test timeout of 180000ms exceeded. : This is most likely test issue where test couldnt get completed in 3mins, i will suggest you to check trace to understand where it took more time. if you think its genuinely can take more than 3mins, then increase test timeout.
  • Error: locator.fill: Target page, context or browser has been closed : This is our rare infra issue where machine get restarted due to maintenance and hence test connection drops and test fails. Unfortunately, not much can be done here but retry.
  • Error: browserType.connect: Target page, context or browser has been closed : This was our temporary infra issue which shouldnt happen again after our fix.
  • browserType.connect: Timeout 30000 , this was due to throttling which you you faced initial when ran more than 50 parallel worker. We could increase timeout but best is to make sure parallel worker should be less than 50 or raise request to increase quota.

Thank you for the detailed explanation and fixing the infra issue helping us to be more stable. We will look into the time for the test running as we have few runs planned today and tomorrow. Will keep you posted on how it goes. Appreciate it !

vvs11 commented 7 months ago

Hi @RF2023 I wanted to check in with you about the stability of your tests. Are you seeing improvements?

RF2023 commented 7 months ago

Hello, in our last run we are seeing tests failing which have to run serially and we observed it before as well. Can you please take a look at the last three runs and see any discrepancies from your side. One run was 153 test cases - 8716229596-1-9b204a026df46d1e4c161ea5f99570a384475505 and another one was 278 test cases- 8716326108-1-9b204a026df46d1e4c161ea5f99570a384475505 across 10 workers.

puagarwa commented 7 months ago

Hi @RF2023 We are looking into these runs. BTW can you share what error did your test gave during failure. We need to understand if this is due to service issue or something went wrong at client side. If possible share error log and sessionId if possible.

RF2023 commented 7 months ago

Errors: Error: browserType.connect: Timeout 30000ms exceeded, this is happening only with 10 workers and nothing else running in parallel. Test timeout of 180000ms exceeded(Session id: 3431c997-3f24-401d-b0ed-df0021796a24) Session id: 2c157fa2-674d-4641-856b-93463c54d511

puagarwa commented 7 months ago

HI @RF2023 Thanks for sharing details. We looked into these sessions and below are finding

miguelofbc commented 7 months ago

@puagarwa could you please elaborate on what is the issue and if that is being taken care?

for "Error: browserType.connect: Timeout 30000ms exceeded" : This is our infra issue.

puagarwa commented 7 months ago

@puagarwa could you please elaborate on what is the issue and if that is being taken care?

for "Error: browserType.connect: Timeout 30000ms exceeded" : This is our infra issue.

This is due to underlying browser got into bad state during the connection time and hence we couldnt connect your request. Ideally, this should be rare but we are planning a long term fix for it this month which should take care of it.

RF2023 commented 6 months ago

@puagarwa do you have an update on when it will be released??

puagarwa commented 6 months ago

@puagarwa do you have an update on when it will be released??

Hi @RF2023, It is currently in our backlog with no confirmed date of release. We are also monitoring telemetry to understand its frequency and accordingly bump up priority. Please let us know if you are facing it with more frequency.

RF2023 commented 5 months ago

@puagarwa this issue can be closed as you confirmed us that there is no date for the release yet.

RF2023 commented 4 months ago

Hello folks, in our latest run we observed a new error which i am attaching below and some tests failed to start on playwright service with the following error. All tests ran on westus3 region. Can you guys please confirm if there was an issue or update that was going on when we ran our tests as our results dropped significantly on this particular run.

Screenshot 2024-07-04 at 12 06 52 AM

Test was skipped due to setup error: There is no session id or trace information attached in the report for this error

Screenshot 2024-07-04 at 12 06 16 AM

Session ids for tests that failed to start:

  1. 03343a27-30e7-4a60-a739-6d3bf6e73c1e browserContext.newPage Error: browserContext.newPage: Test timeout of 300000ms exceeded. Test timeout of 300000ms exceeded while running "beforeEach" hook.

  2. 0724a94a-7522-4683-b07d-d57d3b6dfe49 browserContext.newPage Error: browserContext.newPage: Test timeout of 300000ms exceeded. browserContext.newPage Test timeout of 300000ms exceeded while setting up "page".

  3. 50b1110b-c06c-4b1b-a477-6b0d3ec63294 Error: page.goto: Test timeout of 300000ms exceeded. Call log: navigating to "http://localhost:4000/", waiting until "load" Error: page.goto: Test timeout of 300000ms exceeded. Call log: navigating to "http://localhost:4000/", waiting until "load" at HomePage.navigateToHomepage (/runner/_work/xxx/xx/xx-test-pw/src/pages/xxpage.ts:24:21) at /runner/_work/xx/xx/xx-test-pw/src/tests/xxxx/xx/xxx.test.ts:17:22

Error: page.screenshot: Test timeout of 300000ms exceeded. Call log: taking page screenshot waiting for fonts to load... fonts loaded

  1. 8d36f12c-6ce9-4d00-b6b7-26ae3e735138 browserContext.newPage Error: browserContext.newPage: Test timeout of 300000ms exceeded. browserContext.newPage Test timeout of 300000ms exceeded while running "beforeEach" hook.

  2. 96872a93-1ba2-4611-910f-3dcc0a2ee12c browserContext.newPage Error: browserContext.newPage: Test timeout of 300000ms exceeded. browserContext.newPage Test timeout of 300000ms exceeded while running "beforeEach" hook.

  3. 51dafd6f-bb50-42f7-b500-6a44381fe239 browserContext.newPage Error: browserContext.newPage: Test timeout of 300000ms exceeded. browserContext.newPage Test timeout of 300000ms exceeded while running "beforeEach" hook.

puagarwa commented 4 months ago

HI @RF2023, I looked into our telemetry and for all of these connections there wasnt any issue in browser allocation. Enfact the error is also showing that issue came during browserContext.newPage which means browser launch is successful.

I think the endpoint you are trying to open has some issue and hence its timing out. Can you please check your setup.

Also, i observe in above log you are using localhost , are you using exposeNetwork in servic config ? if yes, only thing need to check is this localhost endpoint is working fine.

miguelofbc commented 4 months ago

Hello @puagarwa, thanks for following up! When we run the exact same tests in our infrastructure, everything works as expected. We are only seeing this issue happening while running in ms pw service. The setup is just fine on our end. Could you please further debug on your side?

Also, we do see a degradation for the last few days on ms pw service? Were there any outage?

miguelofbc commented 4 months ago

Hello @puagarwa, this is blocking us at the moment. We are running the exact same setup within our infrastructure and everything works as expected. Could you please take a look?

Some more failures for today's run.

Region -> westus3

SessionIds:

puagarwa commented 4 months ago

HI @miguelofbc, I checked your account failure in last 7 days holistically and i dont find many failures due to service infra. The above issue around browserContext.newPage timeout is mostly due to page load issue and we should check your local/public webapp. Can you please enable tracing/snapshot to see why newPage is timing out ? you can also try running it from local machine to easily debug instead of CI.

Also, when you use service mode, where is your endpoint hosted? is it localhost or public endpoint or private endpoint ?

Also I observe that on June, 02 around 14:30 to 15:00 UTC, there was some timeouts which is due to throttling. Basically there might be more parallel pipelines ran from your side which have exceeded the quota and hence start throttling and hits client timeout. You could raise quota request or increase timeout from service config. But i have only seen this issue on that one instance.

We can arrange a call with you on current blocker if needed.

miguelofbc commented 4 months ago

What would be a good time to arrange that call?

puagarwa commented 4 months ago

We are in IST time (+5.5), Monday should be good, depend upon your timezone, we can figure out a time. Please send us an email to playwrighttesting@microsoft.com and we will send you a meeting invite, please mention your timezone or preference.

RF2023 commented 4 months ago

@miguelofbc thanks for the follow up. @puagarwa please let us know if any information is required from our side. Thank you !

miguelofbc commented 4 months ago

Hello @vvs11,

As I suspected, I was unable to retrieve any logs from the GitHub Actions for the test execution step due to excessive verbosity and the job crashing before generating logs. I do have logs for all the other steps.

I also tried using Playwright 1.44.1, but it resulted in more failures compared to 1.45.0. The session IDs are as follows:

Setup:

Do you have any suggestions or alternative methods to debug this from your end?

vvs11 commented 4 months ago

Hi @miguelofbc We were able to make some progress in diagnosing this issue. Please allow us some time, we will get back with more info. Thanks