ploomber / jupysql-plugin

BSD 3-Clause "New" or "Revised" License
16 stars 12 forks source link

CI became slow #78

Open edublancas opened 1 year ago

edublancas commented 1 year ago

this commit increased the CI runtime (see this previous commit as a comparison).

we need to check what happened. I believe the commit only added new tests, so it'd be surprising that just adding a few extra tests would increase the runtime so much, but I don't see any configuration changes.

if adding new tests slowed the CI so much, we need to find a way to optimize it.

edublancas commented 1 year ago

also, I noticed that the CI became unreliable and it fails frequently due to random errors - hence, I need to restart the CI 2-3 times for it to pass

marshallwhiteorg commented 1 year ago

@edublancas @bryannho It looks like the intermittent failures are due to tests randomly reaching the timeout. I can think of a couple short-term solutions to get the CI passing reliably without addressing the underlying issue.

We could increase the timeout in playwright.config.js like

module.exports = {
  ...baseConfig,
  timeout: 120 * 1000,
  webServer: {
...

It's 60s right now.

This might not actually work if the tests that timeout would never actually finish even with a longer timeout. TBD.

We could also leave the timeout at 60s, or even decrease it, and then somehow mark some subset of the tests to ignore that error.

Regarding the underlying issue of the CI being slow, we added a bunch of tests that have this property that they randomly take a while and time out, and we might have also contributed the code that caused the tests to have that property. I have not been able to reproduce it locally, I don't get the timeouts. It looks like other people have had this issue: https://github.com/microsoft/playwright/issues/13090

To give an example of the unreliable behavior, the test that timed out in the CI linked above as "commit" (CI, code change) which is tests-completer-test-options-displayed-on-tab, doesn't always take a long time. In that instance it reached the 60s timeout, and in the log the time delta between these two lines

[WebServer] [I 2023-09-15 15:35:23.875 ServerApp] Creating new notebook in /tests-completer-test-options-displayed-on-tab
...
[WebServer] [W 2023-09-15 15:36:15.640 ServerApp] delete /tests-completer-test-options-displayed-on-tab/sample.ipynb

is like 52s. So I assume that's part of the 60s that it's using.

But in this CI from yesterday https://github.com/ploomber/jupysql-plugin/actions/runs/6383729916/job/17324954125 the test doesn't fail and the time delta between those lines

[WebServer] [I 2023-10-02 18:07:48.360 ServerApp] Creating new notebook in /tests-completer-test-options-displayed-on-tab
...
[WebServer] [W 2023-10-02 18:07:53.720 ServerApp] delete /tests-completer-test-options-displayed-on-tab/sample.ipynb

is about 5s.

Also there are tests that timed out in that CI from yesterday where the tests were written a month ago, so apparently they didn't used to have this property where they randomly time out.

Can someone more familiar with the code look at the commit and see if issues commented in https://github.com/microsoft/playwright/issues/13090 are being introduced?

bryannho commented 1 year ago

Can someone more familiar with the code look at the commit and see if issues commented in https://github.com/microsoft/playwright/issues/13090 are being introduced?

Thanks Marshall, I can take a look although I am also somewhat new to this code base- @edublancas let us know if you have specific notes on how to proceed

edublancas commented 1 year ago

@bryannho @marshallwhiteorg please work together on this. Sadly, I don't have much to say here: I'm new to javascript/typescript and this repository was mostly created from a template, so we're just using whatever configuration we inherited from the template

bryannho commented 1 year ago

@marshallwhiteorg I was able to look into this commit - it does display the same behaviors detailed in https://github.com/microsoft/playwright/issues/13090.

Interestingly, the commit already employs some of the fixes that other users mention in that issue - yet it still runs into the timeout error. Other users experienced the same issue with no major solution.

I will continue investigating, but for now, I think it's worth it to try increasing the timeout in playwright.config.js as you described.

No guarantee this will work consistently given we can't locate the reason playwright is hanging in the first place. This also may cause the tests to run even slower. In the long run, we may need to restructure all of the tests to optimize runtime and avoid the timeout issue.

marshallwhiteorg commented 1 year ago

@bryannho Okay, I'll make a PR to increase the timeout for now

marshallwhiteorg commented 1 year ago

@bryannho https://github.com/ploomber/jupysql-plugin/pull/86

marshallwhiteorg commented 1 year ago

@bryannho It still timed out with the 2m timeout, so will need a different approach https://github.com/ploomber/jupysql-plugin/actions/runs/6401180754/job/17378192452

bryannho commented 1 year ago

@marshallwhiteorg Okay, so we'll have to get to the bottom of why the tests are hanging.

I noticed that usually the timeout occurs at tests/completer.test.ts:4:5 › test options displayed on tab.

I've been looking into this test to see if it might be one of the causes. Will do some more googling and CI runs to see if we can garner any more information.

bryannho commented 1 year ago

In addition to this error I've ran into a new one:

1) tests/jupysql_plugin.test.ts:79:13 › jupysql-plugin ui-test › jupysql-plugin ui-test for notebook: "input-syntax-highlighting.ipynb" 

    Error: page.evaluate: Execution context was destroyed, most likely because of a navigation.

      24 |
      25 |     await page.notebook.openByPath(notebook);
    > 26 |     await page.notebook.activate(notebook);
         |     ^
      27 |
      28 |     let numCellImages = 0;
      29 |

I got here by commenting out tests/completer.test.ts completely. I'm not sure if this error is a symptom of the same issue or a completely different one. Worth looking into

bryannho commented 1 year ago

@marshallwhiteorg @edublancas After debugging the CI on my own fork I have not been able to come up with a viable solution. Some steps I took:

  1. Recording a trace to investigate specific instances where the test hangs
  2. Implementing waitForSelector or waitForNavigation to ensure the test waits for necessary items to load before performing further actions
  3. Commenting out specific tests that fail frequently
  4. Combing through the code to check for incorrect uses of async/await or other small errors
  5. Modifying the CI to re-try failed checks multiple times

Isolating the problem has been difficult. While debugging, I've found that when one test succeeds/is commented out, a completely different test fails or a new error arises. Overall the tests are flaky and unreliable.

This leads me to believe there is some race condition that is causing the UI tests to fall out of sync/components to hang but I haven't been able to get deeper than that. My best guess is a resource issue with the CI or that our tests are structured in a way that encourages these race conditions (incorrect use of async/await or page).

For now, without getting to the root of the issue, it might be helpful to configure the CI/playwright to re-try failed tests 2-3 times each.

Perhaps someone with more familiarity/bandwidth could review the testing strategies we use in detail. But to best utilize time I'll move onto other issues for now.

Here were the most helpful resources I found:

edublancas commented 1 year ago

@bryannho sure, I can assign a few other things.

@marshallwhiteorg what's the conclusion on this?

marshallwhiteorg commented 1 year ago

@edublancas @bryannho I suggest we reduce the timeout as much as we can by measuring the max duration of the successful tests and then implement the retry mechanism. How does that sound?

edublancas commented 1 year ago

yeah sure go ahead