microsoft / playwright-python

Python version of the Playwright testing and automation library.
https://playwright.dev/python/
Apache License 2.0

[BUG] asyncio.exceptions.InvalidStateError: invalid state thrown by exit in async context manager #2238

Open pjsg opened 10 months ago

pjsg commented 10 months ago

System info

Source code

from playwright.async_api import async_playwright
import asyncio

async def doit(url):
    print(f"Processing {url}")
    try:
        async with async_playwright() as p:

                browser_type = p.chromium

                browser = await browser_type.launch(
                    headless=True,
                )

                page = await browser.new_page(
                    bypass_csp=True,
                    ignore_https_errors=True,
                )

                res = await page.goto(url, wait_until="load", timeout=30 * 1000)

                await page.wait_for_load_state(state="networkidle")
                await browser.close()

    except Exception as e:
        print(f"Got exception {e}")
        raise e

asyncio.run(doit("https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html"))

Steps

Expected

It should complete without error.

Actual

Processing https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html
Got exception invalid state
Traceback (most recent call last):
  File "/Users/philip/play-dir/playtest.py", line 22, in doit
    await page.wait_for_load_state(state="networkidle")
  File "/Users/philip/.pyenv/versions/play-dir/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 9367, in wait_for_load_state
    await self._impl_obj.wait_for_load_state(state=state, timeout=timeout)
  File "/Users/philip/.pyenv/versions/play-dir/lib/python3.10/site-packages/playwright/_impl/_page.py", line 491, in wait_for_load_state
    return await self._main_frame.wait_for_load_state(**locals_to_params(locals()))
  File "/Users/philip/.pyenv/versions/play-dir/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 237, in wait_for_load_state
    return await self._wait_for_load_state_impl(state, timeout)
  File "/Users/philip/.pyenv/versions/play-dir/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 265, in _wait_for_load_state_impl
    await waiter.result()
playwright._impl._errors.TimeoutError: Timeout 30000ms exceeded.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/philip/play-dir/playtest.py", line 29, in <module>
    asyncio.run(doit("https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html"))
  File "/Users/philip/.pyenv/versions/3.10.7/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/Users/philip/.pyenv/versions/3.10.7/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/Users/philip/play-dir/playtest.py", line 27, in doit
    raise e
  File "/Users/philip/play-dir/playtest.py", line 7, in doit
    async with async_playwright() as p:
  File "/Users/philip/.pyenv/versions/play-dir/lib/python3.10/site-packages/playwright/async_api/_context_manager.py", line 58, in __aexit__ 
    await self._connection.stop_async()
  File "/Users/philip/.pyenv/versions/play-dir/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 288, in stop_async
    self.cleanup()
  File "/Users/philip/.pyenv/versions/play-dir/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 299, in cleanup
    callback.future.set_exception(self._closed_error)
asyncio.exceptions.InvalidStateError: invalid state

dgozman commented 10 months ago

I was able to repro in 1 out of 5 runs. However, I was not able to repro with the following snippet. Not yet sure what's going on.

from playwright.async_api import async_playwright
import asyncio

async def doit(url):
    print(f"Processing {url}")

    async with async_playwright() as p:
        browser_type = p.chromium
        browser = await browser_type.launch(
            headless=True,
        )

        try:
            page = await browser.new_page(
                bypass_csp=True,
                ignore_https_errors=True,
            )
            res = await page.goto(url, wait_until="load", timeout=30 * 1000)
            await page.wait_for_load_state(state="networkidle")
        except Exception as e:
            print(f"Got exception {e}")
            raise e
        finally:
            await browser.close()

asyncio.run(doit("https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html"))

pjsg commented 10 months ago

It appears that browser.close() is the key difference. In @dgozman's example it is executed, whereas in my example it is not (the exception has already been thrown by then). Having said that, if you don't call close(), it throws a different exception on other URLs, for example https://cnn.com/
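
The point about close() being skipped can be checked without Playwright. This is only a sketch: FakeBrowser is a hypothetical stand-in (not a Playwright class) whose wait always raises, and the two functions mirror the two shapes of the snippets above. It says nothing about why the later cleanup fails, only that a close() placed after the failing await never runs.

```python
import asyncio


class FakeBrowser:
    """Hypothetical stand-in for a browser handle; not a Playwright class."""

    def __init__(self) -> None:
        self.closed = False

    async def wait_for_idle(self) -> None:
        # Simulates a wait that times out.
        raise TimeoutError("Timeout 30000ms exceeded.")

    async def close(self) -> None:
        self.closed = True


async def close_after_wait() -> bool:
    browser = FakeBrowser()
    try:
        await browser.wait_for_idle()
        await browser.close()  # never reached: the line above raised
    except TimeoutError:
        pass
    return browser.closed


async def close_in_finally() -> bool:
    browser = FakeBrowser()
    try:
        await browser.wait_for_idle()
    except TimeoutError:
        pass
    finally:
        await browser.close()  # runs whether or not the wait raised
    return browser.closed
```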

mxschmitt commented 9 months ago

I'm unfortunately not able to reproduce it. I tried to repro running 10 times on macOS with Python 3.10 and Python 3.12.

mxschmitt commented 9 months ago

Closing for now since we can't reproduce it.

danphenderson commented 9 months ago

I don't think this should be closed. I can reproduce the error. Whenever there is a timeout error, it appears that the event loop is closing, which results in the invalid state.

In [3]: from playwright.async_api import async_playwright
   ...: import asyncio
   ...:
   ...: async def doit(url):
   ...:     print(f"Processing {url}")
   ...:     try:
   ...:         async with async_playwright() as p:
   ...:
   ...:                 browser_type = p.chromium
   ...:
   ...:                 browser = await browser_type.launch(
   ...:                     headless=True,
   ...:                 )
   ...:
   ...:                 page = await browser.new_page(
   ...:                     bypass_csp=True,
   ...:                     ignore_https_errors=True,
   ...:                 )
   ...:
   ...:                 res = await page.goto(url, wait_until="load", timeout=30 * 1000)
   ...:
   ...:                 await page.wait_for_load_state(state="networkidle")
   ...:                 await browser.close()
   ...:
   ...:     except Exception as e:
   ...:         print(f"Got exception {e}")
   ...:         raise e
   ...:
   ...: asyncio.run(doit("https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html"))
Processing https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html
Got exception Timeout 30000ms exceeded.
---------------------------------------------------------------------------
TimeoutError                              Traceback (most recent call last)
Cell In[3], line 29
     26         print(f"Got exception {e}")
     27         raise e
---> 29 asyncio.run(doit("https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html"))

File ~/.pyenv/versions/3.10.6/lib/python3.10/asyncio/runners.py:44, in run(main, debug)
     42     if debug is not None:
     43         loop.set_debug(debug)
---> 44     return loop.run_until_complete(main)
     45 finally:
     46     try:

File ~/.pyenv/versions/3.10.6/lib/python3.10/asyncio/base_events.py:646, in BaseEventLoop.run_until_complete(self, future)
    643 if not future.done():
    644     raise RuntimeError('Event loop stopped before Future completed.')
--> 646 return future.result()

Cell In[3], line 27, in doit(url)
     25 except Exception as e:
     26     print(f"Got exception {e}")
---> 27     raise e

Cell In[3], line 20, in doit(url)
     11 browser = await browser_type.launch(
     12     headless=True,
     13 )
     15 page = await browser.new_page(
     16     bypass_csp=True,
     17     ignore_https_errors=True,
     18 )
---> 20 res = await page.goto(url, wait_until="load", timeout=30 * 1000)
     22 await page.wait_for_load_state(state="networkidle")
     23 await browser.close()

File ~/Desktop/open-source/playwright-python/playwright/async_api/_generated.py:8612, in Page.goto(self, url, timeout, wait_until, referer)
   8551 async def goto(
   8552     self,
   8553     url: str,
   (...)
   8559     referer: typing.Optional[str] = None
   8560 ) -> typing.Optional["Response"]:
   8561     """Page.goto
   8562
   8563     Returns the main resource response. In case of multiple redirects, the navigation will resolve with the first
   (...)
   8608     Union[Response, None]
   8609     """
   8611     return mapping.from_impl_nullable(
-> 8612         await self._impl_obj.goto(
   8613             url=url, timeout=timeout, waitUntil=wait_until, referer=referer
   8614         )
   8615     )

File ~/Desktop/open-source/playwright-python/playwright/_impl/_page.py:500, in Page.goto(self, url, timeout, waitUntil, referer)
    493 async def goto(
    494     self,
    495     url: str,
   (...)
    498     referer: str = None,
    499 ) -> Optional[Response]:
--> 500     return await self._main_frame.goto(**locals_to_params(locals()))

File ~/Desktop/open-source/playwright-python/playwright/_impl/_frame.py:145, in Frame.goto(self, url, timeout, waitUntil, referer)
    135 async def goto(
    136     self,
    137     url: str,
   (...)
    140     referer: str = None,
    141 ) -> Optional[Response]:
    142     return cast(
    143         Optional[Response],
    144         from_nullable_channel(
--> 145             await self._channel.send("goto", locals_to_params(locals()))
    146         ),
    147     )

File ~/Desktop/open-source/playwright-python/playwright/_impl/_connection.py:59, in Channel.send(self, method, params)
     58 async def send(self, method: str, params: Dict = None) -> Any:
---> 59     return await self._connection.wrap_api_call(
     60         lambda: self.inner_send(method, params, False)
     61     )

File ~/Desktop/open-source/playwright-python/playwright/_impl/_connection.py:509, in Connection.wrap_api_call(self, cb, is_internal)
    507 self._api_zone.set(_extract_stack_trace_information_from_stack(st, is_internal))
    508 try:
--> 509     return await cb()
    510 finally:
    511     self._api_zone.set(None)

File ~/Desktop/open-source/playwright-python/playwright/_impl/_connection.py:97, in Channel.inner_send(self, method, params, return_as_dict)
     95 if not callback.future.done():
     96     callback.future.cancel()
---> 97 result = next(iter(done)).result()
     98 # Protocol now has named return values, assume result is one level deeper unless
     99 # there is explicit ambiguity.
    100 if not result:

TimeoutError: Timeout 30000ms exceeded.

yijiyap commented 7 months ago

I am facing a similar problem with my scraper as well. The entire code base is really large so I can't post it here. The scraper is supposed to scrape about 1400+ pages, and each page has a timeout of about 10 seconds. The process should take about 12+ hours without any errors.

The point at which this error occurs isn't consistent, but it tends to be after about 3 hours of scraping, at around 350 links. The error only shows up when I stop the Python programme; it does not terminate the script on its own the way an unhandled error normally would.

Some measures taken to workaround:

Edit: Happens on Python 3.10 on macOS and Python 3.11 on Windows.

haf commented 7 months ago

Another stacktrace:

 .venv/lib/python3.11/site-packages/playwright/_impl/_connection.py:296, in Connection.cleanup(self, cause)
     294     ws_connection._transport.dispose()
     295 for callback in self._callbacks.values():
 --> 296     callback.future.set_exception(self._closed_error)
     297 self._callbacks.clear()
     298 self.emit("close")
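
The failing call in this frame can be reproduced with plain asyncio. A minimal sketch, assuming (this is not confirmed anywhere in the thread) that the future has already been cancelled or resolved by the time cleanup reaches it: calling set_exception() on a finished future raises InvalidStateError, and checking done() first avoids it. The helper name fail_pending is hypothetical and is not part of Playwright.

```python
import asyncio


def fail_pending(future: asyncio.Future, error: Exception) -> bool:
    """Hypothetical helper: set the error only if the future is still pending."""
    if future.done():  # cancelled futures also report done() == True
        return False
    future.set_exception(error)
    return True


async def demo() -> tuple[str, bool]:
    loop = asyncio.get_running_loop()
    cancelled = loop.create_future()
    cancelled.cancel()
    try:
        # Same kind of call as in the traceback, on an already-cancelled future.
        cancelled.set_exception(Exception("Connection closed"))
        outcome = "no error"
    except asyncio.InvalidStateError:
        outcome = "InvalidStateError"
    # The guarded variant skips the finished future instead of raising.
    guarded = fail_pending(cancelled, Exception("Connection closed"))
    return outcome, guarded
```

Running asyncio.run(demo()) shows which branch is taken for the unguarded call and whether the guarded helper touched the future.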

With anyio:

async with (
    async_playwright() as p,
    create_task_group() as tg
):
    browser = await p.chromium.launch()
    list_spider = await SpiderAPI[ListingLink, ListPageLink].create(browser)
    tg.start_soon(list_spider.run, spider_list(config)) # curried
    await sleep(5)
    tg.cancel_scope.cancel()

awtkns commented 5 months ago

I'm also facing this issue:

async with async_playwright() as playwright:
    browser = await playwright.chromium.launch(headless=True)

    await asyncio.gather(
        *[
            execute_in_task(
                settings, producer, session_factory, browser, shutdown_event, i
            )
            for i in range(settings.max_workers)
        ],
        return_exceptions=True,
    )
    await browser.close()

Process SpawnProcess-3:
Traceback (most recent call last):
  File "/Users/awtkns/PycharmProjects/deworkd/python/deworker/deworker/main.py", line 142, in loop
    await browser.close()
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/async_api/_generated.py", line 14581, in close
    return mapping.from_maybe_impl(await self._impl_obj.close(reason=reason))
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/_impl/_browser.py", line 189, in close
    raise e
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/_impl/_browser.py", line 186, in close
    await self._channel.send("close", {"reason": reason})
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 63, in send
    return await self._connection.wrap_api_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 495, in wrap_api_call
    return await cb()
           ^^^^^^^^^^
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 101, in inner_send
    result = next(iter(done)).result()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
Exception: Connection closed while reading from the driver

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.11/3.11.9/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/homebrew/Cellar/python@3.11/3.11.9/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/awtkns/PycharmProjects/deworkd/python/deworker/deworker/main.py", line 154, in main
    asyncio.run(loop(settings))
  File "/opt/homebrew/Cellar/python@3.11/3.11.9/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.9/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.9/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/awtkns/PycharmProjects/deworkd/python/deworker/deworker/main.py", line 130, in loop
    async with semaphore, async_playwright() as playwright:
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/async_api/_context_manager.py", line 58, in __aexit__
    await self._connection.stop_async()
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 289, in stop_async
    self.cleanup()
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 300, in cleanup
    callback.future.set_exception(self._closed_error)
asyncio.exceptions.InvalidStateError: invalid state

marenamat commented 4 months ago

I have just run into a very similar bug in almost bare asyncio on Python 3.11, without Playwright or any other significant library. So this may well be a bug in asyncio itself. I'll dig further and report back as soon as I find more.

ghost commented 2 months ago

Any update? I have been facing the same invalid state error.

danphenderson commented 3 weeks ago

The unresolved and closed issue #2612 may explain this bug. As mentioned,

Without calling drain(), there’s a risk that data written to the subprocess may remain in the internal buffer and not be sent, potentially leading to data loss or communication issues.

The drain() call would go in playwright-python/playwright/_impl/_transport.py, specifically in PipeTransport.send.
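
For reference, the general asyncio pattern being described looks like the sketch below. It is not Playwright's transport code: the child process is just a Python one-liner that echoes stdin back, and send_to_child is a hypothetical name. StreamWriter.write() only buffers the data; awaiting drain() waits until the buffer has emptied enough for writing to resume, which is what applies backpressure.

```python
import asyncio
import sys


async def send_to_child(payload: bytes) -> bytes:
    """Write payload to a child process's stdin and return what it echoes back."""
    proc = await asyncio.create_subprocess_exec(
        sys.executable,
        "-c",
        "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
    )
    assert proc.stdin is not None and proc.stdout is not None

    proc.stdin.write(payload)  # buffers the bytes; may not have sent them yet
    await proc.stdin.drain()   # yields until the write buffer has been flushed enough
    proc.stdin.close()         # signals EOF so the child stops reading

    echoed = await proc.stdout.read()  # read until the child closes stdout
    await proc.wait()
    return echoed
```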

I suspect the missing drain() call is related, as the underlying issue appears to be a TimeoutError in the subprocess connection, specifically here: playwright-python/playwright/_impl/_connection.py:97, in Channel.inner_send(self, method, params, return_as_dict) (as shown above in my repro).

I will look into this further.