Open pjsg opened 10 months ago
I was able to repro in 1 out of 5 runs. However, I was not able to repro with the following snippet. Not yet sure what's going on.
from playwright.async_api import async_playwright
import asyncio

async def doit(url):
    print(f"Processing {url}")
    async with async_playwright() as p:
        browser_type = p.chromium
        browser = await browser_type.launch(
            headless=True,
        )
        try:
            page = await browser.new_page(
                bypass_csp=True,
                ignore_https_errors=True,
            )
            res = await page.goto(url, wait_until="load", timeout=30 * 1000)
            await page.wait_for_load_state(state="networkidle")
        except Exception as e:
            print(f"Got exception {e}")
            raise e
        finally:
            await browser.close()

asyncio.run(doit("https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html"))
It appears that the browser.close() call is the key difference. In @dgozman's example it is executed, whereas in my example it is not (the exception has already been thrown by then). Having said that, if you don't call close(), a different exception is thrown on other URLs, e.g. https://cnn.com/
I'm unfortunately not able to reproduce it. I tried to repro running 10 times on macOS with Python 3.10 and Python 3.12.
Closing for now since we can't reproduce it.
I don't think this should be closed. I can reproduce the error. Whenever there is a timeout error it appears that the event loop is closing, resulting in an Invalid state.
In [3]: from playwright.async_api import async_playwright
   ...: import asyncio
   ...:
   ...: async def doit(url):
   ...:     print(f"Processing {url}")
   ...:     try:
   ...:         async with async_playwright() as p:
   ...:
   ...:             browser_type = p.chromium
   ...:
   ...:             browser = await browser_type.launch(
   ...:                 headless=True,
   ...:             )
   ...:
   ...:             page = await browser.new_page(
   ...:                 bypass_csp=True,
   ...:                 ignore_https_errors=True,
   ...:             )
   ...:
   ...:             res = await page.goto(url, wait_until="load", timeout=30 * 1000)
   ...:
   ...:             await page.wait_for_load_state(state="networkidle")
   ...:             await browser.close()
   ...:
   ...:     except Exception as e:
   ...:         print(f"Got exception {e}")
   ...:         raise e
   ...:
   ...: asyncio.run(doit("https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html"))
Processing https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html
Got exception Timeout 30000ms exceeded.
---------------------------------------------------------------------------
TimeoutError Traceback (most recent call last)
Cell In[3], line 29
26 print(f"Got exception {e}")
27 raise e
---> 29 asyncio.run(doit("https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html"))
File ~/.pyenv/versions/3.10.6/lib/python3.10/asyncio/runners.py:44, in run(main, debug)
42 if debug is not None:
43 loop.set_debug(debug)
---> 44 return loop.run_until_complete(main)
45 finally:
46 try:
File ~/.pyenv/versions/3.10.6/lib/python3.10/asyncio/base_events.py:646, in BaseEventLoop.run_until_complete(self, future)
643 if not future.done():
644 raise RuntimeError('Event loop stopped before Future completed.')
--> 646 return future.result()
Cell In[3], line 27, in doit(url)
25 except Exception as e:
26 print(f"Got exception {e}")
---> 27 raise e
Cell In[3], line 20, in doit(url)
11 browser = await browser_type.launch(
12 headless=True,
13 )
15 page = await browser.new_page(
16 bypass_csp=True,
17 ignore_https_errors=True,
18 )
---> 20 res = await page.goto(url, wait_until="load", timeout=30 * 1000)
22 await page.wait_for_load_state(state="networkidle")
23 await browser.close()
File ~/Desktop/open-source/playwright-python/playwright/async_api/_generated.py:8612, in Page.goto(self, url, timeout, wait_until, referer)
8551 async def goto(
8552 self,
8553 url: str,
(...)
8559 referer: typing.Optional[str] = None
8560 ) -> typing.Optional["Response"]:
8561 """Page.goto
8562
8563 Returns the main resource response. In case of multiple redirects, the navigation will resolve with the first
(...)
8608 Union[Response, None]
8609 """
8611 return mapping.from_impl_nullable(
-> 8612 await self._impl_obj.goto(
8613 url=url, timeout=timeout, waitUntil=wait_until, referer=referer
8614 )
8615 )
File ~/Desktop/open-source/playwright-python/playwright/_impl/_page.py:500, in Page.goto(self, url, timeout, waitUntil, referer)
493 async def goto(
494 self,
495 url: str,
(...)
498 referer: str = None,
499 ) -> Optional[Response]:
--> 500 return await self._main_frame.goto(**locals_to_params(locals()))
File ~/Desktop/open-source/playwright-python/playwright/_impl/_frame.py:145, in Frame.goto(self, url, timeout, waitUntil, referer)
135 async def goto(
136 self,
137 url: str,
(...)
140 referer: str = None,
141 ) -> Optional[Response]:
142 return cast(
143 Optional[Response],
144 from_nullable_channel(
--> 145 await self._channel.send("goto", locals_to_params(locals()))
146 ),
147 )
File ~/Desktop/open-source/playwright-python/playwright/_impl/_connection.py:59, in Channel.send(self, method, params)
58 async def send(self, method: str, params: Dict = None) -> Any:
---> 59 return await self._connection.wrap_api_call(
60 lambda: self.inner_send(method, params, False)
61 )
File ~/Desktop/open-source/playwright-python/playwright/_impl/_connection.py:509, in Connection.wrap_api_call(self, cb, is_internal)
507 self._api_zone.set(_extract_stack_trace_information_from_stack(st, is_internal))
508 try:
--> 509 return await cb()
510 finally:
511 self._api_zone.set(None)
File ~/Desktop/open-source/playwright-python/playwright/_impl/_connection.py:97, in Channel.inner_send(self, method, params, return_as_dict)
95 if not callback.future.done():
96 callback.future.cancel()
---> 97 result = next(iter(done)).result()
98 # Protocol now has named return values, assume result is one level deeper unless
99 # there is explicit ambiguity.
100 if not result:
TimeoutError: Timeout 30000ms exceeded.
I am facing a similar problem with my scraper as well. The entire code base is really large so I can't post it here. The scraper is supposed to scrape about 1400+ pages, and each page has a timeout of about 10 seconds. The process should take about 12+ hours without any errors.
Where this error happens isn't exactly consistent, but it seems to occur after about 3 hours of scraping, at around 350 links. The error only surfaces when I stop the Python program; it does not terminate the script on its own the way an unhandled exception normally would.
Some measures taken to workaround:
Edit: Happens on Python 3.10 on macOS and Python 3.11 on Windows.
Another stacktrace:
.venv/lib/python3.11/site-packages/playwright/_impl/_connection.py:296, in Connection.cleanup(self, cause)
294 ws_connection._transport.dispose()
295 for callback in self._callbacks.values():
--> 296 callback.future.set_exception(self._closed_error)
297 self._callbacks.clear()
298 self.emit("close")
With anyio:
async with (
    async_playwright() as p,
    create_task_group() as tg
):
    browser = await p.chromium.launch()
    list_spider = await SpiderAPI[ListingLink, ListPageLink].create(browser)
    tg.start_soon(list_spider.run, spider_list(config))  # curried
    await sleep(5)
    tg.cancel_scope.cancel()
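For reference, the InvalidStateError in the stack trace above is what asyncio raises when set_exception() is called on a future that is already done, for example one that was cancelled. Below is a minimal sketch in plain asyncio with no Playwright involved. The function name demo_invalid_state and the error message are hypothetical, and whether this is the exact path being hit in this issue is an assumption, not something confirmed here.

```python
import asyncio


async def demo_invalid_state() -> str:
    loop = asyncio.get_running_loop()
    fut = loop.create_future()

    # Simulate a pending call whose future gets cancelled (e.g. after a timeout).
    fut.cancel()

    try:
        # A later cleanup step that blindly fails every pending future.
        fut.set_exception(RuntimeError("Connection closed"))
    except asyncio.InvalidStateError:
        outcome = "InvalidStateError"
    else:
        outcome = "ok"

    # Guarded variant: only touch futures that are still pending.
    if not fut.done():
        fut.set_exception(RuntimeError("Connection closed"))

    return outcome


result = asyncio.run(demo_invalid_state())
print(result)
```

The guarded variant at the end shows one possible way to avoid the error: check done() before setting a result or exception on a future that some other code path may already have completed or cancelled.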
Also facing this issue:
async with async_playwright() as playwright:
    browser = await playwright.chromium.launch(headless=True)
    await asyncio.gather(
        *[
            execute_in_task(
                settings, producer, session_factory, browser, shutdown_event, i
            )
            for i in range(settings.max_workers)
        ],
        return_exceptions=True,
    )
    await browser.close()
Process SpawnProcess-3:
Traceback (most recent call last):
  File "/Users/awtkns/PycharmProjects/deworkd/python/deworker/deworker/main.py", line 142, in loop
    await browser.close()
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/async_api/_generated.py", line 14581, in close
    return mapping.from_maybe_impl(await self._impl_obj.close(reason=reason))
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/_impl/_browser.py", line 189, in close
    raise e
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/_impl/_browser.py", line 186, in close
    await self._channel.send("close", {"reason": reason})
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 63, in send
    return await self._connection.wrap_api_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 495, in wrap_api_call
    return await cb()
           ^^^^^^^^^^
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 101, in inner_send
    result = next(iter(done)).result()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
Exception: Connection closed while reading from the driver

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.11/3.11.9/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/homebrew/Cellar/python@3.11/3.11.9/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/awtkns/PycharmProjects/deworkd/python/deworker/deworker/main.py", line 154, in main
    asyncio.run(loop(settings))
  File "/opt/homebrew/Cellar/python@3.11/3.11.9/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.9/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.9/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/awtkns/PycharmProjects/deworkd/python/deworker/deworker/main.py", line 130, in loop
    async with semaphore, async_playwright() as playwright:
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/async_api/_context_manager.py", line 58, in __aexit__
    await self._connection.stop_async()
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 289, in stop_async
    self.cleanup()
  File "/Users/awtkns/Library/Caches/pypoetry/virtualenvs/deworkd-77yazfm4-py3.11/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 300, in cleanup
    callback.future.set_exception(self._closed_error)
asyncio.exceptions.InvalidStateError: invalid state
I have just randomly encountered a very similar bug in almost bare asyncio on Python 3.11, without Playwright or any other significant library. So this may well be an asyncio bug itself. I'll dig further and report back as soon as I find more.
Any update? I have been facing the same invalid state error.
The unresolved and closed issue #2612 may explain this bug. As mentioned there:
Without calling drain(), there’s a risk that data written to the subprocess may remain in the internal buffer and not be sent, potentially leading to data loss or communication issues.
The relevant code is in playwright-python/playwright/_impl/_transport.py, specifically PipeTransport.send.
I suspect this is related, as the underlying issue appears to be a TimeoutError in the subprocess connection, specifically here:
playwright-python/playwright/_impl/_connection.py:97, in Channel.inner_send(self, method, params, return_as_dict)
(as shown above in my repro)
I will look into this further.
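To illustrate the drain() point in general terms, here is a minimal sketch of writing to a subprocess pipe with asyncio and awaiting drain() after the write. It is not Playwright's transport code: the echo child process and the function name echo_via_subprocess are made up for the example, and it only shows the write-then-drain pattern the quoted comment refers to.

```python
import asyncio
import sys


async def echo_via_subprocess(payload: bytes) -> bytes:
    # Hypothetical child process: copies everything from stdin to stdout.
    proc = await asyncio.create_subprocess_exec(
        sys.executable,
        "-c",
        "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
    )
    assert proc.stdin is not None and proc.stdout is not None

    # write() only queues the bytes in the transport's buffer...
    proc.stdin.write(payload)
    # ...drain() waits until the buffer has been flushed far enough to continue.
    await proc.stdin.drain()

    # Closing stdin signals EOF so the child can finish and exit.
    proc.stdin.close()
    output = await proc.stdout.read()
    await proc.wait()
    return output


echoed = asyncio.run(echo_via_subprocess(b"ping\n"))
print(echoed)
```

Whether adding a drain() call in the driver transport would actually fix the timeout reported here is untested; the sketch only demonstrates the asyncio API involved.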
System info
Source code
Steps
Expected
It should complete without error.
Actual