Closed: phongtnit closed this issue 1 year ago
There is no need to patch the handler code; closing a context can be done using the existing API. I understand it might seem a bit verbose, but I don't want to create a whole DSL around this to handle context/page creation/deletion.
The new error is because you're trying to download pages with an already closed context, which makes sense if you're closing the context immediately after downloading each page. It's hard to say without knowing exactly what `self.get_domain` returns (I suppose something involving `urllib.parse.urlparse(url).netloc`, but I'm just guessing), but I suspect you might have some URLs in your list that correspond to the same domain(s). I think you could probably get good performance by grouping URLs in batches (let's say, 1K per context) and closing each context after that, but that might be too complex; a quick solution to download one response per domain and have non-clashing contexts would be to pass a `uuid.uuid4()` value as the context name for each URL.
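For illustration only, here is a minimal sketch of that uuid-per-URL idea; the spider name, start URLs and parsing are invented, and it assumes the standard `playwright` / `playwright_context` request meta keys:

```python
import uuid

import scrapy


class OneContextPerUrlSpider(scrapy.Spider):
    """Sketch: give every request its own, non-clashing context name."""

    name = "one_context_per_url"
    start_urls = ["https://example.org", "https://example.com"]  # placeholder URLs

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    # A random UUID string avoids context-name clashes when
                    # several URLs belong to the same domain.
                    "playwright_context": str(uuid.uuid4()),
                },
                callback=self.parse,
            )

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```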
Given that the underlying `Allocation failed - JavaScript heap out of memory` seems to be an upstream issue, I don't see much else we can do on this side to prevent it.
Hmm, I got the same error after a few hours when scraping just a single domain. Could it be related to error #15 which pops up a fair bit? Any way I can increase the memory heap?
Context '1': new page created, page count is 1 (1 for all contexts)
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
1: 0xa18150 node::Abort() [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
2: 0xa1855c node::OnFatalError(char const*, char const*) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
3: 0xb9715e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
4: 0xb974d9 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
5: 0xd54755 [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
6: 0xd650a8 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
7: 0xd2bd9d v8::internal::Factory::NewFixedArrayWithFiller(v8::internal::RootIndex, int, v8::internal::Object, v8::internal::AllocationType) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
8: 0xd2be90 v8::internal::Handle
Are you using a single context for this domain? If so, you're falling into https://github.com/microsoft/playwright/issues/6319.
This seems like an issue on the Node.js side of things. I'm no JS developer, so take the following with a grain of salt, but from what I've found you should be able to increase the memory limit by setting `NODE_OPTIONS=--max-old-space-size=<size>` as an environment variable.
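As a rough illustration (not from the thread), the variable can also be set from Python before the crawl starts, assuming the Playwright Node.js driver subprocess inherits the parent process environment:

```python
import os

# Must run before Playwright launches its Node.js driver subprocess,
# which inherits this process's environment variables.
# 8192 MB is an arbitrary example value.
os.environ["NODE_OPTIONS"] = "--max-old-space-size=8192"
```

Setting it in the shell (`export NODE_OPTIONS=...`) or in docker-compose, as in the comments below, achieves the same thing.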
Thank you, setting NODE_OPTIONS seems to have solved the memory issue and it can run for 24h+ without crashing in a single context.
Hi @xanrag How did you fix the `JavaScript heap out of memory` error? Which options did you set up?
Just the memory setting, I added this to my docker-compose and it seems to work:
environment:
- NODE_OPTIONS=--max-old-space-size=8192
Thanks @xanrag, I will try to test my script with the new env setting.
@xanrag Hi, did you get the "Aborted (core dumped)" error anymore?
I added `export NODE_OPTIONS=--max-old-space-size=8192` to my `~/.profile` file and ran the Scrapy script. However, the `Aborted (core dumped)` error still occurs when Scrapy Playwright has crawled more than 10k URLs, sometimes around 100k URLs.
@phongtnit I ran into this issue too, so I also created one context per page and closed the page and context at the same time, like you did.
But I'm now facing a different issue, a Chrome process fork error after more than 7000 pages; I'm still looking into it.
@xanrag Hi, did you get the "Aborted (core dumped)" error anymore?
I'm not sure. When I run Scrapy in Celery as a separate process it doesn't log to the file when it crashes. Something is still going on, though: occasionally it stalls and keeps putting out the same page/item count indefinitely without stopping, and I have another issue where it doesn't kill the Chrome process correctly, but I'll investigate more and open another issue for that if I find anything. (A week of use spawned a quarter of a million zombie processes...)
@elacuesta Hey, I'm having this problem where my computer starts freezing after 1/2 hours of running my crawler. I'm pretty sure it's due to the Playwright issue you linked (https://github.com/microsoft/playwright/issues/6319), where it keeps taking up more and more memory. It seems like a workaround is to recreate the page every x minutes, but I'm not sure how to do this.
I'm already making all Playwright requests with `playwright_context="new"` and that doesn't fix it.
I'm new to this, can you give me some pointers on how to create a new page or context (?) every x minutes? I'm currently unable to figure this out from the documentation on my own.
I've added my spider in case you're interested.
Passing `playwright_context="new"` for all requests will not make a new context for each request, it will only make all requests go through a single context named "new".
I'd recommend generating randomly named contexts, maybe using `random` or `uuid`. That said, one context per request is probably too much; perhaps a good middle point would be one context for each listing page and its derived links, i.e. use the same context for the `response.follow` calls but generate a new one for the requests that increment the listing page number (see the sketch below).
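A rough sketch of that pattern (not from the thread; the spider, selectors and URLs are made up, and it assumes the usual `playwright` / `playwright_context` meta keys):

```python
import uuid

import scrapy


class ListingSpider(scrapy.Spider):
    """Sketch: one Playwright context per listing page, reused for its item links."""

    name = "listing_sketch"

    def start_requests(self):
        yield self.listing_request("https://example.org/listing?page=1")  # placeholder URL

    def listing_request(self, url):
        # A fresh, randomly named context for each listing page.
        return scrapy.Request(
            url,
            meta={"playwright": True, "playwright_context": str(uuid.uuid4())},
            callback=self.parse_listing,
        )

    def parse_listing(self, response):
        context_name = response.meta["playwright_context"]
        # Derived item links reuse the listing page's context.
        for href in response.css("a.item::attr(href)").getall():
            yield response.follow(
                href,
                meta={"playwright": True, "playwright_context": context_name},
                callback=self.parse_item,
            )
        # The next listing page gets a brand new context.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield self.listing_request(response.urljoin(next_page))

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```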
@elacuesta Oh ok, good idea. Thanks!
After looking online I'm not 100% sure whether I have to close a context manually, or if just using a new `playwright_context="new-name"` is enough? If I have to close it manually, can you point me to the documentation about this?
If I have to close it manually, can you point me to the documentation about this?
https://github.com/scrapy-plugins/scrapy-playwright#closing-a-context-during-a-crawl
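Roughly, that README section boils down to requesting the page object and closing it together with its context from the callback. A minimal sketch, assuming the `playwright_include_page` / `playwright_page` meta keys described there (the URL and spider are placeholders):

```python
import scrapy


class CloseContextSpider(scrapy.Spider):
    """Sketch of closing a context from a callback during a crawl."""

    name = "close_context_sketch"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.org",  # placeholder URL
            meta={
                "playwright": True,
                "playwright_context": "temporary",
                # Ask scrapy-playwright to pass the Page object to the callback.
                "playwright_include_page": True,
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        title = await page.title()
        # Close the page and then its whole context to free browser memory.
        await page.close()
        await page.context.close()
        yield {"url": response.url, "title": title}
```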
Hi,
This issue is related to #18.
The error still occurred with scrapy-playwright 0.0.4. The Scrapy script crawled about 2500 of the 10k domains from the Majestic list and crashed with the error `JavaScript heap out of memory`, so I think this is a bug.
My main code:
My env:
The detail of the error:
Temporary fix: I replaced line 166 of handler.py with `await page.context.close()` to close the current context, because my script has one context per domain. This fixes the `Allocation failed - JavaScript heap out of memory` error and the Scrapy script crawled all 10k domains, but the success rate was about 72%, compared with about 85% without the added code. Also, when I added the new code, the new error was: