scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy

About memory leak #240

Closed: Harvi-C closed this issue 1 year ago

Harvi-C commented 1 year ago

yield scrapy.Request(
    url=ALL_IMAGE_URL + str(page),
    callback=self.parse,
    meta=dict(
        playwright=True,
        playwright_page_methods=[
            PageMethod("evaluate", "window.scrollBy(0, 500)"),
            PageMethod("wait_for_timeout", timeout),
        ],
    ),
)

I use the simplest startup method in Scrapy, but the machine's memory footprint keeps growing: after crawling about 700 pages, memory usage went from 4-5 GB at the start to 18 GB, and I don't know why.

I didn't turn on "playwright_include_page=True", and i set "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 16"

So is this a memory leak? And what can I do about it?

Harvi-C commented 1 year ago

I didn't turn on "playwright_include_page=True",

So do I need to explicitly call page.close() or page.context.close(), how do I get the page ?
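For what it's worth, the pattern in the scrapy-playwright README for getting hold of the page is to request playwright_include_page=True and then close the page in an async callback, plus an errback so failed requests also close their pages. A minimal sketch, with the spider name, URL, and scroll step made up for illustration:

import scrapy
from scrapy_playwright.page import PageMethod


class ImagesSpider(scrapy.Spider):  # hypothetical spider name
    name = "images"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com",  # placeholder URL
            callback=self.parse,
            errback=self.errback_close_page,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod("evaluate", "window.scrollBy(0, 500)"),
                ],
            ),
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        # ... extract items from the response here ...
        await page.close()  # release the page so Playwright can free its resources

    async def errback_close_page(self, failure):
        # Close the page even when the request fails, otherwise it stays open.
        page = failure.request.meta["playwright_page"]
        await page.close()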

This problem has forced me to keep an eye on the machine and manually restart my crawler every so often, resuming from the page number where the previous run stopped. It would be a great help if I could get your reply, thanks!

Harvi-C commented 1 year ago

I found what appears to be the same problem reported upstream: https://github.com/microsoft/playwright/issues/6319

elacuesta commented 1 year ago

Should this be closed then?

mikewronski commented 12 months ago

@Harvi-C did you find a solution to this problem? I am also seeing unbounded memory growth with Scrapy and Playwright, using a single context for all pages and without turning on playwright_include_page=True.
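Not a confirmed fix, but one workaround sketch for the single-context setup: spread requests across named contexts via the playwright_context meta key and close each page when done, so state does not accumulate in one long-lived context for the whole crawl. The batch size, context names, and URLs below are arbitrary illustrations:

import scrapy


class RotatingContextSpider(scrapy.Spider):  # hypothetical example spider
    name = "rotating_contexts"

    def start_requests(self):
        urls = [f"https://example.com/page/{n}" for n in range(1, 701)]  # placeholder URLs
        for i, url in enumerate(urls):
            yield scrapy.Request(
                url,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    # start a fresh context every 50 requests (arbitrary batch size)
                    playwright_context=f"batch-{i // 50}",
                ),
            )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        # ... parse the response here ...
        await page.close()
        # Once the last page of a batch has finished, the whole context could
        # also be closed with: await page.context.close()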