rubycdp / ferrum

Headless Chrome Ruby API
https://ferrum.rubycdp.com
MIT License

How to prevent headless Chrome from using too much memory (memory leak, maybe?) #228

Closed zw963 closed 2 years ago

zw963 commented 2 years ago

Hi, the following is a headless Chrome process running on my VPS, which has 2 GB of memory. Basically, it runs for 2 minutes and then idles (just calls sleep) for 4 minutes. But after running for several hours, that single headless Chrome process consumes 600 MB+ of memory, which almost breaks my VPS.

image

I use Capybara + Cuprite for this scraping script. I hope to hear some ideas for avoiding excessive memory use. (BTW: because this script needs to log in, logging in too frequently is not a good solution in this case.)

Thank you.

zw963 commented 2 years ago

Oops, I looked at some logs: my process is broken because of Capybara::Cuprite::ObsoleteNode or Ferrum::TimeoutError exceptions, and it keeps retrying. Anyway, I would still like to hear your advice. Thank you.

zw963 commented 2 years ago

It seems that the main reason headless Chrome is not working correctly is a problem with its websocket connection:

ruby/3.0.0/gems/ferrum-0.11/lib/ferrum/browser/process.rb:149:in `parse_ws_url': Browser did not produce websocket url within 10 seconds, try to increase `:process_timeout`. See https://github.com/rubycdp/ferrum#customization (Ferrum::ProcessTimeoutError)

I tried adding :process_timeout, but no luck. It just keeps waiting, because some elements that depend on the websocket connection never appear.

route commented 2 years ago

Do you use .reset sometimes in your scripts? If not, it's a good idea to start.
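For readers unfamiliar with `.reset`, here is a minimal sketch of calling it after each visit with plain Ferrum. `visit_and_reset` is a hypothetical helper, not part of Ferrum; it only assumes an object that responds to `go_to` and `reset`, as `Ferrum::Browser` does. With Capybara + Cuprite the analogous call would presumably be `session.reset!`.

```ruby
# Hypothetical helper (not part of Ferrum): visit a URL, let the caller
# extract what it needs, then always reset the browser, even if the
# block raises.
def visit_and_reset(browser, url)
  browser.go_to(url)
  yield browser
ensure
  browser.reset
end

# Possible usage, assuming the ferrum gem and a local Chrome are installed:
#
#   require "ferrum"
#   browser = Ferrum::Browser.new
#   html = visit_and_reset(browser, "https://example.com") { |b| b.body }
#   browser.quit
```

Resetting in an `ensure` clause means a failed visit does not leave its page behind.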

route commented 2 years ago

In general, the ideal solution is not to start Chrome and create hundreds of thousands of pages in it, but instead to use a short-lived Chrome session and then kill it, all of course running in containers with limits on memory and CPU. If this is not an option, you should definitely call .reset after visiting a page, or at least kill the whole context, as shown here: https://github.com/rubycdp/ferrum#thread-safety
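A rough sketch of both ideas with plain Ferrum: a browser that lives only for one batch of work, and a separate context per page that is disposed afterwards, following the pattern in the linked thread-safety section. `with_short_lived_browser`, `scrape_batch`, and `DEFAULT_FACTORY` are hypothetical names, and the option values are assumptions rather than recommendations.

```ruby
# Hypothetical default factory: assumes the ferrum gem and a local Chrome.
# The require is deferred so that loading this file does not need the gem.
DEFAULT_FACTORY = lambda do
  require "ferrum"
  Ferrum::Browser.new(headless: true, process_timeout: 30)
end

# Start a browser, hand it to the block, and always quit it afterwards so
# the Chrome process and its memory are released even when the block raises.
def with_short_lived_browser(factory = DEFAULT_FACTORY)
  browser = factory.call
  yield browser
ensure
  browser&.quit
end

# Visit each URL in its own context and dispose of that context when done,
# so pages do not pile up inside one long-running Chrome.
def scrape_batch(browser, urls)
  urls.map do |url|
    context = browser.contexts.create
    begin
      page = context.create_page
      page.go_to(url)
      page.body
    ensure
      context.dispose
    end
  end
end

# Possible usage (example.com is a placeholder):
#
#   pages = with_short_lived_browser do |browser|
#     scrape_batch(browser, ["https://example.com/a", "https://example.com/b"])
#   end
```

Run from a scheduler or loop, each batch then gets a fresh Chrome process, so memory held by one batch cannot carry over into the next.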

zw963 commented 2 years ago

> Do you use .reset sometimes in your scripts? If not, it's a good idea to start.

No, I don't know what .reset means. In fact, I run instance.visit(some_url) several times when scraping; that is, I open another URL using the same session (in fact, I just go directly to the main page after logging in).

So your advice is: if I'm not using multi-threaded mode, I can run instance.reset at any time without losing my session?

route commented 2 years ago

You should be able to lose the session. The session is only a cookie stored somewhere: run a fresh Chrome instance, set the cookie, visit the page as if you were logged in, and voilà.
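A minimal sketch of that cookie approach with plain Ferrum, assuming the login state really does live entirely in cookies. `dump_cookies`, `save_cookies`, and `load_cookies` are hypothetical helpers, and the file name and URL in the usage comment are placeholders.

```ruby
require "json"

# Convert the browser's cookies into plain hashes that can be stored as JSON.
# Only name, value and domain are kept here; other attributes are dropped.
def dump_cookies(browser)
  browser.cookies.all.values.map do |cookie|
    { name: cookie.name, value: cookie.value, domain: cookie.domain }
  end
end

# Write the current cookies to a JSON file after a successful login.
def save_cookies(browser, path)
  File.write(path, JSON.generate(dump_cookies(browser)))
end

# Read cookies back from the file and set them on a fresh browser.
def load_cookies(browser, path)
  JSON.parse(File.read(path), symbolize_names: true).each do |c|
    browser.cookies.set(name: c[:name], value: c[:value], domain: c[:domain])
  end
end

# Possible usage, assuming the ferrum gem and a local Chrome are installed:
#
#   require "ferrum"
#   browser = Ferrum::Browser.new
#   load_cookies(browser, "cookies.json")
#   browser.go_to("https://example.com/dashboard")
#   browser.quit
```

If the site also checks attributes such as path or expiry, those would need to be saved and restored as well.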

zw963 commented 2 years ago

Thank you.