rubycdp / ferrum

Headless Chrome Ruby API
https://ferrum.rubycdp.com
MIT License

Same scraping script does not work on VPS (tested on two VPSes, one CentOS 8, one Ubuntu 20.04), but works on local machine (Arch Linux). #231

Closed zw963 closed 2 years ago

zw963 commented 2 years ago

This is a very strange issue. To test it, I deployed the same script on a new VPS running Ubuntu 20.04 (the former one runs CentOS 8), and both of them reproduce the issue.

Before I start, let me upload a video that shows the web site being scraped.

https://user-images.githubusercontent.com/549126/145583251-a81cde9f-58f7-4b95-80d6-f6a1aae86750.mp4

If you can't see it, please download the zipped mp4 video from the attachment: 2.zip

As you can see, the video starts right after a successful login, when the browser enters the user admin management page. Then I send a request to https://www.jin10.com and wait for the following element to appear.

image

You can see it works on my local machine (Arch Linux) using ferrum + Chrome; in fact, it works with ferrum + headless Chrome too. But the same code does not work on my VPS. I tested on two VPSes, one CentOS 8 and one Ubuntu 20.04, and both of them get stuck waiting for the tabs in the red box of the screenshot above to appear.

The following is my scraping code:

require 'ferrum'

options = {
  pending_connection_errors: false,
  window_size: [1600, 900],
  timeout: 30,
  browser_options: { 'no-sandbox' => nil, 'blink-settings' => 'imagesEnabled=false', 'start-maximized' => true },
  headless: true,
  slowmo: 0.5
}

instance = Ferrum::Browser.new(options)

url = 'https://ucenter.jin10.com'

instance.goto url

# Poll until the login form exists and its phone field can take focus
until (form=instance.at_css('#J_loginForm')) &&
    (login=form.at_css('#J_loginPhone')) &&
    login.focusable?

  sleep 0.5
end

# Focus each field and type the username and password
login.focus.type(ENV['JIN10_USER'])
form.at_css('#J_loginPassword').focus.type(ENV['JIN10_PASS'])
instance.network.wait_for_idle

# Click the login button
form.at_css('button[type="submit"]').click

# Wait for the user admin management page to appear, taking a screenshot on each retry
until instance.at_css('div.ucenter-menu span.ucenter-menu_title')
  instance.screenshot(path: 'pngs/ccc.png')
  sleep 3
end

# Once it appears, go to the homepage
url = 'https://www.jin10.com'
instance.goto url

# Wait for the tabs to appear.
while (group_count = instance.css('ul.classify-list li').count) < 2
  sleep 5                         # <= loops forever here when run on the VPS.
  instance.screenshot(path: '1.png')
end

I have to admit that headless on my local machine occasionally fails too, and headless on the VPS did work several days ago. In most cases, headless Chrome is just very, very slow while waiting for the tabs.

Anyway, please guide me on how to find out where the issue comes from. Thank you!
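One way to narrow this down is to replace the open-ended `while` loop with a bounded wait, so the script fails with diagnostics instead of hanging. The following is a minimal sketch, not part of Ferrum: `wait_until` is a hypothetical helper name, and the timeout and interval values are assumptions to adjust.

```ruby
require 'timeout'

# Hypothetical helper (not a Ferrum API): call the block repeatedly until it
# returns a truthy value, and raise Timeout::Error once the deadline passes.
def wait_until(timeout: 60, interval: 0.5)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise Timeout::Error, "condition not met within #{timeout}s"
    end

    sleep interval
  end
end

# Possible use with the script above (assumes `instance` is the Ferrum browser):
#
#   begin
#     wait_until(timeout: 120, interval: 5) do
#       instance.css('ul.classify-list li').count >= 2
#     end
#   rescue Timeout::Error
#     instance.screenshot(path: 'timeout.png')  # what the page looked like
#     File.write('timeout.html', instance.body) # the HTML Chrome actually received
#     raise
#   end
```

Comparing the saved HTML from the VPS with the one from the local machine may show whether the page was served differently or simply never finished loading.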

zw963 commented 2 years ago

BTW: is there any issue when the same browser instance visits a different URL?

e.g.

instance.goto('site1')
# ...
instance.goto('site2')         # <= the same browser instance goes to another site.

I am curious whether the goto method has the same effect as other methods, like click.
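A simple way to check this is to record where the browser reports it ended up after each `goto` on the same instance. This is only a sketch: `visit_in_sequence` is a hypothetical helper, and it assumes nothing beyond the browser object responding to `goto` and `current_url`, which `Ferrum::Browser` does.

```ruby
# Hypothetical helper: visit each URL with the same browser object and record
# the URL the browser reports after every navigation.
def visit_in_sequence(browser, urls)
  urls.map do |url|
    browser.goto(url)
    { requested: url, landed_on: browser.current_url }
  end
end

# Possible use with a real browser (requires the ferrum gem and Chrome):
#
#   require 'ferrum'
#   browser = Ferrum::Browser.new(headless: true)
#   steps = visit_in_sequence(browser, ['https://ucenter.jin10.com', 'https://www.jin10.com'])
#   steps.each { |step| puts step.inspect }
#   browser.quit
```

If `landed_on` differs from `requested` for the second URL, the site redirected the request, which would point at the site rather than at reusing the browser instance.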

route commented 2 years ago

I'm sorry, but all your recent issues are not issues; they are questions and things to discuss. Let's use issues properly, so please move them to https://github.com/rubycdp/ferrum/discussions

Mifrill commented 2 years ago

@route we can use this for such cases: Selection_999

route commented 2 years ago

Nice, I didn't know about this, thanks!