openaustralia / morph

Take the hassle out of web scraping
https://morph.io
GNU Affero General Public License v3.0

Chrome headless scrapers appear broken #1201

Open jamezpolley opened 5 years ago

jamezpolley commented 5 years ago

Copied from help.morph.io

A scraper of mine seems to have broken at the same time as this update :frowning:

It fails with a:

    selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: crashed (Driver info: chromedriver=2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7),platform=Linux 4.17.17-x86_64-linode116 x86_64)

It’s using chromedriver in python, as described here: https://morph.io/documentation/scraping_javascript_sites

The examples on the docs page above also seem to have broken at the same time, e.g.: https://morph.io/wfdd/inatsisartut-scraper/history

I just tried re-running the really basic example, and that broke in the same way: https://morph.io/andylolz/example_ruby_chrome_headless_scraper

So all this seems to suggest it’s related to this update. What can be done to get this working again? Thanks!
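When comparing failures across scrapers, it can help to pull the chromedriver version and platform out of the error text. The sketch below does that for the message quoted above. The helper name `parse_driver_info` is hypothetical, and the pattern assumes the `(Driver info: ...)` suffix always has the shape seen in this one message.

```python
import re

# Matches the "(Driver info: ...)" suffix as it appears in the error above.
# Assumption: other chromedriver versions format this suffix the same way.
DRIVER_INFO = re.compile(
    r"chromedriver=(?P<version>[\d.]+)\s*"
    r"\((?P<build>[0-9a-f]+)\),"
    r"platform=(?P<platform>[^)]+)\)"
)


def parse_driver_info(message):
    """Return chromedriver version, build hash and platform, or None if absent."""
    match = DRIVER_INFO.search(message)
    if match is None:
        return None
    return match.groupdict()


# The message reported in this issue, split across lines for readability.
message = (
    "unknown error: Chrome failed to start: crashed "
    "(Driver info: chromedriver=2.37.544315 "
    "(730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7),"
    "platform=Linux 4.17.17-x86_64-linode116 x86_64)"
)
info = parse_driver_info(message)
print(info["version"], "|", info["platform"])
```

If the Python and Ruby scrapers report the same version and platform, that points at the shared environment rather than at either language binding.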

jamezpolley commented 5 years ago

(repeating comment from #1196)

I've created a very small test "scraper" that doesn't actually scrape, it just checks that mitmproxy is returning a certificate that the system is trusting.

https://morph.io/jamezpolley/ssl_test

Decoded, the CA cert there is:

    Issuer: CN = mitmproxy, O = mitmproxy
    Validity
            Not Before: Mar 30 19:35:55 2018 GMT
            Not After : Mar 31 19:35:55 2021 GMT
    Subject: CN = mitmproxy, O = mitmproxy

and the certificate for www.yahoo.com is:

    Issuer: CN = mitmproxy, O = mitmproxy
    Validity
            Not Before: Jan 29 05:48:08 2019 GMT
            Not After : Jan 30 05:48:08 2024 GMT
    Subject: CN = *.www.yahoo.com
    X509v3 extensions:
            X509v3 Subject Alternative Name:
                    DNS:*.www.yahoo.com, DNS:add.my.yahoo.com, DNS:*.amp.yimg.com, DNS:au.yahoo.com, DNS:be.yahoo.com, DNS:br.yahoo.com, DNS:ca.my.yahoo.com, DNS:ca.rogers.yahoo.com, DNS:ca.yahoo.com, DNS:ddl.fp.yahoo.com, DNS:de.yahoo.com, DNS:en-maktoob.yahoo.com, DNS:espanol.yahoo.com, DNS:es.yahoo.com, DNS:fr-be.yahoo.com, DNS:fr-ca.rogers.yahoo.com, DNS:frontier.yahoo.com, DNS:fr.yahoo.com, DNS:gr.yahoo.com, DNS:hk.yahoo.com, DNS:hsrd.yahoo.com, DNS:ideanetsetter.yahoo.com, DNS:id.yahoo.com, DNS:ie.yahoo.com, DNS:in.yahoo.com, DNS:it.yahoo.com, DNS:maktoob.yahoo.com, DNS:malaysia.yahoo.com, DNS:mbp.yimg.com, DNS:my.yahoo.com, DNS:nz.yahoo.com, DNS:ph.yahoo.com, DNS:qc.yahoo.com, DNS:ro.yahoo.com, DNS:se.yahoo.com, DNS:sg.yahoo.com, DNS:tw.yahoo.com, DNS:uk.yahoo.com, DNS:us.yahoo.com, DNS:verizon.yahoo.com, DNS:vn.yahoo.com, DNS:www.yahoo.com, DNS:yahoo.com, DNS:za.yahoo.com, DNS:106.10.250.10

So I don't think it's related to https://help.morph.io/t/certificate-verify-failed/338 (because I'm seeing MITMProxy serving the certificate okay)
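For anyone wanting to repeat this kind of check, here is a minimal sketch of the idea in Python using only the standard library. It is not the code of the ssl_test scraper. The function names `name_field` and `check_trust` are hypothetical, and it assumes outbound traffic is transparently intercepted, so a direct connection is enough to see the proxy's certificate.

```python
import socket
import ssl


def name_field(rdns, key):
    """Pull one field (e.g. 'commonName') out of a getpeercert() name tuple."""
    for rdn in rdns:
        for field, value in rdn:
            if field == key:
                return value
    return None


def check_trust(host, port=443, timeout=10):
    """Connect using the system trust store and report who issued the certificate.

    Returns (trusted, detail): detail is the issuer CN when the handshake
    succeeds, otherwise the TLS error text.
    """
    context = ssl.create_default_context()  # verifies against the system CAs
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # server_hostname sends SNI and enables hostname checking
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except ssl.SSLError as exc:
        return False, str(exc)
    return True, name_field(cert.get("issuer", ()), "commonName")
```

Calling `check_trust("www.yahoo.com")` from inside a scraper should return `True` with the issuer name if the system trusts whatever certificate it is served. A verification failure comes back as `False` with the error text, and plain connection errors are left to propagate.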

jamezpolley commented 5 years ago

https://morph.io/wfdd/inatsisartut-scraper seems to be running okay, but that might be because it's reporting that PhantomJS isn't supported rather than doing anything useful:

    /app/.heroku/python/lib/python3.6/site-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
      warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '

https://morph.io/andylolz/example_ruby_chrome_headless_scraper is still failing, with errors from selenium webdriver

I'm wondering if the error might be related to webdriver - that would explain why both python and ruby interfaces to it are broken. It seems like upgrading the version and changing the samples to not use phantomJS might be useful.
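As a rough idea of what a sample without PhantomJS could look like, here is a minimal sketch of a headless Chrome scraper in Python. It is not a confirmed fix for the crash above. The flag list is an assumption based on what Chrome commonly needs inside containers, the function names are hypothetical, and the `options=` keyword assumes a reasonably recent Selenium (older releases used `chrome_options=`).

```python
def headless_chrome_flags():
    """Flags often needed for Chrome in containers (an assumption, not a confirmed fix)."""
    return [
        "--headless",
        "--no-sandbox",             # the sandbox often cannot start when running as root
        "--disable-gpu",
        "--disable-dev-shm-usage",  # avoid relying on a small /dev/shm
    ]


def make_driver():
    """Build a headless Chrome WebDriver; needs selenium and chromedriver installed."""
    # Imported lazily so the flag helper above can be used without selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for flag in headless_chrome_flags():
        options.add_argument(flag)
    return webdriver.Chrome(options=options)


def page_title(url):
    """Load a page and return its title, always shutting the browser down."""
    driver = make_driver()
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()
```

If Chrome still fails to start with these flags, that would suggest the problem is in the Chrome or chromedriver install itself and not in how the samples launch it.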

wfdd commented 5 years ago

The error is unlikely to be related to the web driver as it's also happening in pure Python - see https://morph.io/wfdd/test-scraper.

wfdd commented 5 years ago

> https://morph.io/wfdd/inatsisartut-scraper seems to be running okay, but that might be because it's reporting that PhantomJS isn't supported rather than doing anything useful:

It's doing nothing because it's receiving a blank page in response. Also, it's reporting that Phantom JS support is deprecated (i.e. it still works).

wfdd commented 5 years ago

When I test your faux scraper against ina.gl, openssl can't find a certificate for it, even when using SNI: https://morph.io/wfdd/ssl_test. The same appears to be true for other domain names that have been plagued by this issue.

andylolz commented 5 years ago

I think a bunch of different errors are getting bundled into this ticket, so can we take a step back for a moment?

As per this comment, I only mentioned https://morph.io/wfdd/inatsisartut-scraper because back in Sept 2018 (when these errors started, following this update) that scraper used chrome headless. In Oct 2018, wfdd/inatsisartut-scraper switched to use phantomJS. So any errors reported on that scraper since Oct 2018 are probably not relevant to this ticket, which is specifically about chrome headless.

Similarly, https://morph.io/wfdd/test-scraper and https://morph.io/wfdd/ssl_test are not using chrome headless, so I guess any errors with those scrapers belong in a new ticket.

So specifically on the error in this ticket…

> I'm wondering if the error might be related to webdriver - that would explain why both python and ruby interfaces to it are broken

^^ Yes – this sounds correct to me.

> It seems like upgrading the version and changing the samples to not use phantomJS might be useful.

I’m not sure I follow. https://morph.io/andylolz/example_ruby_chrome_headless_scraper doesn’t use phantomJS.

jamezpolley commented 5 years ago

Thanks @wfdd and @andylolz

I'm able to use @wfdd's version of the script to reproduce the problem on my dev instance,