Open jamezpolley opened 5 years ago
(repeating comment from #1196)
I've created a very small test "scraper" that doesn't actually scrape; it just checks that mitmproxy is returning a certificate that the system trusts.
https://morph.io/jamezpolley/ssl_test
Decoded, the CA cert there is:
Issuer: CN = mitmproxy, O = mitmproxy
Validity
Not Before: Mar 30 19:35:55 2018 GMT
Not After : Mar 31 19:35:55 2021 GMT
Subject: CN = mitmproxy, O = mitmproxy
and the certificate for www.yahoo.com is:
Issuer: CN = mitmproxy, O = mitmproxy
Validity
Not Before: Jan 29 05:48:08 2019 GMT
Not After : Jan 30 05:48:08 2024 GMT
Subject: CN = *.www.yahoo.com
X509v3 extensions:
X509v3 Subject Alternative Name:
DNS:*.www.yahoo.com, DNS:add.my.yahoo.com, DNS:*.amp.yimg.com, DNS:au.yahoo.com, DNS:be.yahoo.com, DNS:br.yahoo.com, DNS:ca.my.yahoo.com, DNS:ca.rogers.yahoo.com, DNS:ca.yahoo.com, DNS:ddl.fp.yahoo.com, DNS:de.yahoo.com, DNS:en-maktoob.yahoo.com, DNS:espanol.yahoo.com, DNS:es.yahoo.com, DNS:fr-be.yahoo.com, DNS:fr-ca.rogers.yahoo.com, DNS:frontier.yahoo.com, DNS:fr.yahoo.com, DNS:gr.yahoo.com, DNS:hk.yahoo.com, DNS:hsrd.yahoo.com, DNS:ideanetsetter.yahoo.com, DNS:id.yahoo.com, DNS:ie.yahoo.com, DNS:in.yahoo.com, DNS:it.yahoo.com, DNS:maktoob.yahoo.com, DNS:malaysia.yahoo.com, DNS:mbp.yimg.com, DNS:my.yahoo.com, DNS:nz.yahoo.com, DNS:ph.yahoo.com, DNS:qc.yahoo.com, DNS:ro.yahoo.com, DNS:se.yahoo.com, DNS:sg.yahoo.com, DNS:tw.yahoo.com, DNS:uk.yahoo.com, DNS:us.yahoo.com, DNS:verizon.yahoo.com, DNS:vn.yahoo.com, DNS:www.yahoo.com, DNS:yahoo.com, DNS:za.yahoo.com, DNS:106.10.250.10
So I don't think it's related to https://help.morph.io/t/certificate-verify-failed/338, because I'm seeing mitmproxy serve the certificate okay.
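For anyone wanting to repeat this kind of check, a minimal Python sketch along these lines might do. It is not the code behind the ssl_test scraper. The function names `issuer_common_name` and `check_trusted` are made up for illustration, and it uses only the standard library `ssl` and `socket` modules. It opens a fully verified TLS connection, so it fails if the system does not trust the presented chain, and it reports the issuer CN.

```python
import socket
import ssl


def issuer_common_name(cert):
    """Return the issuer CN from a dict as returned by SSLSocket.getpeercert()."""
    # The issuer is a tuple of RDNs, each of which is a tuple of (key, value) pairs.
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def check_trusted(host, port=443, timeout=10):
    """Connect with full verification and return the issuer CN.

    Raises ssl.SSLError (or ssl.CertificateError on older Pythons)
    if the presented chain is not trusted or the hostname does not match.
    """
    context = ssl.create_default_context()  # uses the system trust store
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return issuer_common_name(tls.getpeercert())
```

If the setup matches the output above, calling `check_trusted("www.yahoo.com")` from inside a run would be expected to return `mitmproxy` when the proxy's CA is trusted, and to raise otherwise.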
https://morph.io/wfdd/inatsisartut-scraper seems to be running okay, but that might be because it's reporting that PhantomJS isn't supported rather than doing anything useful:
/app/.heroku/python/lib/python3.6/site-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
https://morph.io/andylolz/example_ruby_chrome_headless_scraper is still failing, with errors from selenium webdriver
I'm wondering if the error might be related to webdriver - that would explain why both Python and Ruby interfaces to it are broken. It seems like upgrading the version and changing the samples to not use PhantomJS might be useful.
The error is unlikely to be related to the web driver as it's also happening in pure Python - see https://morph.io/wfdd/test-scraper.
> https://morph.io/wfdd/inatsisartut-scraper seems to be running okay, but that might be because it's reporting that PhantomJS isn't supported rather than doing anything useful:
It's doing nothing because it's receiving a blank page in response. Also, it's reporting that PhantomJS support is deprecated (i.e. it still works).
Testing your faux scraper with ina.gl, openssl can't find a certificate for it even when using SNI: https://morph.io/wfdd/ssl_test. The same appears to be true for other domain names that have been plagued by this issue.
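One way to see which certificate (if any) is presented for a given name, regardless of whether it is trusted, is to switch verification off and send the name via SNI. A rough Python sketch follows, standard library only. The helper names `unverified_context` and `fetch_presented_cert_pem` are hypothetical, and this is not what the ssl_test scraper runs.

```python
import socket
import ssl


def unverified_context():
    """Build a client context that accepts any certificate (for inspection only)."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # check_hostname must be disabled before verify_mode can be set to CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch_presented_cert_pem(host, port=443, timeout=10):
    """Return the PEM text of whatever certificate the peer presents for `host`.

    The host name is sent via SNI (server_hostname). Returns None if the
    peer presented no certificate; a failed handshake raises ssl.SSLError.
    """
    ctx = unverified_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return ssl.DER_cert_to_PEM_cert(der) if der else None
```

The PEM text returned by `fetch_presented_cert_pem("ina.gl")` could then be decoded with `openssl x509 -text -noout` to compare against the certificates shown earlier in the thread.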
I think a bunch of different errors are getting bundled into this ticket, so can we take a step back for a moment?
As per this comment, I only mentioned https://morph.io/wfdd/inatsisartut-scraper because back in Sept 2018 (when these errors started, following this update) that scraper used Chrome headless. In Oct 2018, wfdd/inatsisartut-scraper switched to PhantomJS. So any errors reported on that scraper since Oct 2018 are probably not relevant to this ticket, which is specifically about Chrome headless.
Similarly, https://morph.io/wfdd/test-scraper and https://morph.io/wfdd/ssl_test are not using Chrome headless, so I guess any errors with those scrapers belong in a new ticket.
So specifically on the error in this ticket…
> I'm wondering if the error might be related to webdriver - that would explain why both Python and Ruby interfaces to it are broken
^^ Yes – this sounds correct to me.
> It seems like upgrading the version and changing the samples to not use PhantomJS might be useful.
I’m not sure I follow. https://morph.io/andylolz/example_ruby_chrome_headless_scraper doesn’t use PhantomJS.
Thanks @wfdd and @andylolz
I'm able to use @wfdd's version of the script to reproduce the problem on my dev instance.
Copied from help.morph.io
A scraper of mine seems to have broken at the same time as this update :frowning:
It fails with a:
It’s using chromedriver in Python, as described here: https://morph.io/documentation/scraping_javascript_sites
The examples on the docs page above also seem to have broken at the same time, e.g.: https://morph.io/wfdd/inatsisartut-scraper/history
I just tried re-running the really basic example, and that broke in the same way: https://morph.io/andylolz/example_ruby_chrome_headless_scraper
So all this seems to suggest it’s related to this update. What can be done to get this working again? Thanks!