openaustralia / morph

Take the hassle out of web scraping
https://morph.io
GNU Affero General Public License v3.0

Some? HTTPS connections fail #1196

Open wfdd opened 5 years ago

wfdd commented 5 years ago

As reported in [1], HTTPS requests either fail outright (as with vanilla Python) or return exactly the payload <html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html> (e.g. with Selenium, presumably because the status code is ignored and that is what headless Chrome hardcodes as its blank page). This appears to have been happening to inatsisartut-scraper for the past eight months; it slipped under my radar because it did not cause the scraper to fail.

Possibly related: #1201

wfdd commented 5 years ago

The error raised by Python 3.6 when calling urlopen('https://ina.gl/inatsisartut/sammensaetning-af-inatsisartut/') is ssl.SSLError: [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:841).
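
For anyone who wants to check whether their own scraper is hitting the same failure, here is a minimal sketch using only the standard library. The names classify_failure and probe are hypothetical helpers, not part of morph.io or of any scraper in this thread, and the labels are matched on the exception text, which is an assumption about how the error is worded on a given Python/OpenSSL build.

```python
import ssl
import urllib.error
import urllib.request


def classify_failure(exc):
    """Return a short label describing why an HTTPS fetch failed (hypothetical helper)."""
    cause = exc
    # urlopen typically wraps low-level errors in URLError; inspect the underlying cause.
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, Exception):
        cause = exc.reason
    if isinstance(cause, ssl.SSLError):
        text = str(cause)
        # Assumption: the error text contains the OpenSSL reason name, as in the report above.
        if "UNKNOWN_PROTOCOL" in text:
            return "unknown_protocol"
        if "CERTIFICATE_VERIFY_FAILED" in text:
            return "certificate_verify_failed"
        return "other_ssl_error"
    return "non_ssl_error"


def probe(url, timeout=10):
    """Fetch url once; return 'ok (<status>)' or a failure label."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return "ok ({})".format(response.status)
    except OSError as exc:  # URLError, ssl.SSLError and socket timeouts are all OSError
        return classify_failure(exc)
```

Calling probe() with the URL above from inside a scraper run should then report "unknown_protocol" if the same error is being hit, which makes it easier to tell this issue apart from a certificate verification failure.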

chris48s commented 5 years ago

Last time a similar problem happened, it was an issue with MITM proxy: https://help.morph.io/t/certificate-verify-failed/338

I wonder if it is related again?

jamezpolley commented 5 years ago

Similar to #1201

jamezpolley commented 5 years ago

I've created a very small test "scraper" that doesn't actually scrape; it just checks that mitmproxy is returning a certificate that the system trusts.

https://morph.io/jamezpolley/ssl_test

Decoded, the CA cert there is:

    Issuer: CN = mitmproxy, O = mitmproxy
    Validity
            Not Before: Mar 30 19:35:55 2018 GMT
            Not After : Mar 31 19:35:55 2021 GMT
    Subject: CN = mitmproxy, O = mitmproxy

and the certificate for www.yahoo.com is:

    Issuer: CN = mitmproxy, O = mitmproxy
    Validity
            Not Before: Jan 29 05:48:08 2019 GMT
            Not After : Jan 30 05:48:08 2024 GMT
    Subject: CN = *.www.yahoo.com
    X509v3 extensions:
            X509v3 Subject Alternative Name:
                    DNS:*.www.yahoo.com, DNS:add.my.yahoo.com, DNS:*.amp.yimg.com, DNS:au.yahoo.com, DNS:be.yahoo.com, DNS:br.yahoo.com, DNS:ca.my.yahoo.com, DNS:ca.rogers.yahoo.com, DNS:ca.yahoo.com, DNS:ddl.fp.yahoo.com, DNS:de.yahoo.com, DNS:en-maktoob.yahoo.com, DNS:espanol.yahoo.com, DNS:es.yahoo.com, DNS:fr-be.yahoo.com, DNS:fr-ca.rogers.yahoo.com, DNS:frontier.yahoo.com, DNS:fr.yahoo.com, DNS:gr.yahoo.com, DNS:hk.yahoo.com, DNS:hsrd.yahoo.com, DNS:ideanetsetter.yahoo.com, DNS:id.yahoo.com, DNS:ie.yahoo.com, DNS:in.yahoo.com, DNS:it.yahoo.com, DNS:maktoob.yahoo.com, DNS:malaysia.yahoo.com, DNS:mbp.yimg.com, DNS:my.yahoo.com, DNS:nz.yahoo.com, DNS:ph.yahoo.com, DNS:qc.yahoo.com, DNS:ro.yahoo.com, DNS:se.yahoo.com, DNS:sg.yahoo.com, DNS:tw.yahoo.com, DNS:uk.yahoo.com, DNS:us.yahoo.com, DNS:verizon.yahoo.com, DNS:vn.yahoo.com, DNS:www.yahoo.com, DNS:yahoo.com, DNS:za.yahoo.com, DNS:106.10.250.10

So I don't think it's related to https://help.morph.io/t/certificate-verify-failed/338, because I'm seeing mitmproxy serve the certificate okay.
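
A check along these lines could look roughly like the sketch below. This is not the code of the ssl_test scraper linked above; it is an illustration using Python's standard ssl module, and the helper names (name_field, issued_by_mitmproxy, fetch_peer_cert) are made up for the example. It assumes the connection is intercepted transparently, so no explicit proxy setup is shown.

```python
import socket
import ssl


def name_field(cert, section, key):
    """Look up one field (e.g. 'commonName') in a getpeercert() dict."""
    # 'issuer' and 'subject' are tuples of RDNs, each a tuple of (field, value) pairs.
    for rdn in cert.get(section, ()):
        for field, value in rdn:
            if field == key:
                return value
    return None


def issued_by_mitmproxy(cert):
    """True if the certificate's issuer CN is 'mitmproxy', as in the dumps above."""
    return name_field(cert, "issuer", "commonName") == "mitmproxy"


def fetch_peer_cert(host, port=443, timeout=10):
    """Handshake with host and return its certificate as a dict.

    The default context verifies against the system trust store, so this
    raises ssl.SSLError if the served certificate is not trusted.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()
```

If fetch_peer_cert("www.yahoo.com") returns without raising and issued_by_mitmproxy() is true for the result, the proxy's CA is trusted by the system and the certificate path is working, which is the situation described in this comment.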

wfdd commented 5 years ago

See #1202 for (misplaced) details on this issue, which is unrelated to the web driver (be it PhantomJS or Chrome).

jamezpolley commented 5 years ago

@wfdd Did you mean #1201?

wfdd commented 5 years ago

I did.

reitermarkus commented 5 years ago

I am also getting the unknown protocol error with my Ruby scraper https://morph.io/reitermarkus/heizoelpreise-oesterreich.