medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

[SSL] URL blocks in Hyphe only #173

Closed: jacomyma closed this issue 7 years ago

jacomyma commented 8 years ago

https://www.oscaro.be/fr/
https://www.dataprivacyandsecurityinsider.com/

These URLs work in curl and in the usual browsers, but Hyphe reports "Bad gateway 502" when they are used as a start page.

boogheta commented 8 years ago

This seems to be an SSL issue caused by unmaintained system libraries on Hyphe's server: curl -I https://www.oscaro.be/fr/ fails on jrrr (CentOS 6.4) with curl: (35) SSL connect error, whereas it succeeds with a 200 on my own local machine. @jri-sp, do you know if there are system updates we could run on all our servers to fix such issues?

jri-sp commented 8 years ago

The problem is probably related to the web server only supporting the cipher suite TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, and it looks like curl does not enable those ciphers by default.

This command should work: curl -I --ciphers ecdhe_ecdsa_aes_256_sha https://www.oscaro.be/fr/

I think it's possible to specify which ciphers you want with the CURLOPT_SSL_CIPHER_LIST option, but that could break other sites if you leave out a cipher they need.

Updating curl to a newer version could be a solution, but on CentOS 6 that means using unstable repositories. (And it probably doesn't solve the Hyphe part?)

You can see which ciphers curl enables by default with this: curl https://www.howsmyssl.com/a/check
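A rough Python counterpart to that curl check, since Hyphe's crawling runs in Python rather than through curl. This is only a sketch and was not run in this thread. It assumes Python 3.6+ (for SSLContext.get_ciphers) and uses the OpenSSL spelling of the suite named above; the helper name default_cipher_names is made up for illustration. Twisted/scrapyd build their own TLS contexts, so the result describes the interpreter's OpenSSL build, not necessarily what the crawler negotiates.

```python
import ssl

# OpenSSL spelling of TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
TARGET_CIPHER = "ECDHE-ECDSA-AES128-GCM-SHA256"


def default_cipher_names():
    """Return the cipher suite names enabled in Python's default client context.

    Hypothetical helper for illustration; it needs no network access.
    """
    context = ssl.create_default_context()
    return [cipher["name"] for cipher in context.get_ciphers()]


if __name__ == "__main__":
    names = default_cipher_names()
    # The linked OpenSSL version is often the real culprit on old systems.
    print(ssl.OPENSSL_VERSION)
    print("%d cipher suites enabled by default" % len(names))
    print("%s enabled: %s" % (TARGET_CIPHER, TARGET_CIPHER in names))
```

If the target suite is missing from the list, the handshake failure is likely to reproduce from Python on that machine too.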

Hope this helps...

boogheta commented 8 years ago

Well, in the end we don't really care whether curl can process these URLs; what we want is to make sure the right certificates and ciphers are used by Python in scrapyd and Twisted. I'll run more tests to see where it fails (maybe the proxy as well).
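One way such a test could look from the Python side is a direct handshake probe using only the standard library. This is a sketch under assumptions: probe_tls is a hypothetical helper, it uses Python 3's ssl module rather than the Twisted/pyOpenSSL stack that scrapyd actually goes through, and it bypasses any proxy, so it only separates "this machine's OpenSSL cannot negotiate with the site" from problems further up the chain.

```python
import socket
import ssl


def probe_tls(host, port=443, timeout=10.0):
    """Attempt a TLS handshake and report where it succeeds or fails.

    Hypothetical helper for illustration. Returns a dict with "ok" plus
    either the negotiated protocol/cipher or the failing stage and error.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw_sock:
            with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
                cipher_name, _protocol, _bits = tls_sock.cipher()
                return {
                    "ok": True,
                    "protocol": tls_sock.version(),
                    "cipher": cipher_name,
                }
    except ssl.SSLError as exc:
        # Handshake or certificate problem (no shared cipher, bad cert, ...).
        # Must be caught before OSError, which is its base class.
        return {"ok": False, "stage": "tls", "error": str(exc)}
    except OSError as exc:
        # DNS failure, refused connection, timeout and similar.
        return {"ok": False, "stage": "network", "error": str(exc)}
```

Calling probe_tls("www.oscaro.be") on the failing server and on a working machine, then comparing the two results, would show whether the failure is at the TLS stage before looking at the proxy.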

boogheta commented 7 years ago

Fixed, at least in Docker.