mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Does crawler4j support crawling https pages? #174

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Use the sample code below. The https page is not crawled, and no exception 
is thrown.

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);
config.setIncludeHttpsPages(true);

What is the expected output? What do you see instead?

What version of the product are you using?
3.3

Please provide any additional information below.

Original issue reported on code.google.com by yaoancheng@gmail.com on 20 Sep 2012 at 12:14

GoogleCodeExporter commented 9 years ago
I have already fixed this problem. The original PageFetcher code fails to 
support crawling https pages.

This is how I fixed it: I use another fetcher and register an https Scheme, as 
described at the following URL, with some modifications of my own.

http://stackoverflow.com/questions/2703161/how-to-ignore-ssl-certificate-errors-in-apache-httpclient-4-0

import org.apache.http.conn.scheme.Scheme;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;

public class MyFetcher extends PageFetcher {

    public MyFetcher(CrawlConfig config) {
        super(config);

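        if (config.isIncludeHttpsPages()) {
            try {
                // Drop the default https scheme registered by PageFetcher
                httpClient.getConnectionManager().getSchemeRegistry()
                        .unregister("https");
                // Register an https scheme on port 443 that accepts any
                // certificate and host name (see MockSSLSocketFactory)
                httpClient.getConnectionManager().getSchemeRegistry().register(
                        new Scheme("https", 443, new MockSSLSocketFactory()));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }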
    }

}

Original comment by yaoancheng@gmail.com on 20 Sep 2012 at 2:55

GoogleCodeExporter commented 9 years ago
MockSSLSocketFactory.java

import java.io.IOException;
import java.security.KeyManagementException;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import java.security.UnrecoverableKeyException;
import java.security.cert.CertificateException;

import javax.net.ssl.SSLException;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;

import org.apache.http.conn.ssl.SSLSocketFactory;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.conn.ssl.X509HostnameVerifier;

public class MockSSLSocketFactory extends SSLSocketFactory {

    public MockSSLSocketFactory() throws NoSuchAlgorithmException,
            KeyManagementException, KeyStoreException,
            UnrecoverableKeyException {
        super(trustStrategy, hostnameVerifier);
    }

    private static final X509HostnameVerifier hostnameVerifier = new X509HostnameVerifier() {
        @Override
        public void verify(String host, SSLSocket ssl) throws IOException {
            // Do nothing
        }

        @Override
        public void verify(String host, String[] cns, String[] subjectAlts)
                throws SSLException {
            // Do nothing
        }

        @Override
        public boolean verify(String s, SSLSession sslSession) {
            return true;
        }

        @Override
        public void verify(String host, java.security.cert.X509Certificate cert)
                throws SSLException {
            // Do nothing
        }
    };

    // Trust every certificate chain, whatever the authentication type
    private static final TrustStrategy trustStrategy = new TrustStrategy() {

        @Override
        public boolean isTrusted(java.security.cert.X509Certificate[] chain,
                String authType) throws CertificateException {
            return true;
        }
    };
}
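
For comparison, the same "trust every certificate, accept every host name" idea can be sketched with only the JDK's javax.net.ssl classes, without HttpClient. This is a minimal sketch and not the code used in crawler4j: the class name TrustAllSketch and its members are made up for illustration. Like the factory above, it disables certificate validation, so it is only suitable for testing.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;

import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical class name, for illustration only
public class TrustAllSketch {

    // Trust manager that accepts every certificate chain (insecure)
    public static final X509TrustManager TRUST_ALL = new X509TrustManager() {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // Do nothing: accept any client certificate
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // Do nothing: accept any server certificate
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    };

    // Hostname verifier that accepts every host name (insecure)
    public static final HostnameVerifier ALLOW_ALL_HOSTS = (host, session) -> true;

    // Builds an SSLContext whose sockets skip certificate validation
    public static SSLContext trustAllContext() throws GeneralSecurityException {
        SSLContext context = SSLContext.getInstance("TLS");
        // No key managers, our single trust manager, default SecureRandom
        context.init(null, new TrustManager[] { TRUST_ALL }, null);
        return context;
    }

    public static void main(String[] args) throws Exception {
        SSLSocketFactory factory = trustAllContext().getSocketFactory();
        System.out.println("Created socket factory: " + (factory != null));
    }
}
```

The trust manager plays the role of the TrustStrategy above, and the HostnameVerifier lambda plays the role of the X509HostnameVerifier.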

Original comment by yaoancheng@gmail.com on 20 Sep 2012 at 2:56

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:27

GoogleCodeExporter commented 9 years ago
Looks like a good solution.

It still doesn't work in all cases, though, as seen in issue 286.

Original comment by avrah...@gmail.com on 2 Sep 2014 at 12:27

GoogleCodeExporter commented 9 years ago
I would suggest having another look here; there may be a better solution:
http://stackoverflow.com/questions/2703161/how-to-ignore-ssl-certificate-errors-in-apache-httpclient-4-0

Original comment by avrah...@gmail.com on 15 Sep 2014 at 2:24

GoogleCodeExporter commented 9 years ago
Fixed at rev: a96701fed185  

I have chosen a different, shorter approach (clearer, by my estimation).

Original comment by avrah...@gmail.com on 15 Sep 2014 at 2:33