jesbin / crawler4j

Automatically exported from code.google.com/p/crawler4j

Proxy information gets lost when using basic authentication #330

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Set proxy settings in CrawlConfig
2. Add BasicAuthInfo to CrawlConfig
3. Try to crawl a site with basic authentication

What is the expected output? What do you see instead?
The crawler should crawl the URL and fetch the data.
But this is not possible, because the crawler can't connect.

What version of the product are you using?
4.0

Please provide any additional information below.
The code in PageFetcher.java must be changed.

Currently the proxy information (and possibly other settings) is lost when 
basic authentication is performed.

In method PageFetcher.doBasicLogin(BasicAuthInfo authInfo) a new HttpClient is 
created.

    /**
     * BASIC authentication<br/>
     * Official Example:
     * https://hc.apache.org/httpcomponents-client-ga/httpclient/examples/org/apache/http/examples/client/ClientAuthentication.java
     */
    protected void doBasicLogin(BasicAuthInfo authInfo) {
        HttpHost targetHost = new HttpHost(authInfo.getHost(), authInfo.getPort(), authInfo.getProtocol());
        CredentialsProvider credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(new AuthScope(targetHost.getHostName(), targetHost.getPort()),
                new UsernamePasswordCredentials(authInfo.getUsername(), authInfo.getPassword()));
        // A brand-new client replaces the existing one here, so the proxy settings are not carried over.
        httpClient = HttpClients.custom().setDefaultCredentialsProvider(credsProvider).build();
    }

Original issue reported on code.google.com by wefwefw...@gmail.com on 6 Jan 2015 at 11:19

GoogleCodeExporter commented 8 years ago
We are experiencing the same problem. Crawling through the proxy works until we 
try to crawl a site that requires basic auth; then the crawl fails because the 
proxy information is missing.

Original comment by anthony....@gmail.com on 20 May 2015 at 1:47

GoogleCodeExporter commented 8 years ago
I have created a fix for this issue and will submit it within the week; I first 
need to read up on the submission process. I have tested it with both the 
credentials and the proxy. It no longer throws the original httpClient away. I 
made the change against the 4.1 release and will try to integrate it into 4.2. 
The files are attached.
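
The idea of not throwing the original client configuration away can be sketched as follows. This is only an illustration, not the attached patch: it uses the JDK's built-in `java.net.http.HttpClient` (Java 11+) instead of the Apache HttpClient that crawler4j uses, and the class `SketchFetcher` and its method signatures are hypothetical. The point is that one builder holds every setting, so the login step adds credentials to it rather than starting from an empty builder.

```java
import java.net.Authenticator;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.ProxySelector;
import java.net.http.HttpClient;

// Hypothetical stand-in for PageFetcher, written against the JDK client.
class SketchFetcher {
    // Single builder shared by all configuration steps.
    private final HttpClient.Builder builder = HttpClient.newBuilder();
    private HttpClient httpClient;

    SketchFetcher(String proxyHost, int proxyPort) {
        // The proxy is recorded on the shared builder, not only on the built client.
        builder.proxy(ProxySelector.of(new InetSocketAddress(proxyHost, proxyPort)));
        httpClient = builder.build();
    }

    void doBasicLogin(String username, String password) {
        // Add credentials to the same builder, so the proxy set earlier is kept.
        builder.authenticator(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication(username, password.toCharArray());
            }
        });
        // Rebuild from the accumulated state instead of from an empty builder.
        httpClient = builder.build();
    }

    HttpClient client() {
        return httpClient;
    }
}
```

With Apache HttpClient the same principle would presumably mean setting the proxy and the credentials provider on one `HttpClientBuilder` before calling `build()`.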

Original comment by anthony....@gmail.com on 26 May 2015 at 7:04


GoogleCodeExporter commented 8 years ago
Better version, which also includes a fix for form auth.

Original comment by anthony....@gmail.com on 26 May 2015 at 9:43
