jesbin / crawler4j

Automatically exported from code.google.com/p/crawler4j

How to do NTLM Authentication ? #250

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Use this URL as seed:
http://www-scopus-com.ezlibproxy1.ntu.edu.sg/authid/detail.url?authorId=14831850700

2. It will redirect 2 times to
https://ezproxylogin1.ntu.edu.sg/restricted/login_sso.asp?logup=false&url=http%3A%2F%2Fwww.scopus.com%2Fauthid%2Fdetail.url%3FauthorId%3D14831850700

3. At the second redirect, it stops redirecting and returns a 401
(Unauthorized) status code.

What is the expected output? What do you see instead?
The expected output is status code 200 with the page content retrieved.

What I see instead is status code 401 (Unauthorized).

What version of the product are you using?
Crawler4j 3.5

Please provide any additional information below.

I tried performing authentication in PageFetcher.java.

    HttpContext localContext = new BasicHttpContext();
    CredentialsProvider credsProvider = new BasicCredentialsProvider();
    // Constructor order is NTCredentials(userName, password, workstation, domain).
    credsProvider.setCredentials(AuthScope.ANY,
            new NTCredentials("myusername", "mypassword", "", "ezproxylogin1.ntu.edu.sg"));

    // Prefer NTLM when the target server offers several schemes.
    ArrayList<String> authtypes = new ArrayList<String>();
    authtypes.add(AuthPolicy.NTLM);
    httpClient.getParams().setParameter(AuthPNames.TARGET_AUTH_PREF, authtypes);

    localContext.setAttribute(ClientContext.CREDS_PROVIDER, credsProvider);

    get.addHeader("Accept-Encoding", "gzip");
    HttpResponse response = httpClient.execute(get, localContext);

However, the response is always 401. Does anyone understand NTLM authentication
and can help with this? Thanks.
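
A 401 response carries one or more WWW-Authenticate headers naming the schemes the server accepts, so one sanity check before wiring up NTLM is to confirm that NTLM is actually among them. The following is a minimal sketch in plain Java, not a definitive diagnostic: the class OfferedAuthSchemes and its methods are hypothetical helpers (not part of crawler4j or HttpClient), and it assumes each header value holds a single challenge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for inspecting the challenges in a 401 response. */
final class OfferedAuthSchemes {

    /**
     * Returns the upper-cased scheme name of each WWW-Authenticate value.
     * Assumption: one challenge per header value (no comma-joined challenges).
     */
    static List<String> parse(List<String> wwwAuthenticateValues) {
        List<String> schemes = new ArrayList<String>();
        for (String value : wwwAuthenticateValues) {
            String trimmed = value.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank header values
            }
            // The scheme is the first whitespace-delimited token.
            String scheme = trimmed.split("\\s+", 2)[0];
            schemes.add(scheme.toUpperCase(Locale.ROOT));
        }
        return schemes;
    }

    /** True if any of the given header values advertises the NTLM scheme. */
    static boolean offersNtlm(List<String> wwwAuthenticateValues) {
        return parse(wwwAuthenticateValues).contains("NTLM");
    }

    public static void main(String[] args) {
        // Illustrative header values, not captured from the server in this report.
        List<String> headers = Arrays.asList("Negotiate", "NTLM", "Basic realm=\"example\"");
        System.out.println(parse(headers));
        System.out.println(offersNtlm(headers));
    }
}
```

With HttpClient 4.x, the input values could be collected from response.getHeaders("WWW-Authenticate"). If NTLM is not in the list, the login page may expect a different mechanism (for example a form login), and NTLM credentials alone would not help.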

Original issue reported on code.google.com by vincent....@gmail.com on 17 Jan 2014 at 6:45

GoogleCodeExporter commented 8 years ago
Hi Vincent,

I've tested the following code and it works for me. Please follow these steps:

Step one:

In PageFetcher, just after creating the httpClient object, add these lines:

httpClient.getAuthSchemes().register(AuthPolicy.NTLM, new AuthSchemeFactory() {
  public AuthScheme newInstance(HttpParams params) {
    return new NTLMScheme(new JCIFSEngine());
  }
});

httpClient.getCredentialsProvider().setCredentials(AuthScope.ANY,
    new NTCredentials(USER, PASS, HOST, DOMAIN));

Step two (you can find the new class as an attachment):

- Create a new class called JCIFSEngine, for example in the
edu.uci.ics.crawler4j.auth package.
- Copy the code from this link
https://hc.apache.org/httpcomponents-client-4.3.x/ntlm.html into the
JCIFSEngine class.

Step three:
Enjoy.

Maybe the crawler4j team can integrate these lines of code properly. It would
also be good if we could choose between kinds of authentication.
We can imagine something like this in PageFetcher:

    protected void configureHttpClientAuth() {
        if (isNTLMAuth()) {
            configureNTLMHttpClient();
        } else if (isBASICAuth()) {
            configureBasicHttpClient();
        } else if (isNegotiateAuth()) {
            configureNegotiateHttpClient();
        } else {
            logger.info("No authentication to configure.");
        }
    }
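
One way to make that choice explicit is a single setting mapped to an enum, instead of a chain of boolean checks. The following is only a sketch under stated assumptions: AuthType, fromConfig, AuthDispatch and stepFor are hypothetical names that do not exist in crawler4j, and each branch merely returns the name of the method that would do the real HttpClient configuration.

```java
import java.util.Locale;

/** Hypothetical authentication kinds; the names are illustrative only. */
enum AuthType {
    NONE, BASIC, NTLM, NEGOTIATE;

    /** Parses a config value such as "ntlm"; null, blank or unknown input maps to NONE. */
    static AuthType fromConfig(String value) {
        if (value == null) {
            return NONE;
        }
        try {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return NONE; // unknown scheme: configure nothing
        }
    }
}

/** Hypothetical dispatcher mirroring the if/else chain above. */
final class AuthDispatch {

    /** Returns the name of the configuration step that would run for the type. */
    static String stepFor(AuthType type) {
        switch (type) {
            case NTLM:
                return "configureNTLMHttpClient";
            case BASIC:
                return "configureBasicHttpClient";
            case NEGOTIATE:
                return "configureNegotiateHttpClient";
            default:
                return "none";
        }
    }

    public static void main(String[] args) {
        System.out.println(stepFor(AuthType.fromConfig("ntlm")));
    }
}
```

A single enum-valued setting keeps the schemes mutually exclusive, which the separate isNTLMAuth()/isBASICAuth() checks do not guarantee on their own.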

Thanks.

Nizar.

Original comment by nizar.salhaji@gmail.com on 16 Jul 2014 at 1:54

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:48

GoogleCodeExporter commented 8 years ago
I have added a way to log in as of the latest revision: 4388892aeb78

Grab latest from trunk and see if it works for you.

If it doesn't then just integrate the NTLM authentication into it and send me a 
patch which I can integrate into the core.

Original comment by avrah...@gmail.com on 26 Nov 2014 at 5:38

GoogleCodeExporter commented 8 years ago
Look at Mario's Code here:
https://code.google.com/p/crawler4j/wiki/Crawling_Password_Protected_Sites

Original comment by avrah...@gmail.com on 2 Dec 2014 at 10:12