mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

How to crawl a password-protected website. Can you provide some samples for the same where authentication is involved #88

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
How to crawl a password-protected website? Can you provide some samples for the
same where authentication is involved?

Original issue reported on code.google.com by arjunpn...@gmail.com on 24 Oct 2011 at 9:10

GoogleCodeExporter commented 9 years ago
I checked the PageFetcher code and found that website authentication is not
supported. Is it possible to achieve this requirement with the current version
of crawler4j, or do we need to modify the code? It would be a great help to get
sample code that accesses a password-protected website (if the current
crawler4j supports this).

Original comment by arjunpn...@gmail.com on 24 Oct 2011 at 11:43

GoogleCodeExporter commented 9 years ago
Seconding this request; it seems like a required feature.

Original comment by slava.ri...@gmail.com on 23 Nov 2011 at 8:20

GoogleCodeExporter commented 9 years ago
It would be really great if such authentication were supported by the crawler.
Web page login form authentication has been a real problem for me too.

Original comment by mansur.u...@gmail.com on 24 Dec 2011 at 5:39

GoogleCodeExporter commented 9 years ago
I have the following temporary solution:

    // Apache HttpClient 4.x; imports needed:
    //   org.apache.http.HttpResponse, org.apache.http.NameValuePair,
    //   org.apache.http.client.entity.UrlEncodedFormEntity,
    //   org.apache.http.client.methods.HttpPost,
    //   org.apache.http.message.BasicNameValuePair, org.apache.http.protocol.HTTP
    HttpPost httpost = new HttpPost(toFetchURL);
    List<NameValuePair> nvps = new ArrayList<NameValuePair>();
    // nvps.add(new BasicNameValuePair("formId", "loginform"));   // not required in my case
    // nvps.add(new BasicNameValuePair("action", "action name")); // not required in my case
    nvps.add(new BasicNameValuePair("name of the user name input field", "username"));
    nvps.add(new BasicNameValuePair("name of the password input field", "password"));

    try {
        // post the login form; the session cookie lands in httpclient's cookie store
        httpost.setEntity(new UrlEncodedFormEntity(nvps, HTTP.UTF_8));
        response = httpclient.execute(httpost);
        // ...

For more details, have a look at this link:
http://groovy.329449.n5.nabble.com/Access-login-protected-page-from-external-server-td4531240.html

Original comment by mansur.u...@gmail.com on 3 Jan 2012 at 7:09

GoogleCodeExporter commented 9 years ago
I want this feature.

Original comment by pikote...@gmail.com on 31 Mar 2012 at 5:35

GoogleCodeExporter commented 9 years ago
I want a common PageFetcher or HttpClient shared among all CrawlControllers so
that they can share cookies.
Anyone who wants to crawl various sites at arbitrary times will need this.
In the current design, you have to call PageFetcher#getHttpClient().execute()
separately for every CrawlController.
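For what it's worth, a single PageFetcher instance can already be handed to
more than one controller with the 3.x constructors, which gives them one
HttpClient and one cookie store. A minimal sketch (the storage folders are
placeholder paths, and this assumes the
CrawlController(CrawlConfig, PageFetcher, RobotstxtServer) constructor):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    CrawlConfig configA = new CrawlConfig();
    configA.setCrawlStorageFolder("/tmp/crawl-a");
    CrawlConfig configB = new CrawlConfig();
    configB.setCrawlStorageFolder("/tmp/crawl-b");

    // one fetcher -> one HttpClient -> one cookie store for both crawls
    PageFetcher sharedFetcher = new PageFetcher(configA);
    RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), sharedFetcher);

    CrawlController controllerA = new CrawlController(configA, sharedFetcher, robots);
    CrawlController controllerB = new CrawlController(configB, sharedFetcher, robots);
    // a login cookie obtained through sharedFetcher is now visible to both crawls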

Original comment by pikote...@gmail.com on 31 Mar 2012 at 9:13

GoogleCodeExporter commented 9 years ago
My customized crawler4j for login, and example code.

Original comment by pikote...@gmail.com on 2 Apr 2012 at 1:21

Attachments:

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
The login is somehow not working in my case.
This is how the code looks, in LoginCrawlController.java:

    somesite = new LoginConfiguration("www.xxx.in",
            new URL("http://www.xxx.in/Login.aspx"),
            new URL("http://www.xxx.in/Login.aspx"));
    // The site uses a .NET post-back, so the login form and the action are the same.
    somesite.addParam("txtName", "myusername");
    somesite.addParam("txtPassword", "mypassword");
    somesite.addParam("btnLogin", "Login");

    controller.addSeed("http://www.xxx.in/GoalSheet.aspx");

In the processPage() method the status code for GoalSheet.aspx is 302, so it
adds the movedToUrl (Login.aspx) to the list of sites.

Original comment by amit.mal...@gmail.com on 16 May 2012 at 1:24

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
The cause of re-crawling Login.aspx is that your crawler failed to log in to
"www.xxx.in".
Possible causes I can think of:
- the URL in the form's action attribute
- the addParam() settings
- www.xxx.in checks the User-Agent, so your crawler was rejected.

In my sample code, "https://secure.xxxxxxxxxxxx.com/login_post" is the
<form action="here!"> </form> in the HTML of the page you want to log in to.
If the site you want to log in to is "http://***.com" and the URL in the form's
action is "", you have to set "http://***.com".

addParam() is for setting POST parameters.
The keys of the parameters depend on the page;
the values of the parameters depend on your account.

Please also try another site if you still cannot log in.

Original comment by pikote...@gmail.com on 16 May 2012 at 11:09

GoogleCodeExporter commented 9 years ago
It works with another site.

Not sure what's wrong with the one I am trying:
- It is an ASP.NET site with "post back", so the form action is the same as the
login form itself. Even Firebug shows the same URL after the post.

- The addParam() values seem correct, verified in Firebug.

- The site doesn't check the User-Agent. I am wondering if the ASP.NET
__VIEWSTATE etc. is expected when the form posts, but there is no obvious way
to capture it (see the sketch below).
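If __VIEWSTATE is the problem, it is just a hidden input on the login page, so
it could be fetched first and posted back with the credentials. A sketch (the
helper class is hypothetical, not part of the patch; the regex follows standard
ASP.NET markup):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.util.EntityUtils;

    public class ViewStateHelper {
        // ASP.NET renders: <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="..."/>
        private static final Pattern VIEWSTATE =
                Pattern.compile("id=\"__VIEWSTATE\"[^>]*value=\"([^\"]*)\"");

        // GET the login page and pull out the __VIEWSTATE value, if present
        public static String fetchViewState(HttpClient client, String loginUrl)
                throws Exception {
            String html = EntityUtils.toString(
                    client.execute(new HttpGet(loginUrl)).getEntity());
            Matcher m = VIEWSTATE.matcher(html);
            return m.find() ? m.group(1) : null;
        }
    }

The extracted value would then go through something like
somesite.addParam("__VIEWSTATE", viewState) before the login post;
__EVENTVALIDATION usually needs the same treatment.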

Original comment by asitkuma...@gmail.com on 17 May 2012 at 1:44

GoogleCodeExporter commented 9 years ago
Post-back was not needed in my case.
If you use Eclipse, it is easy to modify crawler4j yourself.

I can say this method is the one involved:
edu.uci.ics.crawler4j.crawler.WebCrawler.login(LoginConfiguration)
At least it is a good "thin end of the wedge".

Original comment by pikote...@gmail.com on 20 May 2012 at 6:19

GoogleCodeExporter commented 9 years ago
hi tenta piko

I am stuck deploying your login patch on my desktop. I am using an Eclipse
workspace for my code.
Please guide me on how to get started with your login patch and work with your
customized code.

Original comment by gnana2...@gmail.com on 23 Aug 2013 at 12:56

GoogleCodeExporter commented 9 years ago
I have already moved to another project,
so I cannot remember the details.

Hmmm... I'm sorry.

Original comment by pikote...@gmail.com on 3 Sep 2013 at 2:02

GoogleCodeExporter commented 9 years ago
I needed to use basic HTTP authentication in order to crawl a site. The
following code worked for me. Just do this prior to starting the crawl.

    // org.apache.http.auth.AuthScope, org.apache.http.auth.UsernamePasswordCredentials,
    // org.apache.http.impl.client.DefaultHttpClient
    DefaultHttpClient client =
            (DefaultHttpClient) controller.getPageFetcher().getHttpClient();
    client.getCredentialsProvider().setCredentials(
            new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT, AuthScope.ANY_REALM),
            new UsernamePasswordCredentials(userName, password));

Original comment by patrick....@gmail.com on 17 Sep 2013 at 3:34

GoogleCodeExporter commented 9 years ago
Hi Patrick, I would just like to ask: does the code you presented work
properly, or does it need some customization?

thanks,

Original comment by aalab...@gmail.com on 31 Mar 2014 at 1:59

GoogleCodeExporter commented 9 years ago
Patrick, did you try Facebook or Twitter with this code? If so, did it work
for you?
Please, I need help. Thanks.

Original comment by aalab...@gmail.com on 31 Mar 2014 at 4:33

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:09

GoogleCodeExporter commented 9 years ago
Fixed in Rev: 4388892aeb78

Original comment by avrah...@gmail.com on 26 Nov 2014 at 5:37

GoogleCodeExporter commented 9 years ago
Is it now possible to log in?

Original comment by ju...@gmx.net on 1 Dec 2014 at 2:17

GoogleCodeExporter commented 9 years ago
Yes,

Download the latest from trunk and follow the instructions in our Wiki:
https://code.google.com/p/crawler4j/wiki/Crawling_Password_Protected_Sites
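In short, authentication info is now registered on the CrawlConfig before the
crawl starts. A minimal sketch using the class names from the current trunk
(FormAuthInfo / BasicAuthInfo; the URLs, field names, and storage folder below
are placeholders):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.authentication.AuthInfo;
    import edu.uci.ics.crawler4j.crawler.authentication.BasicAuthInfo;
    import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;

    public class AuthExample {
        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl");

            // form login: the credentials are posted to the login URL with the
            // given form field names before crawling begins
            AuthInfo formAuth = new FormAuthInfo("myusername", "mypassword",
                    "http://example.com/Login.aspx", "txtName", "txtPassword");
            config.addAuthInfo(formAuth);

            // or HTTP Basic authentication:
            AuthInfo basicAuth = new BasicAuthInfo("myusername", "mypassword",
                    "http://example.com/");
            config.addAuthInfo(basicAuth);
        }
    }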

Original comment by avrah...@gmail.com on 1 Dec 2014 at 2:18