I checked the PageFetcher code and found that website authentication is not
supported. Is it possible to achieve this with the current version of
crawler4j, or do we need to modify the code? It would be a great help to get
sample code that accesses a password-protected website (if the current
crawler4j supports this).
Original comment by arjunpn...@gmail.com
on 24 Oct 2011 at 11:43
Seconding the request; this seems like a required feature.
Original comment by slava.ri...@gmail.com
on 23 Nov 2011 at 8:20
It would be really great if such authentication were supported by the crawler.
Web page login form authentication has been a real problem for me too.
Original comment by mansur.u...@gmail.com
on 24 Dec 2011 at 5:39
I have the following temporary solution:
import java.util.ArrayList;
import java.util.List;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.protocol.HTTP;

HttpPost httpost = new HttpPost(toFetchURL);
List<NameValuePair> nvps = new ArrayList<NameValuePair>();
//nvps.add(new BasicNameValuePair("formId", "loginform")); //not required in my case
//nvps.add(new BasicNameValuePair("action", "action name")); //not required in my case
nvps.add(new BasicNameValuePair("name of the user name input field", "username"));
nvps.add(new BasicNameValuePair("name of the password input field", "password"));
try {
    httpost.setEntity(new UrlEncodedFormEntity(nvps, HTTP.UTF_8));
    response = httpclient.execute(httpost);
    ....
For more details, have a look at this link:
http://groovy.329449.n5.nabble.com/Access-login-protected-page-from-external-server-td4531240.html
Original comment by mansur.u...@gmail.com
on 3 Jan 2012 at 7:09
I want this feature.
Original comment by pikote...@gmail.com
on 31 Mar 2012 at 5:35
I want a common PageFetcher or HttpClient shared among all CrawlControllers so
that cookies can be shared.
If anyone wants to crawl various sites at random times, this will be needed.
With the current specification, you have to perform
PageFetcher#getHttpClient().execute() for every CrawlController.
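A rough sketch of what sharing one PageFetcher (and therefore one cookie store) could look like, assuming the crawler4j 3.x API where CrawlController takes a PageFetcher in its constructor; the storage folders and overall wiring here are assumptions, not tested code:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

// Assumes the surrounding method declares "throws Exception"
// (the CrawlController constructor does).
CrawlConfig configA = new CrawlConfig();
configA.setCrawlStorageFolder("/tmp/crawlA");
CrawlConfig configB = new CrawlConfig();
configB.setCrawlStorageFolder("/tmp/crawlB");

// One PageFetcher means one underlying HttpClient, and therefore one shared cookie store.
PageFetcher sharedFetcher = new PageFetcher(configA);
RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), sharedFetcher);

// Both controllers reuse the same PageFetcher, so a login performed through
// sharedFetcher.getHttpClient().execute(...) before starting the crawls is visible to both.
CrawlController controllerA = new CrawlController(configA, sharedFetcher, robotstxtServer);
CrawlController controllerB = new CrawlController(configB, sharedFetcher, robotstxtServer);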
Original comment by pikote...@gmail.com
on 31 Mar 2012 at 9:13
My customized crawler4j for login, with example code.
Original comment by pikote...@gmail.com
on 2 Apr 2012 at 1:21
Attachments:
[deleted comment]
The login is somehow not working in my case.
This is what the code looks like (LoginCrawlController.java):

somesite = new LoginConfiguration("www.xxx.in",
    new URL("http://www.xxx.in/Login.aspx"), new URL("http://www.xxx.in/Login.aspx"));
// The site uses a .NET post back, so the login form and the action are the same.
somesite.addParam("txtName", "myusername");
somesite.addParam("txtPassword", "mypassword");
somesite.addParam("btnLogin", "Login");
controller.addSeed("http://www.xxx.in/GoalSheet.aspx");

In the ProcessPage() method the status code for GoalSheet.aspx is 302, so it
adds the movedToUrl (Login.aspx) to the list of sites.
Original comment by amit.mal...@gmail.com
on 16 May 2012 at 1:24
[deleted comment]
The reason Login.aspx is re-crawled is that your crawler failed to log in to
"www.xxx.in".
Causes I can think of:
- the URL of the form's action
- the addParam() settings
- www.xxx.in checks the User-Agent, so your crawler was rejected.
In my sample code, "https://secure.xxxxxxxxxxxx.com/login_post" is the
<form action="here!"> </form> of the HTML page you want to log in to.
If the site you want to log in to is "http://***.com" and the form's action
URL is "", you have to set "http://***.com".
addParam() is for setting the POST parameters:
the keys of the parameters depend on the page,
the values of the parameters depend on your account.
Please try another site too if you still fail to log in.
Original comment by pikote...@gmail.com
on 16 May 2012 at 11:09
It works with another site.
Not sure what's wrong with the one I am trying:
- It is an ASP.NET site with "post back", so the form action is the same as the
login form itself. Even Firebug shows the same URL after the post.
- The addParam() settings seem correct, verified in Firebug.
- The site doesn't check the User-Agent. I am wondering if the ASP.NET
__VIEWSTATE etc. is expected when the form posts, but there is no way to find
them.
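For what it's worth, hidden ASP.NET fields such as __VIEWSTATE and __EVENTVALIDATION can usually be recovered by GETting the login page first and copying its hidden inputs into the POST body. A rough regex-based sketch, not from the attached patch; the field names are the ones from the earlier comment, and a real HTML parser would be more robust than this regex:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.http.NameValuePair;
import org.apache.http.client.HttpClient;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

// Assumes the surrounding method declares "throws Exception".
HttpClient httpClient = controller.getPageFetcher().getHttpClient();
String loginUrl = "http://www.xxx.in/Login.aspx";

// Fetch the login page once so its hidden inputs can be read.
String html = EntityUtils.toString(httpClient.execute(new HttpGet(loginUrl)).getEntity());

List<NameValuePair> params = new ArrayList<NameValuePair>();
// Copy every hidden input (__VIEWSTATE, __EVENTVALIDATION, ...) into the POST parameters.
// The regex assumes the attribute order type/name/value, which is common but not guaranteed.
Matcher m = Pattern.compile(
    "<input[^>]*type=[\"']hidden[\"'][^>]*name=[\"']([^\"']+)[\"'][^>]*value=[\"']([^\"']*)[\"']")
    .matcher(html);
while (m.find()) {
    params.add(new BasicNameValuePair(m.group(1), m.group(2)));
}
params.add(new BasicNameValuePair("txtName", "myusername"));
params.add(new BasicNameValuePair("txtPassword", "mypassword"));
params.add(new BasicNameValuePair("btnLogin", "Login"));

// The ASP.NET form posts back to the same page, so POST to the login URL itself.
HttpPost post = new HttpPost(loginUrl);
post.setEntity(new UrlEncodedFormEntity(params, "UTF-8"));
httpClient.execute(post);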
Original comment by asitkuma...@gmail.com
on 17 May 2012 at 1:44
Post-back was not needed in my case.
If you have used Eclipse, it is easy to modify crawler4j yourself.
I can say this method will be involved:
edu.uci.ics.crawler4j.crawler.WebCrawler.login(LoginConfiguration)
At least it is a good "thin edge of the wedge".
Original comment by pikote...@gmail.com
on 20 May 2012 at 6:19
Hi tenta piko,
I am stuck deploying your login patch code on my desktop. I am using an
Eclipse workspace to deploy my code.
Please guide me on how to get started with your login patch and work with your
customized code.
Original comment by gnana2...@gmail.com
on 23 Aug 2013 at 12:56
I have already moved to another project, so I cannot remember the details.
Hmmm... I'm sorry.
Original comment by pikote...@gmail.com
on 3 Sep 2013 at 2:02
I needed to use basic HTTP authentication in order to crawl a site. The
following code worked for me. Just do this prior to starting the crawl.

import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.DefaultHttpClient;

DefaultHttpClient client = (DefaultHttpClient) controller.getPageFetcher().getHttpClient();
client.getCredentialsProvider().setCredentials(
    new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT, AuthScope.ANY_REALM),
    new UsernamePasswordCredentials(userName, password));
Original comment by patrick....@gmail.com
on 17 Sep 2013 at 3:34
Hi patrick, I would just like to ask: does the code you presented work
properly?
Or does it need some customization?
Thanks,
Original comment by aalab...@gmail.com
on 31 Mar 2014 at 1:59
patrick, did you try Facebook or Twitter with this code? If so, did it work
for you?
Please, I need help... thanks.
Original comment by aalab...@gmail.com
on 31 Mar 2014 at 4:33
Original comment by avrah...@gmail.com
on 18 Aug 2014 at 3:09
Fixed in Rev: 4388892aeb78
Original comment by avrah...@gmail.com
on 26 Nov 2014 at 5:37
Is it now possible to log in?
Original comment by ju...@gmx.net
on 1 Dec 2014 at 2:17
Yes,
download the latest from trunk and follow the instructions in our wiki:
https://code.google.com/p/crawler4j/wiki/Crawling_Password_Protected_Sites
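For anyone finding this later, a minimal sketch of the kind of configuration that wiki page describes, assuming the 4.x authentication classes in edu.uci.ics.crawler4j.crawler.authentication; see the wiki for the exact constructor signatures:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.authentication.AuthInfo;
import edu.uci.ics.crawler4j.crawler.authentication.BasicAuthInfo;
import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;

// Assumes the surrounding method declares "throws Exception"
// (the AuthInfo constructors parse the login URL).
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl");

// HTTP Basic authentication for a protected site.
AuthInfo basicAuth = new BasicAuthInfo("username", "password", "http://example.com/");
config.addAuthInfo(basicAuth);

// Form-based login: credentials, the login page URL, and the names of the
// username and password form fields.
AuthInfo formAuth = new FormAuthInfo("username", "password",
    "http://example.com/login.php", "usernameFieldName", "passwordFieldName");
config.addAuthInfo(formAuth);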
Original comment by avrah...@gmail.com
on 1 Dec 2014 at 2:18
Original issue reported on code.google.com by
arjunpn...@gmail.com
on 24 Oct 2011 at 9:10