mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Enhancement: Add page response/status code in the URL List - To check broken links & parent page #106

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Can we also get the response code (404, 301, 302, etc.) and the page on
which the broken link was found, along with the URL list?

What is the expected output? What do you see instead?

What version of the product are you using?

Please provide any additional information below.

It would be nice and useful to get this in WebURL or in a separate list.

Original issue reported on code.google.com by jawaharp...@gmail.com on 15 Jan 2012 at 7:26

GoogleCodeExporter commented 9 years ago
I think it is possible to get the URL from parentDocid, which would solve one
of the problems above. Can you tell me how to get the URL from parentDocid
without iterating over the WebURL objects? I just want to know whether a
built-in method already exists somewhere.

Original comment by jawaharp...@gmail.com on 15 Jan 2012 at 7:42

GoogleCodeExporter commented 9 years ago
DocidServer.getDocid gives you the mapping from a docid to URL.

-Yasser

Original comment by ganjisaffar@gmail.com on 15 Jan 2012 at 5:20

GoogleCodeExporter commented 9 years ago
Great. For the status code, I have added methods to WebURL.java. Can you
tell me where I should set it? Maybe with a one-line call?
I have this method:
WebURL.setPageStatusCode(int statusCode)

Or is it possible to set it via DocIdServer or some other method? Please let
me know and I will send the fix for this.

Original comment by jawaharp...@gmail.com on 16 Jan 2012 at 1:08

GoogleCodeExporter commented 9 years ago
By the way, doesn't DocIdServer.getDocId return an int?

public int getDocId(String url)   -- DocIdServer.java

Original comment by jawaharp...@gmail.com on 16 Jan 2012 at 1:21

GoogleCodeExporter commented 9 years ago
You're right. You can't get the URL from DocIdServer unless you loop over all
the URLs, which doesn't make sense. You can get the status code in WebCrawler
(from the fetchResult object).

-Yasser

Original comment by ganjisaffar@gmail.com on 16 Jan 2012 at 3:20
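
One way to avoid looping over every URL is to keep the reverse mapping alongside the forward one. The sketch below is a standalone illustration, not crawler4j code: `DocIdIndex`, `getOrCreateDocId`, and `getUrl` are hypothetical names, and it assumes plain in-memory maps, whereas a real crawler would probably need persistent storage.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper (not part of crawler4j): stores both directions of the
// URL <-> docid mapping so a docid can be resolved without scanning all URLs.
final class DocIdIndex {
    private final Map<String, Integer> idByUrl = new HashMap<>();
    private final Map<Integer, String> urlById = new HashMap<>();
    private int lastId = 0;

    /** Returns the docid already assigned to the URL, or assigns the next free one. */
    synchronized int getOrCreateDocId(String url) {
        Integer existing = idByUrl.get(url);
        if (existing != null) {
            return existing;
        }
        int id = ++lastId;
        idByUrl.put(url, id);
        urlById.put(id, url);
        return id;
    }

    /** Reverse lookup: empty if the docid was never assigned. */
    synchronized Optional<String> getUrl(int docId) {
        return Optional.ofNullable(urlById.get(docId));
    }
}
```

With such an index, a parent docid could be turned back into its URL by calling `getUrl(parentDocid)`, at the cost of holding every URL in memory twice.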

GoogleCodeExporter commented 9 years ago
I am able to get the status code, but setting it on the WebURL is the problem.
How do I set it against the URL? It works only when I set it on the WebURL and
schedule it for crawling. What if I don't schedule pages such as 404s?

Original comment by jawaharp...@gmail.com on 16 Jan 2012 at 3:45
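
A possible workaround for unscheduled pages is to keep status codes in a side table keyed by URL instead of on the WebURL object itself. The following is only a sketch under that assumption: `StatusCodeRegistry` and its methods are hypothetical and do not exist in crawler4j, and treating any code of 400 or above as broken is a simplification.

```java
import java.util.Map;
import java.util.OptionalInt;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical side table (not a crawler4j class): remembers the last status
// code seen for each URL, whether or not the URL was scheduled for crawling.
final class StatusCodeRegistry {
    private final Map<String, Integer> statusByUrl = new ConcurrentHashMap<>();

    /** Stores or overwrites the status code observed for the URL. */
    void record(String url, int statusCode) {
        statusByUrl.put(url, statusCode);
    }

    /** Returns the recorded status code, or empty if the URL was never fetched. */
    OptionalInt statusOf(String url) {
        Integer code = statusByUrl.get(url);
        return code == null ? OptionalInt.empty() : OptionalInt.of(code);
    }

    /** Assumption: any 4xx or 5xx response counts as a broken link. */
    boolean isBroken(String url) {
        Integer code = statusByUrl.get(url);
        return code != null && code >= 400;
    }
}
```

The crawler would call `record` wherever it already sees the fetch result, and any later code could query `statusOf` or `isBroken` by URL.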

GoogleCodeExporter commented 9 years ago
Hi Yasser
Is there any help on setting the status code in WebURL or in any other object?
Basically, I want to access the status codes in the shouldVisit method.

Regards

Original comment by w3engine...@gmail.com on 19 Jan 2012 at 4:38

GoogleCodeExporter commented 9 years ago
This feature is now implemented (available in the source code) and will be
included in the next release. See
http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/statushandler/
for an example.

-Yasser

Original comment by ganjisaffar@gmail.com on 22 Jan 2012 at 8:10

GoogleCodeExporter commented 9 years ago
I checked it. Thanks. But right now I cannot find the URL of the page that
contains the broken link. The parentURL is required in this case as well.

Original comment by w3engine...@gmail.com on 27 Jan 2012 at 11:02
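
The missing piece described above, pairing a broken link with the page it was found on, could be sketched as follows. This is a standalone illustration rather than crawler4j's implementation: `BrokenLinkReport`, `linkDiscovered`, and `fetched` are hypothetical names, only the first parent seen for a link is kept, and seed URLs have no parent (null).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical report (not part of crawler4j): remembers where each link was
// first discovered and lists the links whose fetch returned an error status.
final class BrokenLinkReport {
    static final class Entry {
        final String url;
        final String parentUrl; // null for seed URLs that have no parent
        final int statusCode;

        Entry(String url, String parentUrl, int statusCode) {
            this.url = url;
            this.parentUrl = parentUrl;
            this.statusCode = statusCode;
        }
    }

    private final Map<String, String> parentByUrl = new HashMap<>();
    private final List<Entry> broken = new ArrayList<>();

    /** Call when a link is extracted from a page; only the first parent is kept. */
    synchronized void linkDiscovered(String parentUrl, String url) {
        parentByUrl.putIfAbsent(url, parentUrl);
    }

    /** Call after each fetch; assumption: status codes of 400 and above are broken. */
    synchronized void fetched(String url, int statusCode) {
        if (statusCode >= 400) {
            broken.add(new Entry(url, parentByUrl.get(url), statusCode));
        }
    }

    /** Returns a copy so callers cannot modify the internal list. */
    synchronized List<Entry> brokenLinks() {
        return new ArrayList<>(broken);
    }
}
```

Keeping only the first parent keeps memory bounded; a report that needs every referring page would have to store a list of parents per URL instead.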