Closed GoogleCodeExporter closed 9 years ago
I think it is possible to get the URL from parentDocid, which would solve one
of the above problems. Can you tell me how to get the URL from parentDocid
without iterating over WebURL objects? I just want to know whether a built-in
method is hidden somewhere.
Original comment by jawaharp...@gmail.com
on 15 Jan 2012 at 7:42
DocidServer.getDocid gives you the mapping from a docid to URL.
-Yasser
Original comment by ganjisaffar@gmail.com
on 15 Jan 2012 at 5:20
Great. For the status code, I have added methods to WebURL.java. Can you tell
me where I should set it? Maybe a one-line call would do it?
I have this method:
WebURL.setPageStatusCode(int statusCode)
Or is it possible to set it via docIdServer or some other method? Please let
me know and I will send the fix for this.
Original comment by jawaharp...@gmail.com
on 16 Jan 2012 at 1:08
By the way, doesn't DocidServer.getDocid return an int?
public int getDocId(String url) -- DocIdServer.java
Original comment by jawaharp...@gmail.com
on 16 Jan 2012 at 1:21
You're right. You can't get the URL from DocidServer unless you loop over all
the URLs, which doesn't make sense. For the status code, you can get it in
WebCrawler (from the fetchResult object).
-Yasser
Original comment by ganjisaffar@gmail.com
on 16 Jan 2012 at 3:20
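For anyone who needs the reverse lookup discussed above, here is a minimal sketch of one way to avoid looping: keep a second map from doc ID to URL alongside the forward map. The class DocIdRegistry and its methods register and urlFor are hypothetical names for illustration only; this is not crawler4j's DocIDServer and makes no claim about how the library stores its IDs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of crawler4j: it keeps both directions of the
// URL <-> doc ID mapping so that a reverse lookup needs no iteration.
class DocIdRegistry {
    private final Map<String, Integer> idsByUrl = new HashMap<>();
    private final Map<Integer, String> urlsById = new HashMap<>();
    private int lastId = 0;

    // Returns the existing ID for a URL, or assigns the next free one.
    synchronized int register(String url) {
        Integer existing = idsByUrl.get(url);
        if (existing != null) {
            return existing;
        }
        int id = ++lastId;
        idsByUrl.put(url, id);
        urlsById.put(id, url);
        return id;
    }

    // Reverse lookup; returns null when the ID was never registered.
    synchronized String urlFor(int docId) {
        return urlsById.get(docId);
    }
}
```

With something like this, a parent doc ID could be turned back into a URL with a single map lookup, at the cost of holding a second map in memory.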
I am able to get the status code, but setting it in the WebURL is the problem.
How do I set it against the URL? It works only when I set it in WebURL and
schedule it for crawling. What if I don't schedule it, for pages such as 404s?
Original comment by jawaharp...@gmail.com
on 16 Jan 2012 at 3:45
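One possible workaround for the question above, sketched under assumptions: instead of storing the status code on the WebURL itself, record it in a shared side table keyed by URL at the moment the fetch result is known. That way the code is available even for pages that are never scheduled, such as 404s. StatusCodeStore, record, statusOf and isBroken are hypothetical names, not crawler4j API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical side table, not part of crawler4j: maps each fetched URL to the
// HTTP status code it returned, independently of whether it gets scheduled.
class StatusCodeStore {
    private final ConcurrentMap<String, Integer> codesByUrl = new ConcurrentHashMap<>();

    // Call this wherever the status code becomes known (e.g. after a fetch).
    void record(String url, int statusCode) {
        codesByUrl.put(url, statusCode);
    }

    // Returns the recorded status code, or null if the URL was never fetched.
    Integer statusOf(String url) {
        return codesByUrl.get(url);
    }

    // Treats any recorded 4xx or 5xx code as a broken page.
    boolean isBroken(String url) {
        Integer code = codesByUrl.get(url);
        return code != null && code >= 400;
    }
}
```

A single instance shared by all crawler threads could then be consulted from a method such as shouldVisit. ConcurrentHashMap is used here because several crawler threads may write at the same time.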
Hi Yasser,
Any help on setting the status code in WebURL or in any other object?
Basically, I want to access it in the shouldVisit method.
Regards
Original comment by w3engine...@gmail.com
on 19 Jan 2012 at 4:38
This feature is now implemented (available in the source code) and will be
included in the next release. See
http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/statushandler/
for an example.
-Yasser
Original comment by ganjisaffar@gmail.com
on 22 Jan 2012 at 8:10
I checked it. Thanks. But right now I cannot find the URL of the page that
contains the broken link. parentURL is required in this case as well.
Original comment by w3engine...@gmail.com
on 27 Jan 2012 at 11:02
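Until the parent URL is exposed directly, one way to find the page that contains a broken link is to remember, for every discovered link, which page it was found on. The following is a minimal sketch under that assumption. ParentLinkIndex, recordLink and parentOf are hypothetical names, not crawler4j API, and the sketch keeps only the first parent seen for each link.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical index, not part of crawler4j: remembers the page on which each
// link was first discovered so a broken link can be traced back to its source.
class ParentLinkIndex {
    private final Map<String, String> parentByChild = new ConcurrentHashMap<>();

    // Call this for every outgoing link while processing a page.
    // Only the first parent seen for a given child URL is kept.
    void recordLink(String parentUrl, String childUrl) {
        parentByChild.putIfAbsent(childUrl, parentUrl);
    }

    // Returns the page that linked to childUrl, or null for seeds and unknown URLs.
    String parentOf(String childUrl) {
        return parentByChild.get(childUrl);
    }
}
```

When a fetch later reports an error status for a URL, parentOf on that URL gives one page to fix. If every referring page matters, the map value could be a set of parents instead.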
Original issue reported on code.google.com by
jawaharp...@gmail.com
on 15 Jan 2012 at 7:26