mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Add parenturl in webURL #113

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Add a getParentUrl method to WebURL. It could be used to check which page 
links to the current page and which pages contain broken links.

What is the expected output? What do you see instead?
Right now we only get the parentDocId, from which the parentUrl cannot be recovered.

What version of the product are you using?

Please provide any additional information below.

Original issue reported on code.google.com by w3engine...@gmail.com on 23 Jan 2012 at 3:07

GoogleCodeExporter commented 9 years ago
The siteURL would be useful too. I will be supplying a lot of seed URLs, and in 
the visit() method I need to know the siteURL as well as the parentURL.

Original comment by w3engine...@gmail.com on 23 Jan 2012 at 9:30

GoogleCodeExporter commented 9 years ago
Good features.

Original comment by mansur.u...@gmail.com on 24 Jan 2012 at 9:53

GoogleCodeExporter commented 9 years ago
Would be a very nice feature. +1

Original comment by milkdata...@gmail.com on 24 Jan 2012 at 10:32

GoogleCodeExporter commented 9 years ago
Hi Yasser, is there any solution for this?

Original comment by w3engine...@gmail.com on 31 Jan 2012 at 11:31

GoogleCodeExporter commented 9 years ago
Hi,

I just changed the WebURL class:

class WebURL {
    ...

    private boolean isBaseUrlSet;
    private String baseURL; // the site URL

    ...

    public void setURL(String url) {
        this.url = url; // may be the redirected URL

        // set the base URL only once, on the first call
        if (!isBaseUrlSet) {
            baseURL = url;
            isBaseUrlSet = true;
        }
    }

    ...
}

The parent URL can only be set in the WebCrawler.processPage(..) method. This 
means WebURL itself has to be changed to carry it, for the time being.
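
To make the set-once behaviour concrete, here is a minimal standalone sketch. `SiteAwareUrl` is a hypothetical stand-in, not the real crawler4j WebURL class; it only reproduces the two fields discussed above so the behaviour can be run in isolation.

```java
// Hypothetical stand-in for WebURL; not part of crawler4j.
class SiteAwareUrl {
    private String url;           // current URL, may change after a redirect
    private String baseURL;       // first URL ever assigned to this object
    private boolean isBaseUrlSet; // guards baseURL against later overwrites

    public void setURL(String url) {
        this.url = url;
        // Only the first call records the base URL.
        if (!isBaseUrlSet) {
            baseURL = url;
            isBaseUrlSet = true;
        }
    }

    public String getURL() {
        return url;
    }

    public String getBaseURL() {
        return baseURL;
    }

    public static void main(String[] args) {
        SiteAwareUrl u = new SiteAwareUrl();
        u.setURL("http://example.com/");         // original URL
        u.setURL("http://www.example.com/home"); // assumed redirect target
        System.out.println("current: " + u.getURL());
        System.out.println("base:    " + u.getBaseURL());
    }
}
```

Note that in this sketch the base URL is simply the first value passed to setURL on that particular object. Whether that matches the seed URL of the crawl depends on how and where the objects are created, which the sketch does not model.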

Regards

Original comment by mansur.u...@gmail.com on 31 Jan 2012 at 1:03

GoogleCodeExporter commented 9 years ago
I am not sure this solution works, because the parent URL would also have to 
be persisted if it is needed in the visit method. Anyway, I will try to add this 
feature over the weekend.

-Yasser

Original comment by ganjisaffar@gmail.com on 31 Jan 2012 at 8:57

GoogleCodeExporter commented 9 years ago
I have done it as well and it works fine. I created a getter and a setter for 
parentURL in WebURL:

public String getParentURL() {
    return parentURL;
}

public void setParentURL(String parentURL) {
    this.parentURL = parentURL;
}

and added this in WebCrawler.java (in two places, lines 249 and 284):

webURL.setParentURL(curURL.getURL());

But the main issue is how to get the parentURL for broken links in the 
method below:

handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription)

Any suggestions for that?
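
One possible direction, sketched with hypothetical stand-in classes (`LinkedUrl` and `BrokenLinkReport` are not crawler4j classes): if the parent URL stamped on each link during extraction stays attached to the URL object until that link is fetched, the status handler can read it straight from its first argument. In the real crawler this assumes the field survives whatever storage sits between scheduling and fetching, which the sketch does not model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for WebURL, reduced to the two fields needed here.
class LinkedUrl {
    private final String url;
    private String parentURL; // page on which this link was found; null for seeds

    LinkedUrl(String url) {
        this.url = url;
    }

    String getURL() {
        return url;
    }

    String getParentURL() {
        return parentURL;
    }

    void setParentURL(String parentURL) {
        this.parentURL = parentURL;
    }
}

// Hypothetical collector that mimics the two crawler steps discussed above.
class BrokenLinkReport {
    private final List<String> entries = new ArrayList<>();

    // Mimics the link-extraction step: stamp every outgoing link with the current page.
    static List<LinkedUrl> extractLinks(LinkedUrl current, List<String> hrefs) {
        List<LinkedUrl> links = new ArrayList<>();
        for (String href : hrefs) {
            LinkedUrl child = new LinkedUrl(href);
            child.setParentURL(current.getURL());
            links.add(child);
        }
        return links;
    }

    // Same parameter shape as the method quoted above; records 4xx and 5xx responses.
    void handlePageStatusCode(LinkedUrl webUrl, int statusCode, String statusDescription) {
        if (statusCode >= 400) {
            entries.add(statusCode + " " + webUrl.getURL() + " <- " + webUrl.getParentURL());
        }
    }

    List<String> getEntries() {
        return entries;
    }

    public static void main(String[] args) {
        LinkedUrl page = new LinkedUrl("http://example.com/index.html");
        BrokenLinkReport report = new BrokenLinkReport();
        List<String> hrefs = Arrays.asList("http://example.com/missing.html");
        for (LinkedUrl link : extractLinks(page, hrefs)) {
            // The 404 is hard-coded here; a real crawler would get it from the fetch.
            report.handlePageStatusCode(link, 404, "Not Found");
        }
        for (String entry : report.getEntries()) {
            System.out.println(entry);
        }
    }
}
```

Each report entry pairs the failing URL with the page that linked to it, which is the information needed to locate broken links.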

Original comment by w3engine...@gmail.com on 1 Feb 2012 at 2:44

GoogleCodeExporter commented 9 years ago
This is currently implemented in the source code: 
http://code.google.com/p/crawler4j/source/detail?r=2f6a89cfd07bf6e87f92f361359d0fbca81b634d

Will be included in the next release.

-Yasser

Original comment by ganjisaffar@gmail.com on 4 Feb 2012 at 10:10