momzi / crawler4j

Automatically exported from code.google.com/p/crawler4j

Page already crawled gets crawled again. #215

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
When crawling with the seed link "http://www.example.com/a1/a1",
a child page is discovered, say
"http://www.example.com/a1/b1".

The b1 page links back to the a1 page, but through a redirect link, i.e.
http://www.example.com/a1, which redirects to www.example.com/a1/a1.
The page a1 has already been crawled, but it gets crawled again.

I am facing this issue.
However, for the time being I have worked around the problem by changing the WebCrawler processPage() method (lines marked with + are my additions):

if (statusCode != 200) {
    String movedToUrl = fetchResult.getMovedToUrl();
    WebURL webURL = createRedirectWebURL(curURL, movedToUrl);
    if ((webURL.getRedirectCount() < this.myController.getConfig().getMaxRedirectCount())
            && shouldVisit(webURL)
            && this.robotstxtServer.allows(webURL)) {
+       if (!docIdServer.isSeenBefore(movedToUrl)) {
            webURL.setDocid(this.docIdServer.getNewDocID(movedToUrl));
            this.frontier.schedule(webURL);
+       }
    }
    return;
}

I hope there is a cleaner solution.

Original issue reported on code.google.com by Swarnaka...@gmail.com on 25 Apr 2013 at 10:56

GoogleCodeExporter commented 9 years ago
I need a real URL with that scenario so I can test it.

Original comment by avrah...@gmail.com on 11 Aug 2014 at 2:39

GoogleCodeExporter commented 9 years ago
OK, we are trying to reproduce this issue with this URL: http://off.net.mk/.
The crawler fetches the same URLs again and again, as if it were stuck in a loop.
We hope there is a solution for this issue.

Original comment by ilce.bog...@x3mlabs.com on 12 Aug 2014 at 4:00

GoogleCodeExporter commented 9 years ago
I ran the basic crawler on your seed: http://off.net.mk/

I ran it for 15 minutes and crawled 500 URLs.

No URL was duplicated.

I logged every URL I visited and am attaching the list.

As you can see, no URL appears twice.

This is my "should visit":
String href = url.getURL().toLowerCase();
return !FILTERS.matcher(href).matches() && 
href.startsWith("http://off.net.mk/");

This is my "visit" method:
String url = page.getWebURL().getURL();
logger.info("VISIT: {}", url);
if (!visitedUrls.add(url)) { // THis is a set
    logger.error("Alert!, Same URL: {}", url);
}
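For context, a complete crawler class along these lines might look like the sketch below. It assumes the crawler4j 3.x API (single-argument shouldVisit, slf4j-style logger as in the snippet above); the class name and the FILTERS pattern are illustrative, not taken from this thread:

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class DuplicateCheckingCrawler extends WebCrawler {

    // Illustrative filter for static/binary resources.
    private static final Pattern FILTERS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|ico|pdf|zip|gz)$");

    // Shared across all crawler threads, hence a concurrent set.
    private static final Set<String> visitedUrls =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://off.net.mk/");
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        logger.info("VISIT: {}", url);
        // Set::add returns false if the element was already present,
        // i.e. if this URL reached visit() more than once.
        if (!visitedUrls.add(url)) {
            logger.error("Alert!, Same URL: {}", url);
        }
    }
}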

Original comment by avrah...@gmail.com on 13 Aug 2014 at 6:24

GoogleCodeExporter commented 9 years ago
Hi Avi,

We know that it's not an issue at 500 pages. It happens when we leave the crawler
running for 1 or 2 days with about 10 threads. There are situations where the same
URL reaches the visit method about 150 times. I suggest you leave your crawler on
the mentioned URL for a longer time and log every URL, as we do.

Thanks.

Original comment by emrah.me...@x3mlabs.com on 13 Aug 2014 at 8:03

GoogleCodeExporter commented 9 years ago
Thank you for the quick answer. But as Emrah says, this issue happens on long runs,
say above 50,000 pages.
Thank you again.

Original comment by ilce.bog...@x3mlabs.com on 13 Aug 2014 at 8:08

GoogleCodeExporter commented 9 years ago
Folks,

I am sorry for the late reply.
The server against which I faced this issue is down. I am looking for some other
site where this can be easily reproduced, or I may create a simple web server
where I can show the issue being reproduced.
Please lend me some time for this, and I will be back soon.

Thanks.
Manish Swarnakar

Original comment by Swarnaka...@gmail.com on 13 Aug 2014 at 8:55

GoogleCodeExporter commented 9 years ago
I am re-testing and leaving it for the night.

Please note, though, that if a server goes down during testing, it might be because
the crawler is hammering it with requests, effectively mounting a denial-of-service attack...

Original comment by avrah...@gmail.com on 13 Aug 2014 at 7:52

GoogleCodeExporter commented 9 years ago
10 crawler threads
No politeness delay
No limit on crawl depth
No limit on number of pages

I will run it now for at least 14 hours and report back with conclusions; the setup corresponds roughly to the sketch below.
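A sketch of such a controller configuration, assuming the crawler4j 3.x API; the storage folder path is a placeholder, and DuplicateCheckingCrawler refers to a crawler class like the one sketched earlier:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class LongRunTest {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j");  // placeholder path
        config.setPolitenessDelay(0);                    // "no politeness"
        config.setMaxDepthOfCrawling(-1);                // -1 = unlimited depth
        config.setMaxPagesToFetch(-1);                   // -1 = unlimited pages

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://off.net.mk/");
        controller.start(DuplicateCheckingCrawler.class, 10);  // 10 crawler threads
    }
}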

Original comment by avrah...@gmail.com on 13 Aug 2014 at 8:02

GoogleCodeExporter commented 9 years ago
Hey Avi,

Thanks for your feedback; don't worry about the server. Let's see what happens once
you have crawled 50,000-100,000 pages.

In the meantime, how are you checking for duplicate URLs if you write all the URLs
to a file?

About the DB used by crawler4j: I suppose it uses an in-memory DB, Oracle Berkeley
DB. My question is: what happens if the DB reaches its limit, say a 300 MB limit
in RAM that we exceed? Or is there no limit on the size of the DB?

Thanks.

Original comment by emrah.me...@x3mlabs.com on 14 Aug 2014 at 7:28

GoogleCodeExporter commented 9 years ago
Folks,

The issue didn't reproduce for me with the latest crawler4j release, version 3.5.
I looked at the code...

WebCrawler.class, method processPage():

if (statusCode != HttpStatus.SC_OK) {
    if (statusCode == HttpStatus.SC_MOVED_PERMANENTLY || statusCode == HttpStatus.SC_MOVED_TEMPORARILY) {
        if (myController.getConfig().isFollowRedirects()) {
            String movedToUrl = fetchResult.getMovedToUrl();
            if (movedToUrl == null) {
                logger.warn("Unexpected error, URL: {} is redirected to NOTHING", curURL);
                return;
            }
            // The code below takes care of the issue that was raised about a year ago.
            // Surely, later changes might have resolved the issue.
            int newDocId = docIdServer.getDocId(movedToUrl);
            if (newDocId > 0) {
                logger.debug("Redirect page: {} is already seen", curURL);
                return;
            }
            // (snippet truncated here)

As Emrah says, the issue still happens on long runs. You can check and decide on this.

Thanks

Original comment by Swarnaka...@gmail.com on 14 Aug 2014 at 10:58

GoogleCodeExporter commented 9 years ago
Ok guys, these are my findings.

My crawler ran for more than 15 hours.
Not one URL was repeated in the "visit" method.

Berkeley DB is disk-based, not in-memory, so it shouldn't exhaust your memory,
although I might be wrong; a profiler is a good tool to check this.

How did I check?
I created a set of Strings and populated it with every visited URL.
While adding a URL to the set, I checked whether it already existed there; if it
did, I logged a special error.

This error was never logged although I crawled more than 50,000 links from this 
domain!

This is the specific code I used in the visit method:
String url = page.getWebURL().getURL();
if (!visitedUrls.add(url)) {
    logger.error("Alert!, Same URL: {}", url);
}

See the javadoc for Set::add: it returns false if the set already contained the element.

Adding all URLs to a set isn't memory-optimized, but it served me well as a quick
hack; before I shut the crawler down it was consuming 800 MB of memory.
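A more memory-friendly variant of the same check could use a Bloom filter, at the cost of occasional false positives (a genuinely new URL very rarely misreported as seen). A minimal sketch, assuming Guava is on the classpath; the class and method names are illustrative:

import java.nio.charset.StandardCharsets;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class SeenUrls {
    // ~1,000,000 expected URLs at a 0.1% false-positive rate fits in a few MB,
    // versus hundreds of MB for a Set<String> holding every full URL.
    private final BloomFilter<String> seen = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 1000000, 0.001);

    // Returns true if the URL has (probably) been seen before,
    // and marks it as seen otherwise.
    public synchronized boolean checkAndMark(String url) {
        if (seen.mightContain(url)) {
            return true; // may rarely be a false positive
        }
        seen.put(url);
        return false;
    }
}

For pinpointing a duplicate-crawl bug, the exact Set approach above remains the right tool, since it never reports false positives; the Bloom filter only makes sense once the URL count outgrows the heap.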

Original comment by avrah...@gmail.com on 14 Aug 2014 at 12:44

GoogleCodeExporter commented 9 years ago
Hey Avi,

Are you on the up-to-date 3.6 (trunk) release or on 3.5? I mean, which version did
you use when trying to reproduce this issue?

About the DB: why is the data folder always only about 1 MB, while memory usage
for the crawler process is more than 300 MB?

Thanks.

Original comment by emrah.me...@x3mlabs.com on 15 Aug 2014 at 1:14

GoogleCodeExporter commented 9 years ago
I always use the latest from trunk (which is 3.6-SNAPSHOT).

About the DB, I am not sure, so I don't want to state an opinion on something I
haven't studied yet.

But that is a very good question to put on the forum.

Original comment by avrah...@gmail.com on 17 Aug 2014 at 4:29

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 23 Sep 2014 at 2:08