Closed by GoogleCodeExporter 9 years ago
I need a real URL with that scenario so I can test it
Original comment by avrah...@gmail.com
on 11 Aug 2014 at 2:39
Ok, so we are trying to reproduce this issue with this URL: http://off.net.mk/
The crawler gets the same URLs again and again, as if it's stuck in a loop.
We hope there is some solution for this issue.
Original comment by ilce.bog...@x3mlabs.com
on 12 Aug 2014 at 4:00
I ran the basic crawler on your seed: http://off.net.mk/
I did it for 15 minutes and crawled 500 URLs.
No URL was duplicate.
I logged every URL I visited and am attaching the list.
As you can see - no URL appears twice.
This is my "shouldVisit" logic:
String href = url.getURL().toLowerCase();
return !FILTERS.matcher(href).matches()
        && href.startsWith("http://off.net.mk/");
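The filter logic above can be exercised standalone, without the crawler4j types. Note that the FILTERS pattern below is an assumed example (the original comment does not show its definition), so treat it as a sketch:

```java
import java.util.regex.Pattern;

public class ShouldVisitSketch {
    // Hypothetical FILTERS pattern; the thread never shows the real one.
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|mp3|zip|gz))$");

    // Mirrors the shouldVisit logic quoted above, on plain strings.
    static boolean shouldVisit(String rawUrl) {
        String href = rawUrl.toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://off.net.mk/");
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://off.net.mk/some-article")); // true
        System.out.println(shouldVisit("http://off.net.mk/style.css"));    // false: filtered
        System.out.println(shouldVisit("http://example.com/page"));        // false: off-domain
    }
}
```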
This is my "visit" method:
String url = page.getWebURL().getURL();
logger.info("VISIT: {}", url);
if (!visitedUrls.add(url)) { // visitedUrls is a Set
    logger.error("Alert! Same URL: {}", url);
}
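One caveat worth noting: crawler4j calls visit() from multiple crawler threads, so the visitedUrls set should be thread-safe, or duplicates could in principle be miscounted. A minimal sketch of a safe duplicate detector (class and method names are my own, not crawler4j's):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DuplicateDetector {
    // A concurrent set: multiple crawler threads may call add() at once,
    // and a plain HashSet is not safe under concurrent mutation.
    private final Set<String> visitedUrls = ConcurrentHashMap.newKeySet();

    /** Returns true the first time a URL is seen, false on any repeat. */
    public boolean firstVisit(String url) {
        return visitedUrls.add(url);
    }

    public static void main(String[] args) {
        DuplicateDetector d = new DuplicateDetector();
        System.out.println(d.firstVisit("http://off.net.mk/")); // true: first sighting
        System.out.println(d.firstVisit("http://off.net.mk/")); // false: duplicate
    }
}
```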
Original comment by avrah...@gmail.com
on 13 Aug 2014 at 6:24
Attachments:
Hi Avi,
We know that it's not an issue on 500 pages. It happens when we leave the
crawler running for 1 or 2 days with about 10 threads. There are situations
where the same URL reaches the visit method about 150 times. I suggest you
leave your crawler on the mentioned URL for a longer time and log every URL as we do.
Thanks.
Original comment by emrah.me...@x3mlabs.com
on 13 Aug 2014 at 8:03
Thank you for the quick answer. But as Emrah says, this issue happens on
long runs, say above 50,000 pages.
Thank you again.
Original comment by ilce.bog...@x3mlabs.com
on 13 Aug 2014 at 8:08
Folks,
I am sorry for the late reply.
The server against which I faced this issue is down. I am looking for some
other site where this can be easily reproduced.
Alternatively, I may create a simple web server where I can show the issue
being reproduced.
Please lend me some time for this; I will be back soon.
Thanks.
Manish Swarnakar
Original comment by Swarnaka...@gmail.com
on 13 Aug 2014 at 8:55
I am re-testing and leaving it for the night.
Please note, though, that if a server goes down during testing it might be
because the crawler is hammering it with requests, effectively a denial-of-service attack...
Original comment by avrah...@gmail.com
on 13 Aug 2014 at 7:52
10 crawlers
No politeness
Maximum depth of crawling
Maximum number of pages
Will run it now for at least 14 hours and revert with conclusions
Original comment by avrah...@gmail.com
on 13 Aug 2014 at 8:02
Hey Avi,
Thanks for your feedback; don't worry about the server. Let's see what happens
when you have 50,000-100,000 pages crawled.
In the meantime, how are you checking for duplicate URLs if you put all the
URLs in a file?
About the DB used by crawler4j: I suppose it's using the in-memory DB "Oracle
Berkeley DB". My question here is: what happens if the DB's limit is reached,
e.g. if it has a 300 MB limit in RAM and we exceed it? Or is there no limit
on the size of the DB?
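As an aside, one quick way to compare the on-disk DB size with heap usage is to sum the file sizes under the crawl storage folder. A small sketch; the directory you would actually pass in (e.g. the frontier folder under your configured crawl storage path) depends on your configuration, so the path in main() is only a placeholder:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class FolderSize {
    /** Sums the sizes of all regular files under a directory, recursively. */
    static long directorySizeBytes(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder: point this at your crawl storage / frontier folder.
        System.out.println(directorySizeBytes(Paths.get(".")) + " bytes");
    }
}
```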
Thanks.
Original comment by emrah.me...@x3mlabs.com
on 14 Aug 2014 at 7:28
Folks,
The issue didn't reproduce for me with the latest crawler4j release, version 3.5.
I looked at the code...
WebCrawler.class
Method: processPage()
if (statusCode != HttpStatus.SC_OK) {
    if (statusCode == HttpStatus.SC_MOVED_PERMANENTLY || statusCode == HttpStatus.SC_MOVED_TEMPORARILY) {
        if (myController.getConfig().isFollowRedirects()) {
            String movedToUrl = fetchResult.getMovedToUrl();
            if (movedToUrl == null) {
                logger.warn("Unexpected error, URL: {} is redirected to NOTHING", curURL);
                return;
            }
            // The code below takes care of the issue that was raised about a year ago.
            // Surely, later changes might have resolved the issue.
            int newDocId = docIdServer.getDocId(movedToUrl);
            if (newDocId > 0) {
                logger.debug("Redirect page: {} is already seen", curURL);
                return;
            }
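The guard above asks the DocIDServer whether the redirect target was already assigned a document id, and skips it if so. A standalone sketch of that bookkeeping (names and behavior are simplified assumptions, not crawler4j's actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

public class DocIdRegistrySketch {
    private final Map<String, Integer> docIds = new HashMap<>();
    private int nextId = 1;

    /** Mimics DocIDServer.getDocId: a positive id if seen before, -1 otherwise. */
    int getDocId(String url) {
        return docIds.getOrDefault(url, -1);
    }

    /** Assigns a new positive id to a previously unseen URL. */
    int register(String url) {
        return docIds.computeIfAbsent(url, u -> nextId++);
    }

    public static void main(String[] args) {
        DocIdRegistrySketch server = new DocIdRegistrySketch();
        String movedToUrl = "http://off.net.mk/final-page";
        if (server.getDocId(movedToUrl) > 0) {
            System.out.println("Redirect target already seen, skipping");
        } else {
            server.register(movedToUrl);
            System.out.println("Scheduling redirect target"); // first encounter
        }
    }
}
```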
As Emrah says, the issue still happens on long runs.
You can check and decide on this.
Thanks
Original comment by Swarnaka...@gmail.com
on 14 Aug 2014 at 10:58
Ok guys, these are my findings.
My crawler ran for more than 15 hours.
Not one URL was repeated in the "visit" method.
BerkeleyDB is disk-based, not in-memory, so it shouldn't exhaust your memory,
although I might be wrong; a profiler is a good tool to check this.
How did I check?
I created a set of Strings and populated it with every visited URL.
While adding a URL to the set I checked whether it already existed; if it
did, I logged a special error.
This error was never logged, although I crawled more than 50,000 links from
this domain!
This is the specific code I used in the visit method:
String url = page.getWebURL().getURL();
if (!visitedUrls.add(url)) {
    logger.error("Alert! Same URL: {}", url);
}
See the Javadoc of the Set::add method.
Adding all URLs to a set isn't memory-optimized, but it served me well as a
quick hack; before I shut the crawler down it consumed 800 MB of memory.
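If memory becomes a concern for this kind of check, a common alternative to storing full URL strings is to store a fixed-size hash of each URL. This is my own sketch (not part of crawler4j), trading a tiny collision risk for much less memory per entry; note that MessageDigest instances are not thread-safe, so each thread would need its own:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class HashedUrlSet {
    // Store the first 8 bytes of a SHA-256 digest per URL instead of the
    // full string: 8 bytes each instead of dozens, at a tiny risk of a
    // hash collision producing a false "duplicate" report.
    private final Set<Long> seen = new HashSet<>();
    private final MessageDigest sha256;

    HashedUrlSet() throws NoSuchAlgorithmException {
        sha256 = MessageDigest.getInstance("SHA-256");
    }

    /** Returns true the first time a URL's hash is seen, false on repeats. */
    boolean add(String url) {
        byte[] digest = sha256.digest(url.getBytes(StandardCharsets.UTF_8));
        return seen.add(ByteBuffer.wrap(digest).getLong());
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        HashedUrlSet set = new HashedUrlSet();
        System.out.println(set.add("http://off.net.mk/a")); // true
        System.out.println(set.add("http://off.net.mk/a")); // false: duplicate
    }
}
```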
Original comment by avrah...@gmail.com
on 14 Aug 2014 at 12:44
Hey Avi,
Are you on the up-to-date 3.6 (trunk) release or on 3.5? I mean, which
version did you use when trying to reproduce this issue?
About the DB: why is the data folder always only about 1 MB, while memory
usage for the crawler process is more than 300 MB?
Thanks.
Original comment by emrah.me...@x3mlabs.com
on 15 Aug 2014 at 1:14
I always use the latest from trunk (which is 3.6-SNAPSHOT).
About the DB, I am not sure, so I don't want to state an opinion on something
I haven't studied yet.
But that is a very good question to post on the forum.
Original comment by avrah...@gmail.com
on 17 Aug 2014 at 4:29
Original comment by avrah...@gmail.com
on 23 Sep 2014 at 2:08
Original issue reported on code.google.com by
Swarnaka...@gmail.com
on 25 Apr 2013 at 10:56