mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Pages are incorrectly revisited on resume because URLs are not deleted from the frontier correctly #191

Closed. GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
    1. Set the seed to http://www.ics.uci.edu/ and enable resumable crawling: setResumableCrawling(true);
    2. Print each visited URL while processing a page.
    3. Run the crawler.
    4. Stop the crawler by killing the JVM or stopping the run in the IDE.
    5. Rerun the crawler without touching any files in the frontier.

What is the expected output? What do you see instead?
The resumed crawl should not revisit pages that were already visited,
and warning messages like
WARN [Crawler 1] Could not remove: [URL] list of processed pages.
should not be printed.
Instead, the pages are revisited and warnings like the one above are printed,
indicating that URLs are not being deleted correctly from the frontier database.
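
One way to picture why a failed removal could lead to revisits is the toy model below. It is only a sketch under assumptions: the class and method names (ToyFrontier, schedule, next, setProcessed, resume) are hypothetical and are not crawler4j's API, and the idea that URLs still marked as in process are re-queued on resume is an assumption about the mechanism, not something this report confirms.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: hypothetical names, not crawler4j's frontier implementation.
final class ToyFrontier {
    private final Deque<String> workQueue = new ArrayDeque<>();
    private final Set<String> inProcess = new LinkedHashSet<>();

    void schedule(String url) {
        workQueue.addLast(url);
    }

    /** Hands out the next URL and records it as in process; null if the queue is empty. */
    String next() {
        String url = workQueue.pollFirst();
        if (url != null) {
            inProcess.add(url);
        }
        return url;
    }

    /** Marks a URL as finished; returns false if it was not found in the in-process set. */
    boolean setProcessed(String url) {
        return inProcess.remove(url);
    }

    /** Assumed resume behavior: everything still in process is queued again. */
    List<String> resume() {
        List<String> requeued = new ArrayList<>(inProcess);
        workQueue.addAll(requeued);
        inProcess.clear();
        return requeued;
    }

    public static void main(String[] args) {
        ToyFrontier frontier = new ToyFrontier();
        frontier.schedule("http://www.ics.uci.edu/");
        frontier.schedule("http://www.ics.uci.edu/about/");

        String first = frontier.next();
        String second = frontier.next();

        // Removal with the matching key succeeds.
        frontier.setProcessed(first);

        // A mismatched key simulates a failed removal (the "Could not remove" case).
        if (!frontier.setProcessed(second + "?mismatch")) {
            System.out.println("WARN could not remove: " + second);
        }

        // The URL whose removal failed is queued again after the restart.
        System.out.println("Re-queued on resume: " + frontier.resume());
    }
}
```

In this model a page that was fully visited is still visited again after a restart whenever its removal from the in-process set fails, which matches the combination of symptoms described above.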

What version of the product are you using?
3.3

Please provide any additional information below.
Below is the page visit log from two crawl runs. As you can see, pages are
revisited.
http://www.ics.uci.edu/
http://www.ics.uci.edu/about/
http://www.ics.uci.edu/about/about_mission.php
http://www.ics.uci.edu/about/about_deanmsg.php
http://www.ics.uci.edu/about/bren/
http://www.ics.uci.edu/about/about_facts.php
http://www.ics.uci.edu/about/brenhall/index.php
http://www.ics.uci.edu/about/visit/index.php
http://www.ics.uci.edu/about/about_contact.php
http://www.ics.uci.edu/about/equity/
http://www.ics.uci.edu/
http://www.ics.uci.edu/about/
http://www.ics.uci.edu/about/about_mission.php
http://www.ics.uci.edu/about/about_deanmsg.php
http://www.ics.uci.edu/about/bren/
http://www.ics.uci.edu/about/about_facts.php
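
A log like the one above can be checked for revisits mechanically. The sketch below is a standalone helper, not part of crawler4j; the class name RevisitCheck and the method findRevisited are made up for illustration. It assumes the printed URLs from both runs are collected into one list, one URL per entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for inspecting a combined visit log from two runs.
final class RevisitCheck {

    /** Returns every URL that appears more than once, in order of first repeat. */
    static List<String> findRevisited(List<String> visitLog) {
        Set<String> seen = new LinkedHashSet<>();
        Set<String> revisited = new LinkedHashSet<>();
        for (String line : visitLog) {
            String url = line.trim();
            if (url.isEmpty()) {
                continue; // skip blank lines in the log
            }
            if (!seen.add(url)) {
                revisited.add(url); // add() returned false, so the URL was already seen
            }
        }
        return new ArrayList<>(revisited);
    }

    public static void main(String[] args) {
        // A shortened excerpt of the log from this report.
        List<String> log = Arrays.asList(
                "http://www.ics.uci.edu/",
                "http://www.ics.uci.edu/about/",
                "http://www.ics.uci.edu/about/about_mission.php",
                "http://www.ics.uci.edu/",
                "http://www.ics.uci.edu/about/");
        for (String url : findRevisited(log)) {
            System.out.println("revisited: " + url);
        }
    }
}
```

With a correctly resuming crawler the returned list would be expected to be empty for a log covering both runs.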

Original issue reported on code.google.com by iloves...@gmail.com on 6 Feb 2013 at 8:07

GoogleCodeExporter commented 9 years ago
This is fixed in this changelist: 
https://code.google.com/p/crawler4j/source/detail?r=3615d8f58ea3072b85007611a547cf21bccbfb18

Thanks,
Yasser

Original comment by ganjisaffar@gmail.com on 2 Mar 2013 at 10:15