jesbin / crawler4j

Automatically exported from code.google.com/p/crawler4j

Adding seeds to crawler4j at runtime #309

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. A crawler thread is started with a few seeds, which are crawled without any issues.
2. The crawler thread then waits for new URLs to be added to the 'Frontier', and the monitor thread waits for 30 seconds before shutting everything down.
3. After, say, 10 seconds, I add a new seed by calling 'addSeed(String pageUrl, int docId)' with the URL and my own random doc ID. The WebURL is successfully added to the frontier through the call 'frontier.schedule(webUrl);'.
4. This is where the problem occurs. In the source of 'schedule(WebURL url)', the WebURL is added to the workQueues, but the WebCrawler is blocked in the 'getNextURLs' method, waiting for the monitor object to be notified. 'schedule(WebURL url)' is missing a notify call, so the crawler simply keeps waiting and eventually shuts down once the 30-second period is over.
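
The behaviour described in step 4 can be illustrated with a minimal, standalone sketch of the wait/notify pattern. This is not crawler4j's actual code: the class 'FrontierSketch', its fields, and the 'getNextUrls' signature with a timeout are hypothetical, and plain strings stand in for WebURL objects. It only shows that the scheduling method has to notify the monitor the consumer is waiting on.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for a frontier; not the crawler4j implementation.
public class FrontierSketch {
    // Monitor object that waiting crawler threads block on.
    private final Object waitingList = new Object();
    private final Queue<String> workQueue = new ArrayDeque<>();

    // Adds a URL and wakes any thread blocked in getNextUrls.
    // Without the notifyAll() call, a waiting consumer would only
    // notice the new URL after its wait timed out.
    public void schedule(String url) {
        synchronized (waitingList) {
            workQueue.add(url);
            waitingList.notifyAll();
        }
    }

    // Blocks until at least one URL is queued or the timeout elapses.
    // Returns up to max URLs, or an empty list on timeout.
    public List<String> getNextUrls(int max, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (waitingList) {
            while (workQueue.isEmpty()) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return new ArrayList<>();
                }
                waitingList.wait(remaining);
            }
            List<String> batch = new ArrayList<>();
            while (!workQueue.isEmpty() && batch.size() < max) {
                batch.add(workQueue.poll());
            }
            return batch;
        }
    }

    public static void main(String[] args) throws Exception {
        FrontierSketch frontier = new FrontierSketch();
        // Simulates a seed being added while the crawler is already waiting.
        Thread producer = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            frontier.schedule("http://example.com/late-seed");
        });
        producer.start();
        List<String> urls = frontier.getNextUrls(10, 5000);
        producer.join();
        System.out.println(urls);
    }
}
```

In this sketch the consumer returns shortly after the late seed is scheduled rather than waiting out the full timeout. If the 'notifyAll()' line is removed, the consumer stays blocked in 'wait' until the timeout expires even though the queue is no longer empty, which matches the symptom reported here.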
What is the expected output? What do you see instead?
-

What version of the product are you using?
crawler4j-3.5

Please provide any additional information below.
-

Original issue reported on code.google.com by bassim.b...@googlemail.com on 18 Sep 2014 at 8:51