mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

How to stop the crawler and then restart it with different seeds #22

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago

Hi,
Is there a method that can be used to stop the crawler and then restart it?
Can new seeds be provided to the crawler at run time?

Original issue reported on code.google.com by mishra....@gmail.com on 16 Dec 2010 at 8:38

GoogleCodeExporter commented 9 years ago
Also, is it possible to resume crawling with a different depth or thread count?

Original comment by utkuson...@yahoo.com on 21 May 2011 at 10:22

GoogleCodeExporter commented 9 years ago
Yes, if you check out the latest version from SVN, you can enable resuming in the 
crawler4j.properties file and then resume crawling with different seeds, depth, 
threads, ...

-Yasser

Original comment by ganjisaffar@gmail.com on 21 May 2011 at 4:07
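
For later readers: the exact property keys in that era's crawler4j.properties are not shown in this thread, so here is a minimal sketch of the equivalent setup using the later 3.0-style Java API instead (class and method names as in crawler4j 3.x; MyCrawler is a placeholder WebCrawler subclass, and the storage folder path is illustrative):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class ResumableCrawl {
        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawler4j");  // frontier state is persisted here
            config.setResumableCrawling(true);               // keep the frontier so a later run can resume it
            config.setMaxDepthOfCrawling(3);                 // depth (like thread count) may differ between runs

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed("http://www.example.com/");   // seeds can differ from run to run
            controller.start(MyCrawler.class, 10);           // blocks until the crawl finishes
        }
    }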

GoogleCodeExporter commented 9 years ago
Hi,

Is there an example of how I can stop the crawler, modify the seeds, and start 
the crawler again at runtime? Do I have to use another controller?

I already checked out r21 from Subversion and enabled the resume feature in 
crawler4j.properties. I can start the crawler with a seed, but if I try to add 
a new seed to the controller, it throws an exception. I want to remove the old 
seed and add a new one.

Thanks.

Original comment by asiem...@gmail.com on 23 May 2011 at 6:28

GoogleCodeExporter commented 9 years ago
I have to add that I too cannot get the resume feature to work. Here is what 
I get when I attempt to start from a new root:

java.lang.IllegalThreadStateException
    at java.lang.Thread.start(Unknown Source)
    at edu.uci.ics.crawler4j.crawler.PageFetcher.startConnectionMonitorThread(PageFetcher.java:124)
    at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:82)
    at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:56)

Original comment by grjas...@gmail.com on 12 Jun 2011 at 4:08

GoogleCodeExporter commented 9 years ago
I met the same problem, too.
java.lang.IllegalThreadStateException
    at java.lang.Thread.start(Thread.java:595)
    at edu.uci.ics.crawler4j.crawler.PageFetcher.startConnectionMonitorThread(PageFetcher.java:124)
    at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:82)
    at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:56)

Original comment by pugdo...@gmail.com on 17 Jul 2011 at 7:22
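
The exception itself is plain java.lang.Thread behaviour: start() may be called at most once per Thread object, and a second call throws IllegalThreadStateException. The stack trace suggests the connection monitor thread is shared state that a second CrawlController in the same process tries to start again. A crawler4j-independent sketch of the underlying Java rule:

    public class RestartThreadDemo {
        public static void main(String[] args) throws InterruptedException {
            Thread t = new Thread(() -> System.out.println("running"));
            t.start();
            t.join();
            t.start();  // throws java.lang.IllegalThreadStateException (a Thread may only be started once)
        }
    }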

GoogleCodeExporter commented 9 years ago
Any solution for this? Is it possible to run multiple crawlers with 
different seeds and configurations? Do I need to define separate controllers?

Original comment by sham...@gmail.com on 20 Sep 2011 at 6:51

GoogleCodeExporter commented 9 years ago
In the current version, you can have only one instance of the crawler in a 
single process. Still, you can initiate different processes for different 
crawls. 

-Yasser

Original comment by ganjisaffar@gmail.com on 22 Sep 2011 at 4:17

GoogleCodeExporter commented 9 years ago
You can use controller.start(MyCrawler.class, 1); for a single-threaded crawler.

Original comment by mishra....@gmail.com on 24 Sep 2011 at 6:00

GoogleCodeExporter commented 9 years ago
@Yasser, will it be possible to start more than one crawler per process in the 
future?
That would be very helpful :-)

Original comment by seoprogr...@googlemail.com on 27 Sep 2011 at 7:52

GoogleCodeExporter commented 9 years ago
I created a shell script that reads a list of domains to crawl (from the file 
crawling_domains.txt) and starts a new Java process for each:

#!/bin/bash

# installation path (first argument, with a default)
if [ -n "$1" ]
  then path="$1"
  else path="/opt/openseodata/"
fi
inputDatei="../../crawling_domains.txt"

# DomainCrawlerJob
cd "${path}jobs/DomainCrawlerJob/" || exit 1
rm -f nohup.out

# start one Java process per domain
while read -r a; do
  java -Xmx2048m -Dcrawler.domains="$a" -DanzahlThreads=20 -jar \
    ../openseodata-jobs-jar-with-dependencies.jar DomainCrawlerJob > nohup.out
done < "$inputDatei"

Original comment by seoprogr...@googlemail.com on 27 Sep 2011 at 11:13

GoogleCodeExporter commented 9 years ago
Yes, I have added this feature (multiple crawlers in the same process) to my 
top-priority list for the next release.

-Yasser

Original comment by ganjisaffar@gmail.com on 28 Sep 2011 at 1:56

GoogleCodeExporter commented 9 years ago
As of version 3.0, this feature is fully supported. You can enable resumable 
crawling, add seeds during crawling, add seeds on the second run, ...

-Yasser

Original comment by ganjisaffar@gmail.com on 2 Jan 2012 at 3:59
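
Concretely, a second run in 3.0+ can point at the same crawl storage folder, keep resumable crawling enabled, and register new seeds; a minimal sketch (same imports and MyCrawler placeholder as in the earlier sketch, folder path and URL are illustrative):

    // Second run against the same storage folder: URLs left in the frontier by the
    // first run are resumed, and the newly added seed is crawled as well.
    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder("/tmp/crawler4j");   // same folder as the first run
    config.setResumableCrawling(true);

    PageFetcher pageFetcher = new PageFetcher(config);
    RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
    CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

    controller.addSeed("http://www.example.org/");    // a seed that was not part of the first run
    controller.start(MyCrawler.class, 4);             // thread count may also change between runs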