Also, is it possible to resume crawling with a different depth or thread count?
Original comment by utkuson...@yahoo.com on 21 May 2011 at 10:22
Yes, if you check out the latest version from SVN, you can enable resuming in the crawler4j.properties file and then resume crawling with different seeds, depth, threads, and so on.
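For illustration, the relevant entry might look something like the following; the exact key name is an assumption on my part, so verify it against the crawler4j.properties in your checkout:
# crawler4j.properties -- hypothetical key name, check your checkout for the real one
resumable.crawling = true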
-Yasser
Original comment by ganjisaffar@gmail.com on 21 May 2011 at 4:07
Hi,
Is there any example of how to stop the crawler, modify the seeds, and start it again at runtime? Do I have to use another controller?
I already checked out r21 from Subversion and enabled the resume feature in crawler4j.properties. I can start the crawler with a seed, but if I try to add a new seed to the controller, it throws an exception. I want to remove the old seed and add a new one.
Thanks.
Original comment by asiem...@gmail.com on 23 May 2011 at 6:28
I have to add that I too cannot get the resume feature to work. Here is what I get when I attempt to start from a new root:
java.lang.IllegalThreadStateException
at java.lang.Thread.start(Unknown Source)
at edu.uci.ics.crawler4j.crawler.PageFetcher.startConnectionMonitorThread(PageFetcher.java:124)
at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:82)
at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:56)
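For what it's worth, the exception itself is standard Java behavior: a Thread object may only be started once, and the trace suggests PageFetcher tries to restart its connection monitor thread when a second CrawlController is constructed. A minimal sketch of just the Thread behavior (not crawler4j code):
// Demonstrates that calling start() twice on the same Thread throws.
public class RestartDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread monitor = new Thread(new Runnable() {
            public void run() { /* pretend to monitor connections */ }
        });
        monitor.start();  // first start: fine
        monitor.join();   // wait for it to finish
        monitor.start();  // throws java.lang.IllegalThreadStateException
    }
}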
Original comment by grjas...@gmail.com on 12 Jun 2011 at 4:08
I ran into the same problem, too.
java.lang.IllegalThreadStateException
at java.lang.Thread.start(Thread.java:595)
at edu.uci.ics.crawler4j.crawler.PageFetcher.startConnectionMonitorThread(PageFetcher.java:124)
at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:82)
at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:56)
Original comment by pugdo...@gmail.com on 17 Jul 2011 at 7:22
Is there any solution for this yet? Is it possible to run multiple crawlers with different seeds and configurations? Do I need to define separate controllers?
Original comment by sham...@gmail.com on 20 Sep 2011 at 6:51
In the current version, you can have only one instance of the crawler in a single process. However, you can launch a separate process for each crawl.
-Yasser
Original comment by ganjisaffar@gmail.com on 22 Sep 2011 at 4:17
You can use controller.start(MyCrawler.class, 1); for a single-threaded crawler.
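For context, a minimal single-threaded setup with the pre-3.0 API might look roughly like this; the storage folder and seed URL are placeholders, and MyCrawler is assumed to be your own subclass of WebCrawler:
import edu.uci.ics.crawler4j.crawler.CrawlController;

public class SingleThreadExample {
    public static void main(String[] args) throws Exception {
        // Folder where crawler4j keeps its intermediate data (placeholder path).
        CrawlController controller = new CrawlController("/data/crawl/root");
        controller.addSeed("http://www.example.com/");
        // The second argument is the number of concurrent crawler threads.
        controller.start(MyCrawler.class, 1);
    }
}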
Original comment by mishra....@gmail.com on 24 Sep 2011 at 6:00
@Yasser, will it be possible to start more than one crawler per process in the future?
That would be very helpful :-)
Original comment by seoprogr...@googlemail.com on 27 Sep 2011 at 7:52
I created a shell script that reads a list of domains to crawl (from the file crawling_domains.txt) and starts a new Java process for each:
#!/bin/bash
# path variable (defaults to /opt/openseodata/ if no argument is given)
if [ -n "$1" ]
then path="$1"
else path="/opt/openseodata/"
fi
inputDatei="../../crawling_domains.txt"
# DomainCrawlerJob
cd "${path}jobs/DomainCrawlerJob/"
rm -f nohup.out
# start one process per domain
while read a; do
java -Xmx2048m -Dcrawler.domains="$a" -DanzahlThreads=20 -jar ../openseodata-jobs-jar-with-dependencies.jar DomainCrawlerJob > nohup.out
done < "$inputDatei"
Original comment by seoprogr...@googlemail.com on 27 Sep 2011 at 11:13
Yes, I have added this feature (multiple crawlers in the same process) to my top-priority list for the next release.
-Yasser
Original comment by ganjisaffar@gmail.com on 28 Sep 2011 at 1:56
As of version 3.0, this feature is fully supported. You can enable resumable crawling, add seeds during crawling, add seeds on a second run, and so on.
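For anyone landing here later, a rough sketch of the 3.0-style setup; the paths and seed URL are placeholders, and MyCrawler is assumed to extend WebCrawler:
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class ResumableExample {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/data/crawl/root"); // placeholder path
        config.setResumableCrawling(true);  // persist the frontier so a later run resumes
        config.setMaxDepthOfCrawling(5);    // depth can differ between runs

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.example.com/"); // seeds can also be added on a resumed run
        controller.start(MyCrawler.class, 8);
    }
}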
-Yasser
Original comment by ganjisaffar@gmail.com on 2 Jan 2012 at 3:59
Original issue reported on code.google.com by mishra....@gmail.com on 16 Dec 2010 at 8:38