Hi,
Thanks for your detailed comment. As I understand it, you want to have two separate crawls, one after the other. Is that right? I mean, you don't need both of them to run at the same time?
-Yasser
Original comment by ganjisaffar@gmail.com
on 24 Jun 2011 at 3:15
Actually, the application I worked on was originally designed (with the websphinx library) to use crawlers completely independently. There could be any number of crawlers (each with any number of threads) crawling separate sites, then indexing them with Lucene and storing the results in different indexes... Yes, I know that in practice the number of crawls would be limited by system resources...
With websphinx this was possible - every crawler was completely independent. It was started, it ran, it ended, and nothing was left behind; another crawl could then be started. I believe several completely independent crawls could even run at the same time (I am not 100% sure, but I think so).
My (apparently wrong) assumption with crawler4j was the same - I threw away websphinx, plugged in crawler4j and expected it to work the same way... It did not...
After I restarted the crawl, it crashed... I guess running multiple crawls at the same time would also be a bad idea (though I did not try)...
The thing is that crawler4j is not fully dynamic... There are static structures, and some things are expected to exist only once per Java program...
Well, I write this just as an explanation... My case was a kind of emergency that I worked around with a few hacks (like moving some initialization from a static block into a function and calling it - see the sketch below), so I am not in dire need myself... I just think that fixing these things would help crawler4j to be integrated into bigger programs that want to use the crawling functionality as a submodule, without having to think about whether it can be instantiated more than once or whether multiple instances can run concurrently.
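A minimal sketch of that kind of workaround (hypothetical names, not actual crawler4j source): move the one-time static setup into an explicit init/reset method that the host application can call before every crawl.
// Hypothetical sketch, not actual crawler4j source: shared state is set up by
// an explicit method instead of a static initializer, so a host application
// can re-initialize it before each crawl instead of getting it exactly once.
public class FetcherState {
    private static Thread monitorThread;

    public static synchronized void init() {
        monitorThread = new Thread(() -> { /* monitor connections */ });
        monitorThread.start();
    }

    public static synchronized void shutdown() {
        if (monitorThread != null) {
            monitorThread.interrupt();
            monitorThread = null;   // forget the old Thread so init() can be called again
        }
    }
}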
Again, as I wrote before - I hope I do not sound rude... Crawler4j is a nice program... I am just trying to point out what is (at least from my point of view; opinions can differ) the biggest obstacle to integrating it into a bigger program (which I guess is the goal - crawling on its own is probably not very useful; the real value comes from using the crawled results, e.g. by indexing the crawled pages).
Original comment by mirekl...@gmail.com
on 26 Jun 2011 at 7:37
This does matter. In fact, the problem is simple to solve: add the following line to PageFetcher's stopConnectionMonitorThread().
connectionMonitorThread = null;
The method then becomes:
public synchronized static void stopConnectionMonitorThread() {
    if (connectionMonitorThread != null) {
        connectionManager.shutdown();
        connectionMonitorThread.shutdown();
        // Because this field is static, repeated runs without restarting the
        // program keep pointing at the same Thread object, and starting the
        // same Thread again is an error. The reference therefore has to be cleared.
        // SunshineTech@sohu.com 2011-08-19 13:00
        connectionMonitorThread = null;
    }
}
I think it is a bug. When you execute several crawl jobs in sequence - creating a new CrawlController and starting it each time - an exception is thrown. Because connectionMonitorThread is static, it is always the same thread, and starting the same thread repeatedly (in startConnectionMonitorThread()) is an error. The Thread class's start() method throws in that case:
if (threadStatus != 0 || this != me)
    throw new IllegalThreadStateException();
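A standalone demonstration of the same failure, using only the JDK (no crawler4j involved): a Thread object that has already run cannot be started again.
// Minimal JDK-only demonstration: the second start() on the same Thread
// object throws java.lang.IllegalThreadStateException, which is exactly
// what happens when a static thread reference is reused across crawls.
public class RestartThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> System.out.println("first crawl done"));
        t.start();
        t.join();
        t.start();   // throws IllegalThreadStateException
    }
}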
Original comment by Sunshine...@sohu.com
on 19 Aug 2011 at 7:08
Nullifying the connectionMonitorThread solves the problem with the exception, but is it possible to make crawler4j forget the URLs it has crawled in a previous job and re-crawl them?
What I'm aiming at is scheduling periodic crawling jobs with the same seed URLs, e.g. to check for website updates every week, from an application that runs constantly.
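One way to drive such a recurring job, sketched with plain JDK scheduling (runCrawl() is a placeholder, not crawler4j API; it should build a completely fresh crawl setup on every run):
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class WeeklyCrawlScheduler {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Run immediately, then once every 7 days, inside a long-running application.
        scheduler.scheduleAtFixedRate(WeeklyCrawlScheduler::runCrawl, 0, 7, TimeUnit.DAYS);
    }

    private static void runCrawl() {
        // Placeholder: create a brand-new controller/fetcher here with the same
        // seed URLs, so nothing (visited URLs, threads) survives from the last run.
    }
}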
Original comment by j.bluew...@gmail.com
on 9 Sep 2011 at 6:18
After some investigation I realized that it was not really the crawler logic that prevented re-crawling, but a silent exception in PageFetcher#fetch(...) (HttpClient threw IllegalStateException: connection pool shut down).
I would strongly suggest avoiding silent exceptions (a catch block should *never* be empty) - a WARN or ERROR level log entry, perhaps?
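For example, something along these lines instead of an empty catch block (JDK logging is used purely for illustration; the project's own logging framework would serve the same purpose):
import java.util.logging.Level;
import java.util.logging.Logger;

public class FetchLoggingExample {
    private static final Logger LOGGER = Logger.getLogger(FetchLoggingExample.class.getName());

    void fetch(String url) {
        try {
            // ... perform the HTTP request ...
        } catch (IllegalStateException e) {
            // Do not swallow this: a shut-down connection pool should show up
            // in the logs rather than fail silently.
            LOGGER.log(Level.WARNING, "Fetch of " + url + " failed", e);
        }
    }
}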
My problem has been solved by moving the connection monitor thread control outside of CrawlController, but future versions of crawler4j should probably be redesigned to avoid this kind of issue.
Original comment by j.bluew...@gmail.com
on 9 Sep 2011 at 7:40
Thanks for your comment. I'm going to redesign crawler4j to consider this and
will try to release the next version in a month.
-Yasser
Original comment by ganjisaffar@gmail.com
on 10 Sep 2011 at 1:10
Yasser, thank you for all your work thus far on crawler4j.
I have a similar use case to the person who started this thread, i.e. running multiple, independent crawls in parallel. I started looking into refactoring crawler4j, but it looks like the problem goes deeper than the static code. Once I hit the Sleepycat DB (Berkeley DB) stuff, which I haven't used before, I decided to wait and see how Yasser's redesign goes.
Looking forward to any updates.
Original comment by jasonj.o...@gmail.com
on 15 Sep 2011 at 2:06
As of version 3.0, this feature is supported. See
http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/multiple/
for a sample.
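For reference, the wiring looks roughly like this (class and method names follow the 3.x API as I understand it; MyCrawler stands for any WebCrawler subclass defined elsewhere, and the seeds and folders are only examples - see the linked sample for the authoritative version):
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class TwoIndependentCrawls {
    public static void main(String[] args) throws Exception {
        CrawlController controller1 = buildController("/data/crawl1");
        CrawlController controller2 = buildController("/data/crawl2");

        controller1.addSeed("http://www.ics.uci.edu/");
        controller2.addSeed("http://en.wikipedia.org/");

        // MyCrawler is a placeholder for a WebCrawler subclass defined elsewhere.
        // Start both crawls without blocking, then wait for each to finish.
        controller1.startNonBlocking(MyCrawler.class, 5);
        controller2.startNonBlocking(MyCrawler.class, 5);
        controller1.waitUntilFinish();
        controller2.waitUntilFinish();
    }

    // Each crawl gets its own storage folder, fetcher and robots.txt server,
    // so no state is shared between the two controllers.
    private static CrawlController buildController(String storageFolder) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(storageFolder);
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        return new CrawlController(config, pageFetcher, robotstxtServer);
    }
}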
-Yasser
Original comment by ganjisaffar@gmail.com
on 2 Jan 2012 at 7:09
Original issue reported on code.google.com by
mirekl...@gmail.com
on 24 Jun 2011 at 7:35