tasfe / crawler4j

Automatically exported from code.google.com/p/crawler4j

Make the crawler4j repeatably usable without restarting program (remove static) #54

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I needed to quickly replace the crawler in an existing Java program (it used websphinx, which has memory leaks and is no longer being developed). I found crawler4j and thought it was great, until, in the middle of the work, I discovered that it makes extensive use of static variables and is not fully restartable. I do not mean resuming an existing crawl, but creating a CrawlController etc. again, as if starting everything from scratch. As it is now, that crashes, because some things are initialized only in static {} blocks and the like (a made-up sketch of this pattern is at the end of this comment), so there is no way to do it without restarting the whole program. I managed some quick hacks to make it restartable, though they are not pretty, and another problem remains: what if it is necessary to run several instances of the crawler in one process? That is currently impossible, because far too many things are static.

It could be said that I should keep quiet: the library is free, and if I do not like it I should write my own, better crawler. But I cannot help myself; this program is full of contrasts. On the one hand it has some very nice, well-working features; on the other hand there are things that would get a student thrown out of an exam in object-oriented programming (widespread static state, which means the crawler cannot be completely restarted within one process and cannot exist in multiple instances).

There would be other nice-to-haves (such as being able to interrupt a crawl, which I also hacked in myself), but those would just be a bonus. The more pressing issue is getting rid of the statics, i.e. solving restartability within a process and the ability to have multiple instances.

I write this just to give ideas for the future; for my purposes the crawler4j version I hacked to be restartable is enough. It would simply be nice to see this program get better.
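
To make the "static {}" problem concrete, here is a made-up sketch of the pattern (not actual crawler4j code): state created in a static block exists exactly once per JVM, so a second independent crawl in the same process silently reuses it, and there is no clean way to run the block again.

public class StaticCrawlState {
    // Hypothetical illustration only, not crawler4j source.
    private static final java.util.Set<String> visitedUrls;

    static {
        // Runs exactly once per class loader; a second "independent" crawl
        // in the same JVM sees the URLs already visited by the first crawl.
        visitedUrls = new java.util.HashSet<String>();
    }

    public static boolean markVisited(String url) {
        return visitedUrls.add(url);
    }
}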

Original issue reported on code.google.com by mirekl...@gmail.com on 24 Jun 2011 at 7:35

GoogleCodeExporter commented 9 years ago
Hi,
Thanks for your detailed comment. As I understand it, you want to have two separate 
crawls, one after the other. Is that right? I mean, you don't need both of them 
to run at the same time?

-Yasser

Original comment by ganjisaffar@gmail.com on 24 Jun 2011 at 3:15

GoogleCodeExporter commented 9 years ago
Actually, the application I was working on allowed, in its original design (with the websphinx library), the crawlers to be used completely independently. There could be any number of crawlers (each with any number of threads) crawling separate sites, then indexing them with Lucene and storing the results in different indexes. Yes, I know that in reality the number of crawls would be limited by system resources.

With websphinx I think this was possible, because every crawler was completely independent: it was started, it did its work, it finished, and nothing was left behind. Another crawl could then be started, and I believe several completely independent crawls could even run at the same time (I am not 100% sure, but I think so).

My (admittedly wrong) assumption with crawler4j was the same: I threw away websphinx, plugged in crawler4j and expected it to work the same way. It did not. After I restarted the crawl, it crashed, and I suspect running multiple crawls at the same time would also be a bad idea (though I did not try). The point is that crawler4j is not fully dynamic: there are static structures, and it is assumed that some things exist only once per Java program.

Well, I write this just as an explanation. My case was a kind of emergency, which I more or less overcame with a few hacks (such as moving some initialization out of static {} blocks into an ordinary method and calling it explicitly; a rough sketch follows below), so it is not that I am in dire need myself. I just think that fixing these things would help crawler4j to be integrated into bigger programs that want to use the crawling functionality as a submodule, without having to think about whether it can be instantiated several times or whether several instances can run concurrently.
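
Concretely, the hack was roughly of this shape (again only a sketch, not the real crawler4j classes): the static {} initialization becomes an ordinary method that the application calls before every new crawl.

public class CrawlState {
    // Sketch of the workaround, not actual crawler4j code.
    private static java.util.Set<String> visitedUrls;

    // What used to run once in a static {} block is now an explicit reset;
    // the application must call it before starting each new crawl.
    public static synchronized void init() {
        visitedUrls = new java.util.HashSet<String>();
    }

    public static synchronized boolean markVisited(String url) {
        return visitedUrls.add(url);
    }
}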

Again, as I wrote before, I hope I do not sound rude. Crawler4j is a nice program. I am just trying to point out what is, at least from my point of view (opinions can differ), the biggest problem when it is to be integrated into a bigger program, which I guess is the target anyway: crawling on its own is probably not very useful; the real usefulness comes from working with the crawled results, e.g. by indexing the crawled pages.

Original comment by mirekl...@gmail.com on 26 Jun 2011 at 7:37

GoogleCodeExporter commented 9 years ago
This is not a big deal; in fact, the problem is quite simple to solve. Add the 
following line to PageFetcher's stopConnectionMonitorThread():

connectionMonitorThread = null;

So now, the method is the following:

public synchronized static void stopConnectionMonitorThread() {
    if (connectionMonitorThread != null) {
        connectionManager.shutdown();
        connectionMonitorThread.shutdown();
        // Because this field is static, running the crawl again without
        // restarting the program still refers to the same Thread, and
        // starting the same Thread a second time is an error. So the
        // reference has to be cleared here.
        // SunshineTech@sohu.com 2011-08-19 13:00
        connectionMonitorThread = null;
    }
}

I think this is a bug. When executing several crawl jobs in sequence (creating a 
new CrawlController and starting it each time), an exception is thrown. Because 
connectionMonitorThread is static, it is always the same thread, and starting 
the same thread repeatedly (in startConnectionMonitorThread()) is an error. In 
Thread's start() method, this check throws the exception:

if (threadStatus != 0 || this != me)
    throw new IllegalThreadStateException();
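
The same failure can be reproduced in isolation with any java.lang.Thread instance (an illustrative snippet only, nothing to do with crawler4j itself):

public class RestartThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(new Runnable() {
            public void run() {
                System.out.println("monitor running");
            }
        });
        t.start();
        t.join();   // wait for the first run to terminate
        t.start();  // throws java.lang.IllegalThreadStateException
    }
}

Clearing the static reference (or, better, creating a new monitor thread for every crawl) guarantees that start() is never called twice on the same instance.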

Original comment by Sunshine...@sohu.com on 19 Aug 2011 at 7:08

GoogleCodeExporter commented 9 years ago
Nullifying the connectionMonitorThread solves the problem with the exception, 
but is it possible to make crawler4j forget the URLs it crawled in a previous 
job and re-crawl them?

What I'm aiming at is scheduling periodic crawling jobs with the same seed 
URLs, e.g. to check for website updates every week, from an application that 
runs constantly.
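
The scheduling side of what I have in mind looks roughly like this (just a sketch; runOneCrawl() is a placeholder for whatever builds and starts a fresh crawl):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class WeeklyCrawlScheduler {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Kick off a fresh, independent crawl once a week from the
        // constantly running application.
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                runOneCrawl();
            }
        }, 0, 7, TimeUnit.DAYS);
    }

    private static void runOneCrawl() {
        // Placeholder: build a brand-new crawler here (controller, fetcher,
        // the same seed URLs), run it, and tear everything down so that the
        // next scheduled run starts from a clean slate.
    }
}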

Original comment by j.bluew...@gmail.com on 9 Sep 2011 at 6:18

GoogleCodeExporter commented 9 years ago
After some investigation I realized that it was not really the crawler logic 
that prevented re-crawling, but a silent exception in PageFetcher#fetch(...) 
(HTTPClient threw an IllegalStateException: connection pool shut down).

I would strongly suggest avoiding silent exceptions (a catch block should 
*never* be empty); a WARN or ERROR level log entry, perhaps?
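
For example, something along these lines instead of an empty catch block (an illustrative sketch only, using java.util.logging; the failing fetch is simulated):

import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggingFetchExample {
    private static final Logger LOG = Logger.getLogger(LoggingFetchExample.class.getName());

    public static void fetch(String url) {
        try {
            // The real HTTP fetch would go here; this just simulates the failure.
            throw new IllegalStateException("Connection pool shut down");
        } catch (IllegalStateException e) {
            // Never swallow the exception silently; make the failure visible.
            LOG.log(Level.WARNING, "Could not fetch " + url, e);
        }
    }

    public static void main(String[] args) {
        fetch("http://www.example.com/");
    }
}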

My problem has been solved by moving control of the connection monitor thread 
outside of CrawlController. But future versions of crawler4j should probably be 
redesigned to avoid this kind of issue.

Original comment by j.bluew...@gmail.com on 9 Sep 2011 at 7:40

GoogleCodeExporter commented 9 years ago
Thanks for your comment. I'm going to redesign crawler4j to consider this and 
will try to release the next version in a month.

-Yasser

Original comment by ganjisaffar@gmail.com on 10 Sep 2011 at 1:10

GoogleCodeExporter commented 9 years ago
Yasser, thank you for all your work thus far on crawler4j.

I have a similar use case to the person who started this thread, i.e. running 
multiple, independent crawls in parallel. I started looking into refactoring 
crawler4j myself, but the problem seems to go deeper than the static code. Once 
I hit the Sleepycat DB stuff (which I haven't used before), I decided to wait 
and see how Yasser's redesign goes.

Looking forward to any updates.

Original comment by jasonj.o...@gmail.com on 15 Sep 2011 at 2:06

GoogleCodeExporter commented 9 years ago
As of version 3.0, this feature is supported. See http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/multiple/ for a sample.
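
Roughly, the pattern that sample demonstrates looks like this (a sketch, not the exact sample code; MyCrawler and the seed URLs are placeholders):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class TwoIndependentCrawls {

    // Placeholder crawler; a real one would override shouldVisit()/visit().
    public static class MyCrawler extends WebCrawler {
    }

    public static void main(String[] args) throws Exception {
        // Each crawl gets its own config, storage folder and page fetcher,
        // so the two controllers share no state.
        CrawlConfig config1 = new CrawlConfig();
        config1.setCrawlStorageFolder("/tmp/crawler1");
        CrawlConfig config2 = new CrawlConfig();
        config2.setCrawlStorageFolder("/tmp/crawler2");

        PageFetcher fetcher1 = new PageFetcher(config1);
        PageFetcher fetcher2 = new PageFetcher(config2);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();

        CrawlController controller1 =
            new CrawlController(config1, fetcher1, new RobotstxtServer(robotstxtConfig, fetcher1));
        CrawlController controller2 =
            new CrawlController(config2, fetcher2, new RobotstxtServer(robotstxtConfig, fetcher2));

        controller1.addSeed("http://www.ics.uci.edu/");
        controller2.addSeed("http://en.wikipedia.org/");

        // Start both crawls without blocking, then wait for both to finish.
        controller1.startNonBlocking(MyCrawler.class, 5);
        controller2.startNonBlocking(MyCrawler.class, 5);
        controller1.waitUntilFinish();
        controller2.waitUntilFinish();
    }
}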

-Yasser

Original comment by ganjisaffar@gmail.com on 2 Jan 2012 at 7:09