hellock / icrawler

A multi-thread crawler framework with many builtin image crawlers provided.
http://icrawler.readthedocs.io/en/latest/
MIT License

Infinite loop even if work finished #26

Closed gajewsk2 closed 7 years ago

gajewsk2 commented 7 years ago

crawler.py

        while True:
            if threading.active_count() <= 1:
                break

The crawler never stops if more than one thread was already running, e.g. if you are running this on a web server it will never end. I've simply disabled this line to get things working.
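A minimal illustration of the problem being reported (not icrawler code; the thread name is made up): on a web server there are long-lived threads besides the main thread, so `threading.active_count()` starts above 1 and the exit condition `threading.active_count() <= 1` can never become true.

```python
import threading
import time

def background_worker():
    # Simulates an unrelated long-lived server thread (e.g. one owned
    # by the web framework hosting the crawler).
    time.sleep(5)

server_thread = threading.Thread(target=background_worker, daemon=True)
server_thread.start()

# Even before any crawler thread is launched, the count is already
# at least 2 (the main thread plus the server thread), so the loop
# `while True: if threading.active_count() <= 1: break` never exits.
print(threading.active_count() >= 2)  # True
```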

hellock commented 7 years ago

Hi @gajewsk2 , I've also tested icrawler on web servers and it ends as expected. There may be other reasons for not exiting. Usually there will be only one thread alive after all tasks are finished and the parent thread will exit.

gajewsk2 commented 7 years ago

Since it relies on the threading library, which I believe acts as a singleton, it reports 3 threads running before I even start the crawler in my Django app. Basically `threading` is a global state, and my environment isn't letting my server terminate. Is there a reason to keep this check once the work is done and the threads the crawler launched have been reaped?

hellock commented 7 years ago

It may be better to use the exit of the downloader as the condition to terminate all crawling threads. Checking the global thread count is indeed unnecessary.
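A sketch of that suggestion, under the assumption that the crawler keeps references to the worker threads it spawns (the `downloader` function, queue, and sentinel protocol here are illustrative, not icrawler's actual API): join only your own workers instead of polling `threading.active_count()`, so threads owned by the hosting server cannot block shutdown.

```python
import threading
import queue

task_queue = queue.Queue()

def downloader(q):
    # Hypothetical worker: process tasks until a None sentinel
    # signals that all work is done.
    while True:
        task = q.get()
        if task is None:
            break
        # ... download `task` here ...

# Track the threads this crawler created itself.
workers = [threading.Thread(target=downloader, args=(task_queue,))
           for _ in range(2)]
for w in workers:
    w.start()

for task in ["img1.jpg", "img2.jpg"]:
    task_queue.put(task)

# Signal shutdown and wait only for our own workers; any other
# threads in the process are irrelevant to the exit condition.
for _ in workers:
    task_queue.put(None)
for w in workers:
    w.join()
```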

danakianfar commented 7 years ago

@hellock, I have this issue when using GreedyImageCrawler. The loop never terminates and keeps logging.

2017-08-11 00:52:45,864 - INFO - downloader - downloader-001 is waiting for new download tasks
2017-08-11 00:52:46,624 - INFO - parser - parser-001 is waiting for new page urls
2017-08-11 00:52:48,625 - INFO - parser - parser-001 is waiting for new page urls
2017-08-11 00:52:50,625 - INFO - parser - parser-001 is waiting for new page urls
2017-08-11 00:52:50,865 - INFO - downloader - downloader-001 is waiting for new download tasks