scrapinghub / frontera

A scalable frontier for web crawlers

seed urls not loading #363

Open arun477 opened 5 years ago

arun477 commented 5 years ago

I'm trying the Frontera general crawler example, but it's not picking up the URLs from the seeds file. I'm getting the following output in the terminal:

2019-02-02 18:47:39 [manager] DEBUG: GET_NEXT_REQUESTS(out) returned_requests=0
2019-02-02 18:47:49 [manager] DEBUG: GET_NEXT_REQUESTS(in) max_next_requests=256
2019-02-02 18:47:49 [overusedbuffer] DEBUG: Overused keys: []
2019-02-02 18:47:49 [overusedbuffer] DEBUG: Pending: 0
arun477 commented 5 years ago

I used the following command to add the seed URLs:

python3 -m frontera.utils.add_seeds --config logging --seeds-file ./seeds_es_smp.txt

Output:

2019-02-02 18:45:46,219 INFO __main__ Starting local seeds addition from file ./seeds_es_smp.txt
2019-02-02 18:45:46,219 INFO manager --------------------------------------------------------------------------------
2019-02-02 18:45:46,219 INFO manager Starting Frontier Manager...
2019-02-02 18:45:46,222 INFO manager Frontier Manager Started!
2019-02-02 18:45:46,222 INFO manager --------------------------------------------------------------------------------
2019-02-02 18:45:46,232 INFO states-context Flushing states
2019-02-02 18:45:46,232 INFO states-context Flushing of states finished
2019-02-02 18:45:46,232 INFO states-context Flushing states
2019-02-02 18:45:46,232 INFO states-context Flushing of states finished
2019-02-02 18:45:46,232 INFO __main__ Seeds addition finished

sibiryakov commented 5 years ago

Hi, which backend have you used @coolarun?

arun477 commented 5 years ago

Hi sibiryakov, I tried it with the in-memory db. The Frontera documentation for integrating with Scrapy is good, but adding seed URLs directly through code instead of the command line isn't clearly described anywhere in the docs. Pointing me to some resources would be helpful. Thanks. (NOTE: Python 3, Scrapy, Frontera, load URLs from a text file and crawl.)

sibiryakov commented 5 years ago

This will not work with the memory backend, because there is nowhere to persist the seeds, queue, etc. See https://frontera.readthedocs.io/en/latest/topics/quick-start-single.html#inject-the-seed-urls. Also check the distributed quick start if you're doing a distributed setup.
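For the single-process case, a persistent config could look roughly like this. It is only a sketch: the backend path, engine URI and setting names are assumptions based on the quick-start page linked above and Frontera's default settings, so double-check them against your installed version.

```python
# settings.py -- sketch of a persistent single-process Frontera config
# (names assumed from the docs; verify against your Frontera version)
BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'  # assumed SQLAlchemy backend path
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontera.db'            # on-disk storage instead of :memory:
STRATEGY = 'frontera.strategy.basic.BasicCrawlingStrategy'    # crawling strategy to run
```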

arun477 commented 5 years ago

Thanks. Is there any way to add seed URLs through code instead of the command line?

sibiryakov commented 5 years ago

By using the crawling strategy. There is a whole guide about it: https://frontera.readthedocs.io/en/latest/topics/custom_crawling_strategy.html. The idea is that your crawling strategy contains the logic for adding the seeds. If you describe what problem you're trying to solve with Frontera, I could suggest something more specific.
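For illustration, a minimal strategy that schedules the seeds itself might look like this. It is a sketch based on that guide: the method names follow the frontera.strategy.BaseCrawlingStrategy API in 0.8, and the class name and scores are my own placeholders, so verify everything against your version.

```python
# sketch of a custom crawling strategy that adds the seeds itself
from frontera.strategy import BaseCrawlingStrategy


class MySeedStrategy(BaseCrawlingStrategy):
    def read_seeds(self, stream):
        # `stream` is the file-like object handed over by
        # `python -m frontera.utils.add_seeds --seeds-file ...`
        for line in stream:
            url = line.strip()
            if url:
                self.schedule(self.create_request(url), score=1.0)

    def filter_extracted_links(self, request, links):
        # keep everything; this is where unwanted URLs would be dropped
        return links

    def links_extracted(self, request, links):
        for link in links:
            # a real strategy would also check link state to avoid re-queueing
            self.schedule(link, score=0.5)

    def page_crawled(self, response):
        pass

    def request_error(self, request, error):
        pass
```

You would then point the STRATEGY setting at that class, e.g. STRATEGY = 'myproject.strategy.MySeedStrategy' (module path here is hypothetical).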

arun477 commented 5 years ago

Thanks for the resource link, sibiryakov. My use case: I have more than 10k seed URLs and I want to crawl them all using a breadth-first strategy. The problem I'm facing with plain Scrapy is that each site I try to crawl is huge, so it looks like it will never finish. To get around this, I want to eliminate all unwanted URLs during the crawl itself and order the URLs that need to be crawled first, which means doing a lot of classification and filtering while crawling.
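A crawling strategy along the lines sibiryakov suggests could fold that filtering and ordering in. A rough sketch follows; the allowed-domain set, extension filter and path-depth scoring heuristic are illustrative assumptions, not Frontera defaults, and the API names should again be checked against your version.

```python
# sketch: drop unwanted links and approximate breadth-first ordering by
# scoring shallower URLs higher (heuristic assumption, not a Frontera default)
from urllib.parse import urlparse

from frontera.strategy import BaseCrawlingStrategy

ALLOWED_DOMAINS = {'example.com'}           # hypothetical allow-list
SKIP_EXTENSIONS = ('.jpg', '.png', '.pdf')  # hypothetical "unwanted URL" rule


class FilteredBFSStrategy(BaseCrawlingStrategy):
    def read_seeds(self, stream):
        for line in stream:
            url = line.strip()
            if url:
                self.schedule(self.create_request(url), score=1.0)

    def filter_extracted_links(self, request, links):
        # classification/filtering happens here, before anything is queued
        keep = []
        for link in links:
            parts = urlparse(link.url)
            if parts.netloc not in ALLOWED_DOMAINS:
                continue
            if link.url.lower().endswith(SKIP_EXTENSIONS):
                continue
            keep.append(link)
        return keep

    def links_extracted(self, request, links):
        for link in links:
            # shallower paths get higher scores, so they tend to be fetched first
            depth = len([p for p in urlparse(link.url).path.split('/') if p])
            self.schedule(link, score=1.0 / (depth + 1))

    def page_crawled(self, response):
        pass

    def request_error(self, request, error):
        pass
```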