scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

Keyword BACKEND Meaning Inconsistent Between Spider and Workers #318

Open grammy-jiang opened 6 years ago

grammy-jiang commented 6 years ago

Hi, there,

I am working on Frontera these days, and Frontera is a great tool for cluster crawling!

But some things are still hard to figure out because of the lack of documentation. After reading and trying the settings mentioned in the Cluster setup guide (Frontera 0.7.1 documentation), I noticed that the keyword BACKEND means different things for the spider and for the workers.

I do not understand the purpose of this design: the inconsistent meaning could easily mislead users who set this keyword for both spiders and workers.

Could anyone tell me the reason for this design? Or is it just a mistake?

sibiryakov commented 6 years ago

Hi @grammy-jiang, that's quite an interesting finding. The thing is, Frontera tries to be both a distributed and a non-distributed crawl frontier framework. The backend became the place in the internal architecture that makes this possible: MessageBusBackend effectively moves the storage backend into another process.

You can find more information here: http://frontera.readthedocs.io/en/latest/topics/architecture.html#single-process
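As a rough illustration of that split, here is a toy sketch in plain Python. It does not use Frontera's real classes or signatures: `ToyMessageBus`, `ToyStorageBackend`, `ToyMessageBusBackend`, and `run_worker_once` are all hypothetical names made up for this example. The only point is that the spider-side "backend" exposes the same interface as the worker-side one but stores nothing and just forwards over a bus.

```python
from collections import deque


class ToyMessageBus:
    """Hypothetical in-memory stand-in for a real message bus."""

    def __init__(self):
        self.spider_log = deque()   # spiders -> workers: discovered links
        self.spider_feed = deque()  # workers -> spiders: URLs to crawl next


class ToyStorageBackend:
    """Hypothetical worker-side backend: it really stores and schedules URLs."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def add_links(self, urls):
        # De-duplicate before queueing, as a storage backend typically would.
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.queue.append(url)

    def get_next_requests(self, max_n):
        n = min(max_n, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]


class ToyMessageBusBackend:
    """Hypothetical spider-side backend: same interface, no storage at all."""

    def __init__(self, bus):
        self.bus = bus

    def add_links(self, urls):
        # Nothing is stored here; the links are only published to the bus.
        self.bus.spider_log.append(list(urls))

    def get_next_requests(self, max_n):
        # Requests come from whatever the worker has already published.
        batch = []
        while self.bus.spider_feed and len(batch) < max_n:
            batch.append(self.bus.spider_feed.popleft())
        return batch


def run_worker_once(bus, storage, batch_size):
    """Hypothetical worker step: consume the log, then refill the feed."""
    while bus.spider_log:
        storage.add_links(bus.spider_log.popleft())
    bus.spider_feed.extend(storage.get_next_requests(batch_size))


bus = ToyMessageBus()
spider_backend = ToyMessageBusBackend(bus)   # what the spider process would use
worker_backend = ToyStorageBackend()         # what the worker process would use

spider_backend.add_links(["http://a.example/", "http://b.example/", "http://a.example/"])
run_worker_once(bus, worker_backend, batch_size=10)
next_urls = spider_backend.get_next_requests(max_n=10)
```

In this toy model, "backend" names two different roles depending on the process: a thin proxy on the spider side and the actual storage on the worker side, which is the same kind of inconsistency the issue describes.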

The second reason is historical: Frontera started as a non-distributed framework, and that left some architectural artefacts.

I agree this is misleading. You are welcome to propose your own way of organising these components to make them easier to understand and use.

grammy-jiang commented 6 years ago

@sibiryakov Thanks for your reply!

Hmm, I only use Frontera in cluster mode and did not read the other parts of the documentation carefully. Frontera is a fantastic framework for cluster crawling, but its documentation is not as clear as Scrapy's.

I am a heavy Scrapy user and have written some useful middlewares (both spider and downloader, with unit test cases), most of which are published on my GitHub page. I would like to contribute this code back to the community, but I do not know how to do it. Would you please review my code and mentor me on how to contribute?