scalingexcellence / scrapybook

Scrapy Book Code
http://scrapybook.com/

Question: Self hosted Scrapinghub? #9

Closed inkrement closed 8 years ago

inkrement commented 8 years ago

I really like your book, but I have a question: is there any way to self-host Scrapinghub? A lot of people have their own infrastructure, so it would be nice to use that. Could you recommend a free and open-source Scrapy management tool with a web UI?

yssoe commented 8 years ago

Hi, just run it on Amazon AWS or any other cloud service.

Cheers,

inkrement commented 8 years ago

You got me wrong. I am already using my own VPS, but I don't want to use the terminal all the time to set up virtualenvs, crontabs, etc. That gets quite messy, especially if you have to install and manage a lot of scrapers. So I am looking for a nice GUI to install, manage, configure and monitor my scrapers. A self-hosted Scrapinghub would be perfect, but I was not able to find such a tool.

yssoe commented 8 years ago

Hi, did you try scrapyd?

It comes with a web interface:

https://scrapyd.readthedocs.io/en/latest/overview.html#web-interface

I run a fair number of spiders, and I scripted their deployment with Ansible; I only need to run one command and it's done.
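(Once scrapyd is running, spiders are deployed and then scheduled through its JSON API; a minimal sketch of building such a request with Python's standard library is below. The host, project and spider names are hypothetical placeholders, not from this thread.)

```python
from urllib.parse import urlencode
from urllib.request import Request

def schedule_request(base_url: str, project: str, spider: str, **spider_args) -> Request:
    """Build a POST request for scrapyd's schedule.json endpoint."""
    body = urlencode({"project": project, "spider": spider, **spider_args})
    return Request(f"{base_url}/schedule.json", data=body.encode(), method="POST")

# Hypothetical project/spider names for illustration; send with urllib.request.urlopen(req)
req = schedule_request("http://localhost:6800", "mybot", "properties")
print(req.full_url)  # http://localhost:6800/schedule.json
print(req.data)      # b'project=mybot&spider=properties'
```

The same call is what `curl http://localhost:6800/schedule.json -d project=... -d spider=...` does, so it scripts easily in Ansible as well.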

cheers

lookfwd commented 8 years ago

@inkrement - thank you so much! I'm so glad you like the book :)

One thing I would recommend is talking directly to @pablohoffman. Scrapinghub might be able to provide you with a licence, code or just the right direction to have exactly the system you need.

> install, manage, configure and monitor my scrapers

Everything on that list except monitoring is actually very close to what scrapyd (as @yssoe says) and/or generic infrastructure tools like Chef, Vagrant or Docker provide. For monitoring, indeed, I'm not aware of anything strong. The section named "Creating our custom monitoring command" in Chapter 11 gives some clues on how easy it is to implement such functionality. It's all REST + JSON, so it should be easy and cost-effective to contract someone on Upwork to develop something that exactly fits your needs, and potentially open-source it as well. There is indeed a gap.
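Since scrapyd's API is plain JSON, a minimal monitoring check can be as small as parsing its `listjobs.json` response. A sketch (the endpoint and its `pending`/`running`/`finished` fields are per the scrapyd docs; the sample payload and job IDs are invented):

```python
import json

def job_counts(payload: str) -> dict:
    """Summarize a scrapyd listjobs.json response by job state."""
    data = json.loads(payload)
    return {state: len(data.get(state, [])) for state in ("pending", "running", "finished")}

# Sample response shaped like scrapyd's listjobs.json output (IDs made up):
sample = json.dumps({
    "status": "ok",
    "pending": [],
    "running": [{"id": "abc123", "spider": "properties"}],
    "finished": [{"id": "def456", "spider": "properties"},
                 {"id": "ghi789", "spider": "properties"}],
})
print(job_counts(sample))  # {'pending': 0, 'running': 1, 'finished': 2}
```

In practice you would fetch the payload from `http://<scrapyd-host>:6800/listjobs.json?project=<name>` and alert on whatever thresholds matter to you.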

pablohoffman commented 8 years ago

Hi @inkrement, we have no plans to provide a self-hosted version of Scrapinghub, simply because it's too much work to maintain a separate appliance version of our platform (we're a small team!) and we've yet to find: 1. a customer our infrastructure can't accommodate, and 2. a customer willing to sponsor its development (we're talking north of a couple hundred grand).

I'm curious to understand what your concerns are about running your spiders in Scrapinghub. Would you have the same concerns about, say, hosting your web app on Heroku or your code on GitHub? Thanks in advance for your insights!

inkrement commented 8 years ago

@yssoe Thanks for your input. Scrapyd looks very promising, I'll take a look at it!

@lookfwd Oh, nice - I skipped that chapter back then, but I will read it. Maybe I will code something myself; I studied software engineering, so that should not be a problem, but I hoped there were already some existing tools.

@pablohoffman I have no concerns and I would love to use Scrapinghub, but I work for a university and we have our own servers. If I pay for external infrastructure or services, I have to justify why I am not using our own hardware, and that's the only reason against it. That's not easy to do, especially because usability is not really a convincing reason for them.

pablohoffman commented 8 years ago

@inkrement thanks for clarifying. Would love to continue the chat offline; you can reach me at pablo in scrapinghub.com