scrapinghub / scrapyrt

HTTP API for Scrapy spiders

Authentication mechanism on the REST API of scrapyrt #68

Open aleroot opened 6 years ago

aleroot commented 6 years ago

Basically, I want to prevent unauthorized clients from accessing the scrapyrt API. Is there anything built in that handles authorization?

What kind of approach do you suggest?

In addition, I would like to know whether there is a mechanism to limit the maximum number of requests per client.

pawelmhm commented 6 years ago

hey @aleroot

is there anything built in that handles authorization?

No, nothing built in. You can do it in different ways. One way is to put scrapyrt behind another web server, for example nginx, and configure rate limiting and auth there.
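A minimal nginx config along these lines could handle both, as a sketch: the port matches scrapyrt's default (9080), but the paths and limits are illustrative, and the .htpasswd file has to be created separately (e.g. with the htpasswd tool).

# /etc/nginx/conf.d/scrapyrt.conf (illustrative)
# Rate limit keyed by client IP: at most 5 requests/second per client.
limit_req_zone $binary_remote_addr zone=scrapyrt_limit:10m rate=5r/s;

server {
    listen 80;

    location / {
        auth_basic           "scrapyrt";
        auth_basic_user_file /etc/nginx/.htpasswd;
        limit_req            zone=scrapyrt_limit burst=10;
        # scrapyrt listening on its default port on the same host
        proxy_pass           http://127.0.0.1:9080;
    }
}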

The other option is to write some Python code, overriding scrapyrt's default resource.

There is an option to create your own "resources", i.e. your own request handlers. You can do it by subclassing CrawlResource and overriding some methods, e.g. render_GET, then calling super().

Adding resources is described here: http://scrapyrt.readthedocs.io/en/latest/api.html#resources

For example, you can write a resource like this:

from scrapyrt.resources import CrawlResource


class AleRootCrawlResource(CrawlResource):

    def render_GET(self, request, **kwargs):
        # your code goes here, e.g. fetch the basic auth header
        # from the request and validate the credentials
        ...
        return super(AleRootCrawlResource, self).render_GET(
            request, **kwargs)

I'll think about adding some more extensive examples to the docs with a basic auth header; it could be useful for others.
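In the meantime, here is a minimal sketch of such a resource. The class name and environment-variable credentials are illustrative; it relies on Twisted's request.getUser()/getPassword() helpers to parse the Basic Auth header, and on scrapyrt turning twisted.web.error.Error into a JSON error response (which is how its own 400 errors are produced).

import os

from scrapyrt.resources import CrawlResource
from twisted.web.error import Error

# Illustrative: read the expected credentials from the environment
# instead of hard-coding them in the source.
API_USER = os.environ.get("SCRAPYRT_USER", "admin")
API_PASS = os.environ.get("SCRAPYRT_PASS", "change-me")


class BasicAuthCrawlResource(CrawlResource):

    def check_auth(self, request):
        # Twisted parses the Basic Auth header for us; both calls
        # return empty bytes when no credentials were sent.
        user = request.getUser().decode("utf-8")
        password = request.getPassword().decode("utf-8")
        if user != API_USER or password != API_PASS:
            # scrapyrt converts twisted.web.error.Error into a JSON
            # error response with the given status code.
            raise Error("401", "Unauthorized")

    def render_GET(self, request, **kwargs):
        self.check_auth(request)
        return super(BasicAuthCrawlResource, self).render_GET(
            request, **kwargs)

    def render_POST(self, request, **kwargs):
        self.check_auth(request)
        return super(BasicAuthCrawlResource, self).render_POST(
            request, **kwargs)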

oscarcontrerasnavas commented 2 years ago

Hi, I know this thread is a bit old, but bear with me. I explored this solution and created my own resource, but when I tried to add it according to the documentation, the only way that worked for me was pointing to a specific settings.py file on the command line, like this:

scrapyrt -S nist_scraper.scrapyrt.settings
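For reference, that settings module just remaps scrapyrt's RESOURCES setting to my resource, roughly like this (the exact class path is illustrative):

# nist_scraper/scrapyrt/settings.py
# Replace the default handler for the crawl.json endpoint
# (scrapyrt.resources.CrawlResource) with the custom one.
RESOURCES = {
    'crawl.json': 'nist_scraper.scrapyrt.resources.CrawlResource',
}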

It worked in my local environment; the main CrawlResource is now the one I coded. But I tried to do the same on Heroku, in the Procfile, as follows:

web: scrapyrt -S nist_scraper.scrapyrt.settings -i 0.0.0.0 -p $PORT

and the ScrapyRT part still uses the default resource. I do not know whether I cannot start scrapyrt with arguments on Heroku, or whether there is another way to override the resources safely.

Repo here: https://github.com/oscarcontrerasnavas/nist-webbook-scrapyrt-spider