jazzband / django-robots

A Django app for managing robots.txt files following the robots exclusion protocol
https://django-robots.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Feature request: @block_robots decorator for views #42

Open groovecoder opened 9 years ago

groovecoder commented 9 years ago

It would be nice if django-robots included a decorator to block robots from views based on User-Agent (like robots.txt does). It would help Django apps outright prevent robots, even misbehaving ones that don't follow robots.txt, from accessing views that they shouldn't.
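For illustration, here is a minimal sketch of what such a decorator could look like if it took the blocked User-Agent substrings as an argument; the name block_robots and its agents parameter are hypothetical, not part of django-robots:

from functools import wraps
from django.http import HttpResponseForbidden

def block_robots(agents):
    # Hypothetical decorator factory: reject requests whose User-Agent contains any entry in `agents`
    def decorator(view_func):
        @wraps(view_func)
        def _wrapped(request, *args, **kwargs):
            ua = request.META.get('HTTP_USER_AGENT', '').lower()
            if any(agent.lower() in ua for agent in agents):
                return HttpResponseForbidden("Robots are not allowed here.")
            return view_func(request, *args, **kwargs)
        return _wrapped
    return decorator

@block_robots(agents=['badbot', 'scrapycrawler'])
def members_only_view(request):
    ...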

SalahAdDin commented 9 years ago

:+1:

yakky commented 8 years ago

I think it's out of the scope of this application. IMHO, a decorator that blocks "rogue" robots from accessing a view is an application in itself, as you need to implement and maintain the list of robot UA strings (and even then I doubt 'rogue' robots use a specific UA string).

some1ataplace commented 1 year ago
  1. In your Django project, create a new file called block_robots.py with the following code:
import re
from functools import wraps
from django.http import HttpResponseForbidden

def block_robots(view_func):
    @wraps(view_func)
    def _wrapped_view(request, *args, **kwargs):
        # Update the list of blocked user agents accordingly
        blocked_agents = [
            'Googlebot',
            'Bingbot',
            'Slurp',
            'DuckDuckBot',
            'Baiduspider',
            'YandexBot',
            'Sogou',
            'Exabot',
            'Facebot',
            'ia_archiver'
        ]
        user_agent = request.META.get('HTTP_USER_AGENT', "")

        if any(re.search(agent, user_agent, re.IGNORECASE) for agent in blocked_agents):
            return HttpResponseForbidden("Forbidden for robots")

        return view_func(request, *args, **kwargs)
    return _wrapped_view
  2. Now you can use the @block_robots decorator in your views.py:
from django.http import HttpResponse
from .block_robots import block_robots

@block_robots
def my_protected_view(request):
    return HttpResponse("This view is protected from robots.")

This code defines a block_robots decorator that checks whether the User-Agent of the incoming request matches any of the blocked agents in the list. If a match is found, an HTTP 403 Forbidden response is returned; otherwise the request continues to the wrapped view.

Feel free to customize the list of blocked agents according to your requirements. The entries are treated as case-insensitive regular expressions, so partial matches work and a single pattern can cover several crawlers at once.
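For instance (purely illustrative), one regex entry can match many user agents:

import re

# Illustrative only: a single regex pattern in blocked_agents covers several crawlers
blocked_agents = [r'bot|crawler|spider', r'ia_archiver']

ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
print(any(re.search(agent, ua, re.IGNORECASE) for agent in blocked_agents))  # True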

Remember that this decorator only blocks bots that identify themselves via the User-Agent header; a misbehaving bot can simply spoof it. For well-behaved crawlers, a properly configured robots.txt file remains the standard way to restrict access.
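For reference, serving robots.txt itself with django-robots looks roughly like this, assuming the standard setup from the django-robots documentation (the app requires the sites framework):

# settings.py
INSTALLED_APPS = [
    # ...
    'django.contrib.sites',
    'robots',
]
SITE_ID = 1

# urls.py
from django.urls import include, path

urlpatterns = [
    # ...
    path('robots.txt', include('robots.urls')),
]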


Here is sample code for a custom @block_robots decorator that blocks robots from views based on a User-Agent list defined in your Django settings:

# views.py
from functools import wraps

from django.conf import settings
from django.http import HttpResponse, HttpResponseForbidden

def robot_blocked(user_agent):
    # BLOCKED_ROBOTS holds lowercase substrings to look for in the User-Agent header
    blocked_robots = getattr(settings, 'BLOCKED_ROBOTS', [])
    return any(robot in user_agent.lower() for robot in blocked_robots)

def block_robots(view_func):
    # Return 403 Forbidden when the request's User-Agent matches BLOCKED_ROBOTS
    @wraps(view_func)
    def _wrapped_view(request, *args, **kwargs):
        if robot_blocked(request.META.get('HTTP_USER_AGENT', '')):
            return HttpResponseForbidden()
        return view_func(request, *args, **kwargs)
    return _wrapped_view

def my_view(request):
    # view logic here
    return HttpResponse('This is my view!')

@block_robots
def my_view_with_robot_block(request):
    # view logic here
    return HttpResponse('This is my view with robot block!')

You would need to define the BLOCKED_ROBOTS list in your Django settings file with lowercase substrings of the User-Agent strings of the robots you want to block. You can then apply the @block_robots decorator to any view you want to protect; views without it are unaffected.

Here's an example of how you could define the BLOCKED_ROBOTS in your Django settings file:

settings.py

BLOCKED_ROBOTS = [
    'googlebot',
    'bingbot',
    'yahoo',
    # add more robots here as needed
]

Note that this example is case-insensitive, so any User-Agent string containing "googlebot" will be blocked regardless of capitalization. If you want to make it case-sensitive, remove the .lower() call in the robot_blocked function.
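If you go this route, a quick check with Django's test client (which accepts the header via the HTTP_USER_AGENT keyword) can confirm the behaviour; the URL name my-protected-view below is a hypothetical placeholder for a URL wired to a decorated view:

from django.test import TestCase
from django.urls import reverse

class BlockRobotsTests(TestCase):
    def test_blocked_user_agent_gets_403(self):
        # A User-Agent containing a BLOCKED_ROBOTS entry should be rejected
        response = self.client.get(reverse('my-protected-view'), HTTP_USER_AGENT='Googlebot/2.1')
        self.assertEqual(response.status_code, 403)

    def test_regular_browser_is_allowed(self):
        # A normal browser User-Agent should pass through to the view
        response = self.client.get(reverse('my-protected-view'), HTTP_USER_AGENT='Mozilla/5.0')
        self.assertEqual(response.status_code, 200)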