
Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default. #60055

Open 8a76f92f-eaeb-4336-a65a-79cec9841ade opened 12 years ago

8a76f92f-eaeb-4336-a65a-79cec9841ade commented 12 years ago
BPO 15851
Nosy @rhettinger, @terryjreedy, @orsenthil, @ezio-melotti, @karlcow
Files
  • robotparser.py.diff: robotparser.py patch (against the mercurial 2.7 branch).
  • myrobotparser.py: Example
  • test.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['3.7', 'type-feature', 'library']
    title = "Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default."
    updated_at =
    user = 'https://bugs.python.org/dualbus'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'rhettinger'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation =
    creator = 'dualbus'
    dependencies = []
    files = ['27100', '27101', '27158']
    hgrepos = []
    issue_num = 15851
    keywords = ['patch']
    message_count = 15.0
    messages = ['169718', '169722', '170006', '170007', '170136', '170137', '170262', '170265', '170274', '183579', '221295', '221316', '221317', '221318', '221327']
    nosy_count = 7.0
    nosy_names = ['rhettinger', 'terry.reedy', 'orsenthil', 'ezio.melotti', 'karlcow', 'tshepang', 'dualbus']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue15851'
    versions = ['Python 3.7']
    ```

    Linked PRs

    8a76f92f-eaeb-4336-a65a-79cec9841ade commented 12 years ago

    I found that http://en.wikipedia.org/robots.txt returns 403 if the provided user agent is in a specific blacklist.

    And since robotparser doesn't provide a mechanism to change the default user agent used by the opener, it becomes unusable for that site (and sites that have a similar policy).

    I think the user should be able to set a specific user agent string, to better identify their bot.

    I attach a patch that allows the user to change the opener used by RobotFileParser, in case the need for some specific behavior arises.

    I also attach a simple example of how it solves the issue, at least with wikipedia.

    8a76f92f-eaeb-4336-a65a-79cec9841ade commented 12 years ago

    I guess a workaround is to do:

    robotparser.URLopener.version = 'MyVersion'
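
    (For reference, a minimal sketch of how this workaround can be applied on Python 2.7, where robotparser defines a module-level URLopener class whose version attribute is sent as the User-Agent header; the 'MyBot/1.0' string is only an example.)

    # Python 2.7 sketch: robotparser.URLopener is the opener class used by
    # RobotFileParser.read(); its "version" attribute becomes the User-Agent.
    import robotparser

    robotparser.URLopener.version = 'MyBot/1.0'   # example agent string

    rp = robotparser.RobotFileParser('http://en.wikipedia.org/robots.txt')
    rp.read()
    print rp.can_fetch('MyBot/1.0', 'http://en.wikipedia.org/wiki/Main_Page')
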
    terryjreedy commented 12 years ago

    Enhancements can only be targeted at 3.4, where robotparser is now urllib.robotparser

    I wonder if documenting the simple solution would be sufficient.

    terryjreedy commented 12 years ago

    In any case, a doc change *could* go in 2.7 and 3.3/2.

    8a76f92f-eaeb-4336-a65a-79cec9841ade commented 12 years ago

    I'm not sure what the best approach is here.

    1. Avoid changes in the Lib, and document a work-around, which involves installing an opener with the specific User-Agent. The drawback is that it modifies the behaviour of urlopen() globally, so the change affects any other call to urllib.request.urlopen.

    2. Revert to the old way, using an instance of a FancyURLopener (or URLopener) in the RobotFileParser class. This requires a modification of the Lib, but allows us to modify only the behaviour of that specific instance of RobotFileParser. The user could sub-class FancyURLopener and set the appropriate version string (a manual approximation is sketched below).

    I attach a script, tested against the default branch of the mercurial repository. It shows the work around for python3.3.
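
    (Roughly what option 2 enables, approximated manually on Python 2.7 without touching the Lib; MyOpener and the 'MyBot/0.1' string are illustrative only: subclass FancyURLopener with a custom version, fetch robots.txt with it, and feed the lines to parse().)

    # Python 2.7 sketch: a FancyURLopener subclass carrying a custom User-Agent
    # fetches robots.txt, and RobotFileParser.parse() consumes the raw lines.
    import urllib
    import robotparser

    class MyOpener(urllib.FancyURLopener):
        version = 'MyBot/0.1'   # sent as the User-Agent header

    lines = MyOpener().open('http://en.wikipedia.org/robots.txt').read().splitlines()

    rp = robotparser.RobotFileParser()
    rp.parse(lines)
    print rp.can_fetch('MyBot/0.1', 'http://en.wikipedia.org/wiki/Main_Page')
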

    8a76f92f-eaeb-4336-a65a-79cec9841ade commented 12 years ago

    I forgot to mention that I ran a nc process in parallel, to see what data is being sent: nc -l -p 9999.

    orsenthil commented 12 years ago

    Hello Eduardo,

    I fail to see the bug here. The robotparser module is for reading and parsing the robots.txt file; the module responsible for fetching it could be urllib. robots.txt is always available from the web server, and you can download it by any means, even by using robotparser's read() and providing the full URL to robots.txt. You do not need to set a user agent to read/fetch the robots.txt file. Once it is fetched, when you are crawling the site with your custom-written crawler or with urllib, you can honor the User-Agent requirement by sending proper headers with your request. That can be done using the urllib module itself, and there is documentation on adding headers, I believe.

    I think this is the way most folks would be (or I believe are) using it. Am I missing something? If my explanation above is okay, then we can close this bug as invalid.

    Thanks, Senthil

    8a76f92f-eaeb-4336-a65a-79cec9841ade commented 12 years ago

    Hi Senthil,

    I fail to see the bug here. The robotparser module is for reading and parsing the robots.txt file; the module responsible for fetching it could be urllib.

    You're right, but robotparser's read() calls urllib.request.urlopen to fetch the robots.txt file. If robotparser took a file object, or something like that, instead of a URL, I wouldn't think of this as a bug, but it doesn't. The default behaviour is for it to fetch the file itself, using urlopen.

    Also, I'm aware that you shouldn't normally worry about setting a specific user-agent to fetch the file. But that's not the case with Wikipedia. In my case, Wikipedia returned 403 for the urllib user-agent. And since there's no documented way of specifying a particular user-agent in robotparser, or of feeding a file object to robotparser, I decided to report this.

    Only after reading the source of 2.7.x and 3.x can one find work-arounds for this problem, since the documentation doesn't really make clear how these make the requests for robots.txt.
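
    (For illustration, one such undocumented work-around on Python 3, sketched here with an example 'MyBot/0.1' agent string: fetch robots.txt yourself with an explicit User-Agent and hand the lines to parse(), leaving the global opener untouched.)

    # Python 3 sketch: fetch robots.txt with an explicit User-Agent and feed the
    # decoded lines to RobotFileParser.parse(), bypassing read()'s urlopen() call.
    import urllib.request
    import urllib.robotparser

    req = urllib.request.Request('http://en.wikipedia.org/robots.txt',
                                 headers={'User-Agent': 'MyBot/0.1'})
    with urllib.request.urlopen(req) as resp:
        lines = resp.read().decode('utf-8').splitlines()

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(lines)
    print(rp.can_fetch('MyBot/0.1', 'http://en.wikipedia.org/wiki/Main_Page'))
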

    orsenthil commented 12 years ago

    Hi Eduardo,

    I tested further and do observe some very strange oddities.

    On Mon, Sep 10, 2012 at 10:45 PM, Eduardo A. Bustamante López <report@bugs.python.org> wrote:

    Also, I'm aware that you shouldn't normally worry about setting a specific user-agent to fetch the file. But that's not the case with Wikipedia. In my case, Wikipedia returned 403 for the urllib user-agent.

    Yeah, this really surprised me. I would normally assume robots.txt to be readable by any agent, but I think something odd is happening.

    In 2.7, I do not see the problem because the implementation is:

    import urllib
    
    class URLOpener(urllib.FancyURLopener):
        def __init__(self, *args):
            urllib.FancyURLopener.__init__(self, *args)
            self.errcode = 200
    
    opener = URLOpener()
    fobj = opener.open('http://en.wikipedia.org/robots.txt')
    print opener.errcode

    This will print 200 and everything is fine. Also, note that robots.txt is accessible.

    In 3.3, the implementation is:

    import urllib.request

    try:
        fobj = urllib.request.urlopen('http://en.wikipedia.org/robots.txt')
    except urllib.error.HTTPError as err:
        print(err.code)

    This gives 403. I would normally expect this to work without any issues. But according to my analysis, what is happening is that when the User-Agent is set to something which has a '-' in it, the server rejects the request with 403.

    In the above code, what is happening underlying is this:

    import urllib.request
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', 'Python-urllib/3.3')]
    fobj = opener.open('http://en.wikipedia.org/robots.txt')
    print(fobj.getcode())

    This would give 403. In order to see it work, change the addheaders line to

    opener.addheaders = [('', '')]
    opener.addheaders = [('User-agent', 'Pythonurllib/3.3')]
    opener.addheaders = [('User-agent', 'KillerSpamBot')]

    All should work (as expected).

    So, the thing which surprises me is whether sending "Python-urllib/3.3" is a mistake for THAT server. Is this a server oddity on Wikipedia's part? (I referred to the hg log to see since when we have been sending Python-urllib/version, and it seems it has been sent for a long time.)

    I can't see how this should be fixed in urllib.

    b5530672-af90-4755-8dcc-6d0806f6cc01 commented 11 years ago

    Setting a user agent string should be possible. My guess is that the default library has been used by an abusive client (by mistake or by intent) and the Wikimedia project has decided to blacklist the client based on user-agent string sniffing.

    The match is on anything which matches

    "Python-urllib" in UserAgentString

    See below:

    >>> import urllib.request
    >>> opener = urllib.request.build_opener()
    >>> opener.addheaders = [('User-agent', 'Python-urllib')]
    >>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 479, in open
        response = meth(req, response)
      File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 591, in http_response
        'http', request, response, code, msg, hdrs)
      File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 517, in error
        return self._call_chain(*args)
      File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 451, in _call_chain
        result = func(*args)
      File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 599, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 403: Forbidden
    >>> import urllib.request
    >>> opener = urllib.request.build_opener()
    >>> opener.addheaders = [('User-agent', 'Pythonurllib/3.3')]
    >>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
    >>> fobj
    <http.client.HTTPResponse object at 0x101275850>
    >>> import urllib.request
    >>> opener = urllib.request.build_opener()
    >>> opener.addheaders = [('User-agent', 'Pyt-honurllib/3.3')]
    >>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
    >>> import urllib.request
    >>> opener = urllib.request.build_opener()
    >>> opener.addheaders = [('User-agent', 'Python-urllib')]
    >>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 479, in open
        response = meth(req, response)
      File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 591, in http_response
        'http', request, response, code, msg, hdrs)
      File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 517, in error
        return self._call_chain(*args)
      File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 451, in _call_chain
        result = func(*args)
      File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 599, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 403: Forbidden
    >>> import urllib.request
    >>> opener = urllib.request.build_opener()
    >>> opener.addheaders = [('User-agent', 'Python-urlli')]
    >>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
    >>> 

    Being able to change the header might indeed be a good thing.

    83d2e70e-e599-4a04-b820-3814bbdb9bef commented 10 years ago

    The code given in msg183579 works perfectly in 3.4.1 and 3.5.0. Is there anything to fix here, whether code or docs?

    b5530672-af90-4755-8dcc-6d0806f6cc01 commented 10 years ago

    Mark,

    The code uses urllib to demonstrate the issue with Wikipedia and other sites which block the Python-urllib user agent because it is used by many spam harvesters.

    The proposal is about adding a feature to the robotparser lib for setting the user agent.

    b5530672-af90-4755-8dcc-6d0806f6cc01 commented 10 years ago

    Note that one of the proposals is to just document in https://docs.python.org/3/library/urllib.robotparser.html the proposal made in msg169722 (available in 3.4+):

        robotparser.URLopener.version = 'MyVersion'
    83d2e70e-e599-4a04-b820-3814bbdb9bef commented 10 years ago

    c:\cpython\PCbuild>python_d.exe -V
    Python 3.5.0a0

    c:\cpython\PCbuild>type C:\Users\Mark\MyPython\mytest.py
    #!/usr/bin/env python3
    # -*- coding: latin-1 -*-

    import urllib.request
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', 'Python-urllib')]
    fobj = opener.open('http://en.wikipedia.org/robots.txt')
    print('Finished, no traceback here')

    c:\cpython\PCbuild>python_d.exe C:\Users\Mark\MyPython\mytest.py
    Finished, no traceback here

    b5530672-af90-4755-8dcc-6d0806f6cc01 commented 10 years ago
    → python
    Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
    [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import robotparser
    >>> rp = robotparser.RobotFileParser('http://somesite.test.site/robots.txt')
    >>> rp.read()
    >>> 

    Let's check the server logs:

    127.0.0.1 - - [23/Jun/2014:08:44:37 +0900] "GET /robots.txt HTTP/1.0" 200 92 "-" "Python-urllib/1.17"

    In 2.*, robotparser by default used the Python-urllib/1.17 user agent, which is traditionally blocked by many sysadmins. A solution has already been proposed above:

    This is the proposed test for 3.4

    import urllib.robotparser
    import urllib.request
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', 'MyUa/0.1')]
    urllib.request.install_opener(opener)
    rp = urllib.robotparser.RobotFileParser('http://localhost:9999')
    rp.read()

    The issue is no longer about changing the lib, but just about documenting how to change the RobotFileParser default UA. We can change the title of this issue if it's confusing, or close it and open a new one for documenting what makes it easier :)

    Currently robotparser.py uses the default urllib user agent: http://hg.python.org/cpython/file/7dc94337ef67/Lib/urllib/request.py#l364

    It's a common failure we encounter when using urllib in general, including robotparser.

    As for Wikipedia, they fixed their server-side user agent sniffing and no longer filter python-urllib.

    GET /robots.txt HTTP/1.1
    Accept: */*
    Accept-Encoding: gzip, deflate, compress
    Host: en.wikipedia.org
    User-Agent: Python-urllib/1.17

    HTTP/1.1 200 OK
    Accept-Ranges: bytes
    Age: 3161
    Cache-control: s-maxage=3600, must-revalidate, max-age=0
    Connection: keep-alive
    Content-Encoding: gzip
    Content-Length: 5208
    Content-Type: text/plain; charset=utf-8
    Date: Sun, 22 Jun 2014 23:59:16 GMT
    Last-modified: Tue, 26 Nov 2013 17:39:43 GMT
    Server: Apache
    Set-Cookie: GeoIP=JP:Tokyo:35.6850:139.7514:v4; Path=/; Domain=.wikipedia.org
    Vary: X-Subdomain
    Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
    X-Article-ID: 19292575
    X-Cache: cp1065 miss (0), cp4016 hit (1), cp4009 frontend hit (215)
    X-Content-Type-Options: nosniff
    X-Language: en
    X-Site: wikipedia
    X-Varnish: 2529666795, 2948866481 2948865637, 4134826198 4130750894

    Many other sites still do. :)

    serhiy-storchaka commented 6 months ago

    @orsenthil, what is your opinion about allowing the use of a pre-created Request object instead of a URL? #103753 looks almost ready (it only lacks documentation), but I am not sure that it is the right solution.

    orsenthil commented 6 months ago

    but I am not sure that it is the right solution.

    It complicates things a bit. For an archaic protocol like urllib.robotparser.RobotParser, requiring a Request object to be provided seems like overkill. If the same can be accomplished using strings, that would be preferable.