python / cpython

robotparser should support specifying SSL context #87763

Open 8917e0d2-03f5-417a-8c58-2b9b496898d5 opened 3 years ago

8917e0d2-03f5-417a-8c58-2b9b496898d5 commented 3 years ago
BPO 43597
Nosy @berkerpeksag, @Tchinmai7
PRs
  • python/cpython#24984
  • python/cpython#24986
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['type-feature', 'library', '3.10']
    title = 'robotparser should support specifying SSL context'
    updated_at =
    user = 'https://github.com/Tchinmai7'
    ```

    bugs.python.org fields:
    ```python
    activity =
    actor = 'Tchinmai7'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation =
    creator = 'Tchinmai7'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 43597
    keywords = ['patch']
    message_count = 3.0
    messages = ['389352', '390395', '390396']
    nosy_count = 2.0
    nosy_names = ['berker.peksag', 'Tchinmai7']
    pr_nums = ['24984', '24986']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue43597'
    versions = ['Python 3.10']
    ```

    8917e0d2-03f5-417a-8c58-2b9b496898d5 commented 3 years ago

    IMO this could be enhanced by adding an sslcontext parameter to the read() method.

    A sample change could look like this:

    def read(self, sslcontext=None):
        """Reads the robots.txt URL and feeds it to the parser."""
        try:
            if sslcontext:
                f = urllib.request.urlopen(self.url, context=sslcontext)
            else:
                f = urllib.request.urlopen(self.url)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif err.code >= 400 and err.code < 500:
                self.allow_all = True
        else:
            raw = f.read()
            self.parse(raw.decode("utf-8").splitlines())
    

    Happy to send a PR if this proposal makes sense.
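
For anyone who wants to try the idea without patching the standard library, the proposal above could be approximated with a subclass. This is only a sketch: the class name `SSLContextRobotFileParser` is hypothetical, and it relies on `context=None` being the default for `urllib.request.urlopen`, so a single call covers both branches of the sample change.

```python
import ssl
import urllib.error
import urllib.request
import urllib.robotparser


class SSLContextRobotFileParser(urllib.robotparser.RobotFileParser):
    """Hypothetical subclass whose read() accepts an optional SSL context."""

    def read(self, sslcontext=None):
        """Read the robots.txt URL and feed it to the parser."""
        try:
            # context=None is urlopen's default, so no if/else is needed.
            f = urllib.request.urlopen(self.url, context=sslcontext)
        except urllib.error.HTTPError as err:
            # Same status handling as the sample change above.
            if err.code in (401, 403):
                self.disallow_all = True
            elif 400 <= err.code < 500:
                self.allow_all = True
        else:
            with f:
                raw = f.read()
            self.parse(raw.decode("utf-8").splitlines())
```

A caller would construct the parser with the robots.txt URL as usual and then call `read(sslcontext=ctx)`, where `ctx` is any `ssl.SSLContext` the caller has built.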

    berkerpeksag commented 3 years ago

    I'm not opposed to the idea, but what's the practical use case here? I haven't seen a case where you needed to pass a custom SSLContext in order to fetch the robots.txt file.

    8917e0d2-03f5-417a-8c58-2b9b496898d5 commented 3 years ago

    I am writing a web scraper that runs in a container where the CA certificates are stored in a non-standard location. The CA certificates are managed by Certifi. Allowing the sslcontext to be overridden would let the user construct an SSLContext and pass it in.
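
To illustrate the use case, a context that trusts a CA bundle in a non-standard location might be built roughly as follows. This is a sketch under assumptions: the helper name `build_ssl_context` is hypothetical, and the Certifi call appears only in a comment so that the block runs without third-party packages.

```python
import ssl


def build_ssl_context(ca_bundle=None):
    """Return a verifying SSLContext; ca_bundle is a path to a PEM CA file."""
    # With cafile=None, the system's default trust store is loaded instead.
    return ssl.create_default_context(cafile=ca_bundle)


# With Certifi installed, the bundle path would come from certifi.where():
#     ctx = build_ssl_context(certifi.where())
ctx = build_ssl_context()
```

The resulting context keeps certificate and hostname verification enabled, and it is the kind of object the proposed sslcontext parameter would accept.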