ValueError when parsing Chromium preferences

0x4d4c commented 8 months ago

Describe the problem

When parsing an almost pristine Chromium profile on Fedora Linux, Hindsight crashes with an uncaught exception raised by urllib's parsing functions.

The exception is raised in the build_md5_hash_list_of_origins method of the WebBrowser class. However, I suspect it can occur anywhere the following pattern is used:

for artifact in self.parsed_artifacts:
   # [...snip...]
   domain = urllib.parse.urlparse(artifact.url).hostname
   # [...snap...]

In my quick and dirty tests it was sufficient to wrap the calls to urlparse in a try-except block in the build_md5_hash_list_of_origins method of the WebBrowser class and build_hsts_domain_hashes in the Chrome class. In the except, I just set the domain to None, which solves the problem of Hindsight crashing but might lead to missing URLs in the resulting lists.
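
A minimal sketch of that workaround, for illustration only: the helper name safe_hostname is hypothetical and not part of Hindsight, and the sketch assumes only that urllib.parse.urlparse raises ValueError for hosts it rejects.

```python
import urllib.parse


def safe_hostname(url):
    """Return the hostname of url, or None if it cannot be parsed.

    Hypothetical helper for illustration; not part of Hindsight.
    """
    if not isinstance(url, str):
        return None
    try:
        return urllib.parse.urlparse(url).hostname
    except ValueError:
        # urllib rejects malformed bracketed hosts; treat the URL as
        # having no domain instead of letting the exception propagate
        return None


print(safe_hostname("https://example.com/index.html"))
print(safe_hostname("http://[invalid"))
```

Callers that already skip artifacts without a domain (the existing "if domain:" check) would then skip unparsable URLs as well, which is the trade-off mentioned above.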

Screenshots or Console Output

$ hindsight.py -i ~/.config/chromium/Default/ -f jsonl

################################################################################

                   _     _           _     _       _     _
                  | |   (_)         | |   (_)     | |   | |
                  | |__  _ _ __   __| |___ _  __ _| |__ | |_
                  | '_ \| | '_ \ / _` / __| |/ _` | '_ \| __|
                  | | | | | | | | (_| \__ \ | (_| | | | | |_
                  |_| |_|_|_| |_|\__,_|___/_|\__, |_| |_|\__|
                                              __/ |
                        by @_RyanBenson      |___/ v2023.03

################################################################################

       Start time: 2024-03-01 12:44:10.443
  Input directory: /home/dfir/.config/chromium/Default/
      Output name: Hindsight Report (2024-03-01T12-44-10).jsonl

 Processing:

    Profile: /home/dfir/.config/chromium/Default/
                     Detected Chrome version:            [    111 ]            
                                 URL records:            [      8 ]            
                            Download records:            [      0 ]            
                           GPU Cache records:            [      0 ]            
                              Cookie records:            [      8 ]            
                            Autofill records:            [      0 ]            
                            Bookmark records:            [      1 ]            
                       Local Storage records:            [     14 ]            
                     Session Storage records:            [     10 ]            
                                  Extensions:            [      1 ]            
                          Login Data records:            [      0 ]            
                            Preference Items:            [     19 ]            
Traceback (most recent call last):
  File "/home/dfir/git/hindsight/./hindsight.py", line 338, in <module>
    main()
  File "/home/dfir/git/hindsight/./hindsight.py", line 212, in main
    run_status = analysis_session.run()
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dfir/git/hindsight/pyhindsight/analysis.py", line 529, in run
    browser_analysis.process()
  File "/home/dfir/git/hindsight/pyhindsight/browsers/chrome.py", line 2514, in process
    self.get_site_characteristics(self.profile_path, 'Site Characteristics Database')
  File "/home/dfir/git/hindsight/pyhindsight/browsers/chrome.py", line 2157, in get_site_characteristics
    self.build_md5_hash_list_of_origins()
  File "/home/dfir/git/hindsight/pyhindsight/browsers/webbrowser.py", line 117, in build_md5_hash_list_of_origins
    domain = urllib.parse.urlparse(artifact.url).hostname
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/urllib/parse.py", line 395, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/urllib/parse.py", line 500, in urlsplit
    _check_bracketed_host(bracketed_host)
  File "/usr/lib64/python3.12/urllib/parse.py", line 446, in _check_bracketed_host
    ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/ipaddress.py", line 54, in ip_address
    raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
ValueError: '*.' does not appear to be an IPv4 or IPv6 address

Expected behavior

Per RFC 3986 (Section 3.2.2), square brackets are only allowed in the host component to delimit IP literals. Chromium does not adhere to this specification, so urllib is doing the correct thing by rejecting the value. In my opinion, Hindsight should check the entries under .profile.content_settings.exceptions.cookie_controls_metadata and either ensure that they contain no square brackets or normalize the URLs so that urllib can parse them.
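
One way the normalization could be sketched is shown below. The exact preference values are not included in this report; the sketch assumes they use Chromium's "[*.]" wildcard prefix, which matches the '*.' in the traceback. The function name normalize_pattern and the sample value are hypothetical.

```python
import urllib.parse


def normalize_pattern(pattern):
    """Drop the "[*.]" wildcard prefix so urllib can parse the rest.

    Hypothetical helper; assumes the wildcard appears as "[*.]".
    """
    return pattern.replace("[*.]", "")


# Sample value is an assumption, not taken from the reporter's profile
url = normalize_pattern("https://[*.]example.com:443")
print(urllib.parse.urlparse(url).hostname)
```

Note that stripping the wildcard discards the information that the pattern also covered subdomains, so a real fix might want to record that separately.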

To Reproduce

Steps to reproduce the behavior:

1. Put the Preferences.gz file into the profile folder
2. Run hindsight.py -i ~/.config/chromium/Default/
3. See error

hindsight.log Snippet

The log doesn't give any additional information, but I'm happy to provide it if necessary.

System Details

Analysis System OS: Fedora Linux 39 Workstation
Method of Running Hindsight: hindsight.py using Python 3.12.2 on Fedora Linux 39
Hindsight version: Commit f186013 from GitHub and 20230327.0 from PyPI
Target System OS: Fedora Linux 39 Workstation
Target Browser: Chromium
Target Browser Version: 122.0.6261.69

Additional context

None. If you need more context, let me know.

obsidianforensics commented 8 months ago

Thanks for the detailed bug report! I'll take a look.

Rivers-dev commented 8 months ago

This bug can be fixed by changing the handling of artifact.url in webbrowser.py and artifact_url in chrome.py:

In webbrowser.py:

    def build_md5_hash_list_of_origins(self):
        for artifact in self.parsed_artifacts:
            if isinstance(artifact, self.HistoryItem):
                # Bug here - does not work due to URL parsing - PATCHED
                # Skip only this artifact (continue, not return) so the
                # remaining artifacts are still processed
                if not isinstance(artifact.url, str):
                    continue
                domain = urllib.parse.urlparse(artifact.url).hostname
                # Some URLs don't have a domain, like local PDF files
                if domain:
                    self.origin_hashes[hashlib.md5(domain.encode()).hexdigest()] = domain

In chrome.py:

    def build_hsts_domain_hashes(self):
        domains = set()
        for artifact in self.parsed_artifacts:
            if isinstance(artifact, self.HistoryItem):
                artifact_url = artifact.url

                # Skip missing or non-string URLs
                if not isinstance(artifact_url, str) or not artifact_url:
                    continue

                # Cookie artifact's "URLs" will be in the form ".example.com",
                # which won't parse, so modify it so it will.
                # str is immutable, so remove '*' with replace() rather than
                # by assigning to individual characters
                if artifact_url.startswith('.'):
                    artifact_url = 'http://' + artifact_url[1:].replace('*', '')

                # Same URL problem - PATCHED
                # Split on the '*' wildcard instead of calling urlparse
                for url in artifact_url.split('*'):
                    # Skip empty fragments left over from the split
                    if url:
                        domains.add(url)
I am not sure if this fix retains all functionality, but it at least lets it run and produces correct output on my end.

obsidianforensics commented 7 months ago

Thanks again for the report and the comments! Should be fixed now.

nikelborm commented 3 months ago

Hi, I just installed the pyhindsight package using pip and got this error. Has the package been republished with the fix, or was the fix not sufficient?

obsidianforensics / hindsight

ValueError when parsing Chromium preferences #162