Closed 0x4d4c closed 7 months ago
Thanks for the detailed bug report! I'll take a look.
This bug can be fixed by changing the format of artifact_url
and artifact.url
. in webbrowser.py and chrome.py:
In webbrowser.py:
def build_md5_hash_list_of_origins(self):
for artifact in self.parsed_artifacts:
if isinstance(artifact, self.HistoryItem):
# Bug here - does not work due to URL parsing - PATCHED
if type(artifact.url) is not str:
return
domain = urllib.parse.urlparse(artifact.url).hostname
# Some URLs don't have a domain, like local PDF files
if domain:
self.origin_hashes[hashlib.md5(domain.encode()).hexdigest()] = domain
In chrome.py:
def build_hsts_domain_hashes(self):
domains = set()
for artifact in self.parsed_artifacts:
if isinstance(artifact, self.HistoryItem):
artifact_url = artifact.url
if not artifact_url:
continue
# Cookie artifact's "URLs" will be in the form ".example.com",
# which won't parse, so modify it so it will
if artifact_url and artifact_url.startswith('.'):
for i in range(len(artifact_url)):
if artifact_url[i] == '*':
artifact_url[i] = ''
artifact_url = 'http://' + artifact_url[1:]
if type(artifact_url) == str:
artifact_url_cleaned = artifact_url.split('*')
#domain = urllib.parse.urlparse(url_list, scheme='https').hostname
# Same URL problem - PATCHED
# Some URLs don't have a domain, like local PDF files
if artifact_url_cleaned:
#domains.add(artifact_url_cleaned)
for url in artifact_url_cleaned:
domains.add(url)
I am not sure if this fix retains all functionality, but it at least lets it run and produces correct output on my end.
Thanks again for the report and the comments! Should be fixed now.
Hi, I'm just installed pyhindsight package using pip and just got this error. Did you republish the package with the fix, or the fix wasn't sufficient?
Describe the problem When parsing an almost pristine Chromium profile on Fedora Linux, Hindsight crashes with an uncaught exception raised by urllib's parsing functions.
The exact occurrence is in the
build_md5_hash_list_of_origins
method of theWebBrowser
class. However, I guess that it might occur on any place where the following functionality is implemented:In my quick and dirty tests it was sufficient to wrap the calls to
urlparse
in a try-except block in thebuild_md5_hash_list_of_origins
method of theWebBrowser
class andbuild_hsts_domain_hashes
in theChrome
class. In theexcept
, I just set the domain to None, which solves the problem of Hindsight crashing but might lead to missing URLs in the resulting lists.Screenshots or Console Output
Expected behavior Per RFC 3986 (3.2.2), square brackets are only allowed to denote IP literals. Chromium does not adhere to this specification, so that urllib does the correct thing here. In my opinion, Hindsight should check the key under
.profile.content_settings.exceptions.cookie_controls_metadata
and ensure that there are either no square brackets or normalize the URLs to be parsable by urllib.To Reproduce Steps to reproduce the behavior:
hindsight.py -i ~/.config/chromium/Default/
hindsight.log Snippet The log doesn't give any additional information but I'm happy to provide it if necessary.
System Details
Additional context None. If you need more context let me know.