rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/
Other
520 stars 105 forks

Exception when processing href tel: #99

Closed serbathome closed 2 years ago

serbathome commented 2 years ago

Traceback (most recent call last): File "/Users/serb/Documents/sources/sitecopy.py", line 3, in save_website( File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/init.py", line 164, in save_website crawler.save_complete(pop=open_in_browser) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/core.py", line 218, in save_complete self.scheduler.handle_resource(self) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File 
"/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File 
"/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File 
"/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File 
"/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File 
"/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File 
"/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 156, in handle_resource return self._handle_resource(resource) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 191, in _handle_resource resource.retrieve() File 
"/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 368, in retrieve return self._retrieve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 456, in _retrieve context = self.extract_children(self.parse()) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 439, in extract_children self.scheduler.handle_resource(ans) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/schedulers.py", line 152, in handle_resource self.index.add_entry(resource.context.url, resource.filepath) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/helpers.py", line 231, in get value = self.func(obj) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/elements.py", line 196, in filepath return self.context.resolve() File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/urls.py", line 723, in resolve return url2path( File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/helpers.py", line 148, in call return self._cache_wrapper(None, *args, *kwargs) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/helpers.py", line 180, in _cache_wrapper caller, args, **kwargs) if caller is not None else self._input_func( File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/urls.py", line 598, in url2path dirname, basename = _url2path( File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/urls.py", line 513, in _url2path base, stem, ext = _filter_and_group_segments( File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/urls.py", line 481, in _filter_and_group_segments scheme, auth, host, port, path, query, fragment = parse_url(unquote(url)) File "/opt/homebrew/lib/python3.9/site-packages/pywebcopy/urls.py", line 235, in parse_url raise LocationParseError(url) pywebcopy.urls.LocationParseError: tel:+74991108328

rajatomar788 commented 2 years ago

This error basically means that a telephone number appears where a URL is expected. Does the error halt the saving process, or does the process continue after it?
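
To illustrate the diagnosis, here is a minimal sketch using only the standard library's `urllib.parse.urlsplit` (not pywebcopy's own `parse_url`, whose internals are not shown in this thread). The helper name `describe_href` is hypothetical. It shows that a `tel:` link carries a scheme but no host, so there is nothing to download and no natural file path to map it to.

```python
from urllib.parse import urlsplit


def describe_href(href):
    """Return (scheme, host, path) for a link target, using the stdlib parser."""
    parts = urlsplit(href)
    return parts.scheme, parts.netloc, parts.path


# A telephone link: it has a scheme, but the host is empty.
print(describe_href("tel:+74991108328"))

# An ordinary web link: scheme, host, and path are all present.
print(describe_href("http://google.com/mail/"))
```

A crawler could use the empty host together with a non-HTTP scheme as a signal that the link is not a resource to save.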

serbathome commented 2 years ago

Yes, the app raises an unhandled exception and stops. A telephone link is a valid element. As a quick fix, I changed parse_url in urls.py to simply ignore such links, as shown below. But obviously it should be handled properly.

def parse_url(url):
    """
    Given a url, return a parsed :class:`.Url` namedtuple. Best-effort is
    performed to parse incomplete urls. Fields not provided will be None.

    Partly backwards-compatible with :mod:`urlparse`.

    Example::

        >>> parse_url('http://google.com/mail/')
        Url(scheme='http', host='google.com', port=None, path='/mail/', ...)
        >>> parse_url('google.com:80')
        Url(scheme=None, host='google.com', port=80, path=None, ...)
        >>> parse_url('/foo?bar')
        Url(scheme=None, host=None, port=None, path='/foo', query='bar', ...)
    """

    # While this code has overlap with stdlib's urlparse, it is much
    # simplified for our needs and less annoying.
    # Additionally, this implementation does silly things to be optimal
    # on CPython.

    if not url:
        # Empty
        return Url()

    scheme = None
    auth = None
    host = None
    port = None
    path = None
    fragment = None
    query = None

    # workaround for ignoring <a href="tel:+17035713343"/>
    if 'tel:' in url:
        return Url(scheme, auth, host, port, path, query, fragment)

rajatomar788 commented 2 years ago

Well, yes, it will be fixed in a subsequent release. The fix you are using is not really appropriate, because the substring check has to search through every URL, which takes time. But if it works for the moment, then please open a PR.
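
One way to address the concern about searching through every URL is to test only the start of the link instead of scanning it in full. The following is a minimal sketch under that assumption, not the fix that was eventually merged. The names `SKIPPED_PREFIXES` and `should_skip` are hypothetical, and the list of schemes is an illustrative guess at links a site copier would not download.

```python
# Hypothetical list of link prefixes that do not point at a downloadable file.
SKIPPED_PREFIXES = ("tel:", "mailto:", "sms:", "javascript:")


def should_skip(href):
    """Return True when the link starts with one of the skipped schemes.

    Only a short slice at the start of the link is lowercased and compared,
    so the cost of the check does not grow with the length of the URL.
    """
    head = href.lstrip()[:16].lower()
    return head.startswith(SKIPPED_PREFIXES)


print(should_skip("tel:+74991108328"))
print(should_skip("http://example.com/?q=tel:123"))
```

Unlike `'tel:' in url`, a prefix check does not match an ordinary URL that merely contains the text `tel:` somewhere in its path or query string.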

serbathome commented 2 years ago

PR accepted, closing.