Open jayvdb opened 4 years ago
If it is possible to let the user choose, then we can do it. On the other hand, if your solution supports caching of negative responses and the current one does not, then I would go with your dns_cache.
Could you improve the logging as you suggested, please? It would be helpful for users when they are debugging their code.
Ok, I'll get a PR underway today.
https://github.com/lipoja/URLExtract/pull/65 is a first cut of showing that negative hits are cached.
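For readers unfamiliar with the idea, here is a minimal sketch of negative caching around socket.gethostbyname. It is not how dns-cache or the PR implements it; the class name NegativeCachingResolver, the ttl parameter, and the lookups counter are assumptions made up for illustration.

```python
import socket
import time


class NegativeCachingResolver:
    """Hypothetical helper: remembers failed lookups so they are not retried."""

    def __init__(self, lookup=socket.gethostbyname, ttl=300.0):
        self._lookup = lookup      # injectable so it can be tested offline
        self._ttl = ttl            # seconds a failure is remembered (assumed value)
        self._failures = {}        # host -> (expiry time, gaierror errno)
        self.lookups = 0           # number of real lookups performed

    def resolve(self, host):
        """Return an IP address string, or None if the host does not resolve."""
        cached = self._failures.get(host)
        if cached is not None:
            expires, _errno = cached
            if time.monotonic() < expires:
                return None        # negative hit: skip the DNS query entirely
            del self._failures[host]

        self.lookups += 1
        try:
            return self._lookup(host)
        except socket.gaierror as exc:
            # Remember the failure so repeated names cost one query, not many.
            self._failures[host] = (time.monotonic() + self._ttl, exc.errno)
            return None
```

With a cache like this, a name that fails repeatedly on one page would trigger a single real lookup until the entry expires.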
btw, dns-cache was built for https://github.com/jayvdb/pypidb , where I am also using urlextract ; the test suite is processing a huge dataset, and exposes quite a lot of potential improvements with urlextract.
Yeah, I've already checked that out. Interesting project. I am glad that this small library could be part of it :) Let's file new issues for all improvements.
I've started that with https://github.com/lipoja/URLExtract/issues/68 , but those are less about the DNS aspects. To see the DNS issues, it would be helpful to add an optional mechanism for URLExtract to keep a list of rejected URLs/domains. I could then easily review those in my test suite and highlight any which could be handled earlier in URLExtract to reduce the DNS hits.
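As a rough illustration of what such a list could look like from the outside, the sketch below collects rejected hosts by listening to the log output rather than by changing URLExtract. It assumes the logger is named "urlextract" and that the message format matches the log lines quoted in this thread; RejectedHostCollector is a hypothetical name, not part of URLExtract.

```python
import logging
import re
from collections import Counter

# Matches messages like:
#   Unknown exception during gethostbyname(menu.py) gaierror(-2)
_HOST_RE = re.compile(r"gethostbyname\((?P<host>.*)\) gaierror\((?P<code>-?\d+)\)")


class RejectedHostCollector(logging.Handler):
    """Hypothetical handler that tallies hosts which failed DNS resolution."""

    def __init__(self):
        super().__init__(level=logging.INFO)
        self.rejected = Counter()  # (host, gaierror code) -> occurrences

    def emit(self, record):
        match = _HOST_RE.search(record.getMessage())
        if match:
            key = (match.group("host"), int(match.group("code")))
            self.rejected[key] += 1


collector = RejectedHostCollector()
urlextract_logger = logging.getLogger("urlextract")  # assumed logger name
urlextract_logger.addHandler(collector)
# INFO records are dropped unless the logger's effective level allows them.
urlextract_logger.setLevel(logging.INFO)
```

After a test run, collector.rejected.most_common() would list the most frequently rejected names first, which makes repeated offenders easy to spot.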
Currently the best way to do that is to cause a test class to fail all packages and review the logs. The test runner will stop after 50 failures - edit tox.ini to see more.
e.g. https://pypi.org/project/Genshi/
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(genshi-0.7.zip) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(genshi-0.6.1.zip) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(genshi-0.6.zip) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(genshi-0.5.1.zip) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(genshi-0.5.zip) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(genshi-0.4.4.zip) gaierror(-2)
blurb:
DEBUG pypidb._pypi:_pypi.py:313 processing Webpage: https://bugs.python.org/
INFO urlextract:urlextract_core.py:518 Invalid host 'http://3.8.]'. If the host is valid report a bug.
kaitaistruct
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(gif.java) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(gif.py) gaierror(-2)
libpysal
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(libpysal.cg) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(libpysal.io) gaierror(-2)
pyxdg
DEBUG pypidb._pypi:_pypi.py:313 processing Webpage: http://freedesktop.org/wiki/Software/pyxdg
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(inifile.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(desktopentry.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(mimetype.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(desktopentry.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(desktopentry.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(desktopentry.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(mime.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menueditor.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(applications.menu) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(icontheme.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menueditor.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menueditor.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(config.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menueditor.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(inifile.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(desktopentry.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(basedirectory.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(icontheme.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(config.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(config.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(locale.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(mime.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(recentfiles.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(menu.py) gaierror(-2)
This is extremely common when processing webpages. (mwlib.ext is where I am seeing it now)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(.google-analytics.com) gaierror(-11)
mwlib.ext
DEBUG pypidb._pypi:_pypi.py:313 processing Webpage: http://www.mediawiki.org/wiki/Extension:Collection
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(ext.uls.pt) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(ext.gadget.site) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(p-mediawiki.org) gaierror(-2)
...
DEBUG pypidb._pypi:_pypi.py:313 processing Webpage: http://blog.pediapress.com/
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(service.post) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(body.mobile) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(body.mobile) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(wikipediabooks.org) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(wikipediabooks.org) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(x22service.post) gaierror(-2)
...
py-trello
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(mywebhookurl.com) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(subdomain.*.trello.com) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(subdomain.*.trello.com) gaierror(-2)
msgpack-python: See https://github.com/lipoja/URLExtract/issues/69#issuecomment-609386473
config: e.target is really common
DEBUG pypidb._pypi:_pypi.py:313 processing Webpage: http://docs.red-dove.com/cfg/python.html
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(e.target) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(e.target) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(e.target) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(e.target) gaierror(-2)
...
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(settings.py) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(django.security) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(django.security) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(django.security) gaierror(-2)
...
DEBUG pypidb._pypi:_pypi.py:313 processing Webpage: https://play.google.com/store/apps/details?id=com.google.android.apps.authenticator2&
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.call) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(p.click) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(co.id) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(co.id) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(m.id) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(k.id) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.re) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.re) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.call) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(array.prototype.map) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(d.name) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.data) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(window.google&&window.google.sn) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(window.google.sn) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.tc) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.tc) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(c.next) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(g.next) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(d.next.next) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(c.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(b.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(b.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.bottom-this.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.right-a.left,a.bottom-a.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(g.target) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.ga) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(gbar.mls) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(gbar.bv) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(gbar.kn) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(gbar.sb) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(cp.me) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(cp.ml) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(up.sl) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(c.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(c.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.bottom-a.top) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.o.id) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(d.target) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(d.id) gaierror(-2)
INFO urlextract:urlextract_core.py:518 Invalid host 'http://.o.style.width=b.items[c]'. If the host is valid report a bug.
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(this.gb) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.qa) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(a.lb) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(person.photo) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(gbar.si) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(silk.s.sis.ca) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.teamviewer.quicksupport.market) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.teamviewer.quicksupport.market) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.teamviewer.quicksupport.market) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.teamviewer.quicksupport.market) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(u003dcom.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(u003dcom.google.android.play.games) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.play.games) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(u003dcom.teamviewer.quicksupport.market) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.teamviewer.quicksupport.market) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(u003dcom.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.apps.youtube.mango) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(u003dcom.google.android.play.games) gaierror(-2)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(com.google.android.play.games) gaierror(-2)
geoip
DEBUG pypidb._pypi:_pypi.py:313 processing Webpage: https://support.maxmind.com/geoip-data-correction-request/
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
...
DEBUG pypidb._pypi:_pypi.py:313 processing Webpage: https://support.maxmind.com/
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
INFO urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(26vddl1ry78464rf7e94z1ee-wpengine.netdna-ssl.com) gaierror(-11)
I've created https://github.com/jayvdb/dns-cache which caches negative responses, which is quite helpful when using the recently added DNS checking in URLExtract.
Should I add dns_cache to dns_cache_install? Or just mention it in the README for users who want more control?
Also, there is a fairly serious problem with the dnspython "socket" resolver on Windows during negative responses: https://github.com/rthalley/dnspython/issues/416
However, the AttributeError caused there should be caught at https://github.com/lipoja/URLExtract/blob/1eb9ad5/urlextract/urlextract_core.py#L564 , so the logging there is the only bit which can be improved.
We can also improve the logging by catching socket.gaierror and giving it a better log entry.
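A minimal sketch of that last idea, assuming a standalone helper rather than URLExtract's actual code: host_resolves and the logger name "urlextract_sketch" are hypothetical, and the lookup parameter exists only so the function can be exercised without network access.

```python
import logging
import socket

logger = logging.getLogger("urlextract_sketch")  # hypothetical logger name


def host_resolves(host, lookup=socket.gethostbyname):
    """Return True if host resolves, logging failures with a specific message."""
    try:
        lookup(host)
        return True
    except socket.gaierror as exc:
        # gaierror carries the resolver's error code and message, so the log
        # entry can say what went wrong instead of "Unknown exception".
        logger.info(
            "DNS lookup failed for %r: %s (gaierror %s)",
            host, exc.strerror, exc.errno,
        )
        return False
    except Exception as exc:
        # Anything else (e.g. the AttributeError mentioned above) keeps the
        # generic entry so unexpected failures stay visible.
        logger.info("Unknown exception during gethostbyname(%s) %r", host, exc)
        return False
```

Catching socket.gaierror before the generic handler keeps the existing catch-all behaviour while making the common case easier to read in the logs.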