lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from given text based on locating TLDs.
MIT License

Maximum results #69

Closed jayvdb closed 4 years ago

jayvdb commented 4 years ago

Parsing https://msgpack.org/ is very expensive, especially when DNS checking is enabled.

One way to avoid a DoS is to limit the number of results. Callers could do this themselves by using gen_urls, manually building a result dict, and stopping at a limit they define.
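A minimal sketch of that caller-side workaround, assuming the extractor yields URLs lazily. `take_urls` and `fake_gen_urls` are hypothetical names used only for illustration; `fake_gen_urls` is a stand-in so the snippet runs without the library installed, and it collects into a list rather than a dict for brevity.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def take_urls(urls: Iterable[str], limit: int) -> List[str]:
    """Collect at most `limit` items from a lazy URL iterator.

    islice stops pulling from the iterator once the limit is reached,
    so no further (possibly expensive) work is done for later items.
    """
    return list(islice(urls, limit))


def fake_gen_urls(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a lazy URL generator.

    Yields every whitespace-separated token that contains a dot.
    """
    for token in text.split():
        if "." in token:
            yield token


found = take_urls(fake_gen_urls("see a.com b.org c.net d.io"), limit=3)
print(found)
```

With the real library, the same helper could presumably be applied as `take_urls(URLExtract().gen_urls(text), limit=100)`.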

However, the default operation should behave sanely. I suggest giving find_urls a high limit; once it is reached, either an error is logged and the results collected so far are returned, or an exception is raised.

IMO the limit should be provided when instantiating URLExtract, and the caller can pass limit=False to disable it. The limit could be even higher than needed for https://msgpack.org/, maybe even 10,000. The important point is that building a limit into the library ensures that users also think about the potential for DoS.
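To make the proposed semantics concrete, here is a rough sketch of a wrapper with a constructor-level limit. Everything in it is hypothetical (`LimitedExtractor`, `LimitExceeded`, `raise_on_limit`, `split_tokens`); it is not URLExtract's actual API, only one way the suggestion could look.

```python
import logging
from typing import Callable, Iterable, List, Union

logger = logging.getLogger(__name__)


class LimitExceeded(Exception):
    """Hypothetical error raised when more than `limit` results are found."""


class LimitedExtractor:
    """Hypothetical wrapper around any lazy URL generator function."""

    def __init__(
        self,
        gen_urls: Callable[[str], Iterable[str]],
        limit: Union[int, bool] = 10000,  # limit=False disables the cap
        raise_on_limit: bool = False,
    ) -> None:
        self._gen_urls = gen_urls
        self._limit = limit
        self._raise_on_limit = raise_on_limit

    def find_urls(self, text: str) -> List[str]:
        results: List[str] = []
        for url in self._gen_urls(text):
            # Only trips when a result beyond the limit actually shows up.
            if self._limit is not False and len(results) >= self._limit:
                if self._raise_on_limit:
                    raise LimitExceeded(f"more than {self._limit} URLs found")
                logger.error("URL limit of %s reached; truncating", self._limit)
                break
            results.append(url)
        return results


def split_tokens(text: str) -> Iterable[str]:
    """Hypothetical stand-in generator: every token counts as a URL."""
    return iter(text.split())


capped = LimitedExtractor(split_tokens, limit=2).find_urls("a.com b.org c.net")
uncapped = LimitedExtractor(split_tokens, limit=False).find_urls("a.com b.org c.net")
```

Putting the limit on the constructor means every call to `find_urls` is capped by default, and disabling it has to be an explicit choice.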

jayvdb commented 4 years ago

e.g.

DEBUG    pypidb._pypi:_pypi.py:313 processing Webpage: http://msgpack.org/
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(xml.name) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(xml.name) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(xml.name) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(xml.name) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(quickstart-c.md) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(quickstart-cpp.md) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(howtoinstall.md) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(time.now) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(umsgpack.py) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(umsgpack.py) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(ext.data) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(in.mp) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(out.mp) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(clojure.java.io) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(mlmsgpack.mlb) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(mlmsgpack.cm) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(tutorial.md) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(msgpack.so) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(array.map) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(key.image.name) gaierror(-11)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(key.image.data) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(key.image.name) gaierror(-11)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(key.image.name) gaierror(-11)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(key.image.data) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(msgpack.marsworks.ru) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(mp.marsw.ru) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(msgpack.marsworks.ru) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(homepage:www.diocp.org) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(p.name) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(p.name) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(homepage-example.mp) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(xx.name) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(file.mp) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(file.mp) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(opcache.so) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(opcache.so) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(msgpack.so) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(opcache.so) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(msgpack.cr) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(mpc.zip) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(api.example.com) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(compatibility.md) gaierror(-2)
INFO     urlextract:urlextract_core.py:567 Unknown exception during gethostbyname(notepack.io) gaierror(-2)

lipoja commented 4 years ago

This limit has been implemented and should be in the latest release.