internetarchive / warcprox

WARC writing MITM HTTP/S proxy

Improve target url validation #130

Closed vbanos closed 5 years ago

vbanos commented 5 years ago

In addition to checking for scheme='http', we should also check that netloc is non-empty. Many meaningless URLs pass the current check. For instance:

In [5]: urlparse("http://")
Out[5]: ParseResult(scheme='http', netloc='', path='', params='', query='', fragment='')

In [6]: urlparse("http:///")
Out[6]: ParseResult(scheme='http', netloc='', path='/', params='', query='', fragment='')

netloc should always have a value.
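
A minimal sketch of the stricter check proposed here, assuming the target URL is parsed with `urllib.parse.urlparse` as in the session above. The helper name `is_valid_target_url` is hypothetical and is not taken from the warcprox codebase; the real check may live elsewhere and accept other schemes.

```python
from urllib.parse import urlparse


def is_valid_target_url(url):
    """Hypothetical helper: accept only http URLs that name a host.

    This is an illustration of the proposal, not warcprox's actual code.
    """
    parsed = urlparse(url)
    # Existing condition described in the issue: the scheme must be 'http'.
    # Proposed extra condition: netloc must not be empty.
    return parsed.scheme == 'http' and bool(parsed.netloc)
```

With this check, both `"http://"` and `"http:///"` would be rejected because their `netloc` is empty, while a URL such as `"http://example.com/"` would still pass.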

vbanos commented 5 years ago

I've seen cases like this while running warcprox in production.