scrapy / scurl

Performance-focused replacement for Python urllib
Apache License 2.0
21 stars 6 forks source link

Additional components to implement #18

Open malloxpb opened 6 years ago

malloxpb commented 6 years ago

Here are some components that can be implemented to further enhance the performance of canonicalize_url func:

malloxpb commented 6 years ago

NOTE: Python uses quote to quote the path (replace special characters with %xx), but GURL quote the path directly when parsing the url

>>> URL('http://google.com/ /'.encode()).getAll()
result(scheme=b'http', username=b'', password=b'', host=b'google.com', port=b'', path=b'/%20/', query=b'', ref=b'')

while urlsplit from stdlib does:

>>> urlsplit('http://google.com/ /')
SplitResult(scheme='http', netloc='google.com', path='/ /', query='', fragment='')