scrapy / scurl

Performance-focused replacement for Python urllib
Apache License 2.0

Canonicalize using GURL #23

Closed malloxpb closed 6 years ago

malloxpb commented 6 years ago

This PR aims to use GURL in canonicalize_url.

malloxpb commented 6 years ago

Hey @lopuhin , after checking GURL's URL canonicalization, I noticed some details:

I think we should not use the GURL container right away, but instead use Cython to wrap the functions in GURL that we need, such as those for quoting the path, params, and fragment. What do you think?

lopuhin commented 6 years ago

> I think we should not use the GURL container right away, but instead use Cython to wrap the functions in GURL that we need, such as those for quoting the path, params, and fragment. What do you think?

I think that's a good idea! Regarding the issues you noticed, note that two implementations can differ as long as they are consistent; e.g. for the last issue, if GURL always lowercases all paths, then I think it's fine. But having an implementation that works the same way would be better, as it would simplify testing and making sure it's correct.
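The split proposed above — one wrapper per component rather than the whole GURL container — can be sketched in plain Python. The `gurl_*` names here are hypothetical stand-ins for the eventual Cython wrappers; urllib's quote fills in for GURL purely for illustration:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def gurl_canonicalize_path(path):
    # stand-in for a Cython wrapper around GURL's path canonicalizer
    return quote(path, safe='/%')

def gurl_canonicalize_query(query):
    # stand-in for a Cython wrapper around GURL's query canonicalizer
    return quote(query, safe='=&%')

def canonicalize_url(url):
    # compose the per-component wrappers instead of round-tripping
    # the whole URL through one opaque container
    scheme, netloc, path, query, fragment = urlsplit(url)
    return urlunsplit((scheme, netloc,
                       gurl_canonicalize_path(path),
                       gurl_canonicalize_query(query),
                       fragment))
```

This keeps each component's quoting rules independently swappable and testable.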

malloxpb commented 6 years ago

With canonicalize_url from master, the speed test result is:

Total number of items extracted = 32799
Time spent on canonicalize_url = 1.8872314759064466
Total time taken = 1.8872314759064466
Rate of link extraction : 17379.426116367882 items/second

Using only GURL's canonicalize_path shows improved performance:

Total number of items extracted = 32799
Time spent on canonicalize_url = 1.245034302934073
Total time taken = 1.245034302934073
Rate of link extraction : 26343.852472743292 items/second

The only failing cases are the ones mentioned above. Other than that, there's one more test case that failed: GURL gives http://www.example.com/résumé?q=r%C3%A9sum%C3%A9 while it should be http://www.example.com/r%C3%A9sum%C3%A9?q=r%C3%A9sum%C3%A9 (because the canonicalized arg of the urlparse function is not set to True).
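The "should be" form of the path is just "résumé" percent-encoded as UTF-8, which urllib reproduces with its defaults:

```python
from urllib.parse import quote

# urllib's default (utf-8) quoting yields the expected escaped path
assert quote('résumé') == 'r%C3%A9sum%C3%A9'
```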

What do you think @lopuhin ? 😄

malloxpb commented 6 years ago

I just implemented canonicalize_query and canonicalize_fragment here, and the speed improved a bit further:

Total number of items extracted = 32799
Time spent on canonicalize_url = 0.9254395314492285
Total time taken = 0.9254395314492285
Rate of link extraction : 35441.53765361321 items/second
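A quick sanity check of the three benchmark printouts, assuming the rate is simply items extracted divided by total time taken:

```python
# each reported rate should equal items / total time
items = 32799
runs = [
    (1.8872314759064466, 17379.426116367882),  # w3lib canonicalize_url
    (1.245034302934073, 26343.852472743292),   # GURL canonicalize_path only
    (0.9254395314492285, 35441.53765361321),   # + canonicalize_query/fragment
]
for total_time, rate in runs:
    assert abs(items / total_time - rate) < 1e-3
```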

The extra failed test is because the canonicalized parameter of urlparse is not set to True in the test file. What do you think @lopuhin ?

malloxpb commented 6 years ago

Regarding the Cython profiling, I am still working on it 😄 I'm sorry for the delay!

malloxpb commented 6 years ago

Also, if the encoding input for canonicalize_url is latin1, the result is incorrect: GURL gives http://www.example.com/r%C3%A9sum%C3%A9?q=r%C3%A9sum%C3%A9 while it should be http://www.example.com/r%C3%A9sum%C3%A9?q=r%E9sum%E9. I will take a look at this later!
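The mismatch comes down to which byte encoding is applied before percent-escaping. With urllib, for example:

```python
from urllib.parse import quote

# 'é' is U+00E9: one byte in latin1, two bytes in utf-8,
# so the percent-escaped forms differ
assert quote('é', encoding='latin1') == '%E9'
assert quote('é', encoding='utf-8') == '%C3%A9'
```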

malloxpb commented 6 years ago

Hey @lopuhin , aside from the other failed tests, should I change the lowercase/uppercase failing tests to match the result from GURL? GURL's path canonicalization lowercases all the letters, and I don't think there's a way to put the chars back the way they were (canonicalize_path from GURL appends each canonicalized char to a new string).

lopuhin commented 6 years ago

> should I change the lowercase/uppercase failing tests to match the result from GURL?

If you mean the canonicalize_url tests, then yes, I think it's fine to change them. It would also be nice to document the differences compared to w3lib.

malloxpb commented 6 years ago

Should I also create a new README in place of the current one? I think it would be nice and clear to document whatever we need in that file first.

malloxpb commented 6 years ago

Hey @lopuhin , I noticed this failing test: GURL gives http://foo.com/AC%2FDC+rocks%3f/?yeah=1 while it should be http://foo.com/AC%2FDC+rocks%3f/?yeah=1

Do you think it is normal that GURL only lowercases %3f in the path?

lopuhin commented 6 years ago

> Do you think it is normal that GURL only lowercases %3f in the path?

No, this looks very strange. Btw, both URLs look the same to me.

malloxpb commented 6 years ago

Yeah, they look the same, but the difference is just that the f in %3f is lowercased (it should be %3F). If you open the link in Chrome, it gives the same result!
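If GURL leaves escape hex lowercased, one possible workaround (a sketch, not what this PR does) is to re-uppercase percent-escapes after canonicalization, without touching the case of any other character:

```python
import re

def upper_escapes(path):
    # uppercase only the hex digits of %-escapes;
    # ordinary letters keep their original case
    return re.sub(r'%[0-9a-fA-F]{2}', lambda m: m.group(0).upper(), path)

assert upper_escapes('/AC%2FDC+rocks%3f/') == '/AC%2FDC+rocks%3F/'
```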

malloxpb commented 6 years ago

Hey @lopuhin , after a long time trying to figure out how encoding works in GURL, I gather that GURL components handle different encodings (such as latin1) in their own ways. GURL canonicalizes the url http://www.example.com/résumé?q=résumé with encoding latin1 to http://www.example.com/r%E9sum%E9?q=r%EF%BF%BDsum%EF%BF%BD, while it should be http://www.example.com/r%C3%A9sum%C3%A9?q=r%E9sum%E9. "resume" in the path and "resume" in the query should be canonicalized the same way as in urllib (path and query both use quote() to escape special chars). Therefore, I decided to use GURL's canonicalize_path for url queries for now (it has pretty much the same functionality as quote() in urllib)! What do you think?

lopuhin commented 6 years ago

Sounds good! I think it's important to make this bit compatible with w3lib and respect passed encoding for query, while using utf8 for path.
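The rule described here — always utf8 for the path, the caller's encoding for the query — can be sketched with urllib. The helper name is hypothetical, not the PR's actual Cython code:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def canonicalize_with_encoding(url, encoding='utf-8'):
    # path is always percent-encoded as utf-8; the query honours
    # the encoding the caller passed (w3lib-compatible behaviour)
    scheme, netloc, path, query, fragment = urlsplit(url)
    path = quote(path.encode('utf-8'), safe='/%')
    query = quote(query.encode(encoding), safe='=&%')
    return urlunsplit((scheme, netloc, path, query, fragment))

assert canonicalize_with_encoding(
    'http://www.example.com/résumé?q=résumé', 'latin1'
) == 'http://www.example.com/r%C3%A9sum%C3%A9?q=r%E9sum%E9'
```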

> which pretty much has the same functionality as quote() in urllib

Are there any differences?

malloxpb commented 6 years ago

Hey @lopuhin , this is the source of canonicalize_path, while this is the source of urllib's quote().

The only difference I can spot is that quote() lets users provide a safe parameter of characters to leave unescaped; but since w3lib provides _safe_chars:

# https://github.com/scrapy/w3lib/blob/master/w3lib/url.py
import string

RFC3986_GEN_DELIMS = b':/?#[]@'
RFC3986_SUB_DELIMS = b"!$&'()*+,;="
RFC3986_RESERVED = RFC3986_GEN_DELIMS + RFC3986_SUB_DELIMS
RFC3986_UNRESERVED = (string.ascii_letters + string.digits + "-._~").encode('ascii')
EXTRA_SAFE_CHARS = b'|'  # see https://github.com/scrapy/w3lib/pull/25

_safe_chars = RFC3986_RESERVED + RFC3986_UNRESERVED + EXTRA_SAFE_CHARS + b'%'

as the safe-chars option for canonicalize_url, I ran tests to see how GURL canonicalizes paths containing these chars, and the results from the original w3lib canonicalize_url and from GURL are the same!