Closed malloxpb closed 6 years ago
Hey @lopuhin , after checking GURL's URL canonicalization, I noticed some details:
I think we should not use GURL container right away, but instead Cython wrap the functions in GURL that we need. Such as quoting path, params, fragment... What do you think?
> I think we should not use GURL container right away, but instead Cython wrap the functions in GURL that we need. Such as quoting path, params, fragment... What do you think?
I think that's a good idea! Regarding the issues you noticed, note that two implementations can differ as long as they are consistent, e.g. for the last issue, if GURL always lowers all paths, then I think it's fine. But having an implementation that works in the same way would be better, as it would simplify testing and making sure it's correct.
With canonicalize from master, the speed test result is:
```
Total number of items extracted = 32799
Time spent on canonicalize_url = 1.8872314759064466
Total time taken = 1.8872314759064466
Rate of link extraction : 17379.426116367882 items/second
```
Using only GURL's canonicalize_path shows improved performance:
```
Total number of items extracted = 32799
Time spent on canonicalize_url = 1.245034302934073
Total time taken = 1.245034302934073
Rate of link extraction : 26343.852472743292 items/second
```
The only error that failed the test is the case mentioned above. Other than that, there is one more failing test case: GURL gives http://www.example.com/résumé?q=r%C3%A9sum%C3%A9 while it should be http://www.example.com/r%C3%A9sum%C3%A9?q=r%C3%A9sum%C3%A9 (because the canonicalized argument of the urlparse function is not set to True).
What do you think @lopuhin ? 😄
I just implemented canonicalize_query and canonicalize_fragment here, and the speed result also improved a little bit:
```
Total number of items extracted = 32799
Time spent on canonicalize_url = 0.9254395314492285
Total time taken = 0.9254395314492285
Rate of link extraction : 35441.53765361321 items/second
```
The extra failed test is due to the canonicalized parameter of urlparse not being set to True in the test file. What do you think @lopuhin ?
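For reference, the core of what query canonicalization does can be sketched in plain Python (this is an illustrative stand-in for the actual Cython implementation, not its source — the function name and keep_blank_values default are assumptions):

```python
from urllib.parse import parse_qsl, urlencode

def canonicalize_query(query, keep_blank_values=True):
    # Sketch of w3lib-style query canonicalization: parse the query
    # string into key/value pairs, sort them, and re-encode.
    pairs = parse_qsl(query, keep_blank_values=keep_blank_values)
    pairs.sort()
    return urlencode(pairs)

# Parameters are sorted and re-quoted deterministically:
print(canonicalize_query("b=2&a=1&a=3"))  # a=1&a=3&b=2
```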
Regarding the cython profiling I am still working on it 😄 I'm sorry for the delay!
Also, if the encoding input for canonicalize_url is latin1, the result is incorrect. GURL gives http://www.example.com/r%C3%A9sum%C3%A9?q=r%C3%A9sum%C3%A9 while it should be http://www.example.com/r%C3%A9sum%C3%A9?q=r%E9sum%E9. I will take a look at this later!
Hey @lopuhin , aside from the other failed tests, should I change those lowercased/uppercased failed tests to match the result from GURL? Path canonicalization in GURL lowercases all the letters, and I don't think there's a way to restore the original case (canonicalize_path from GURL appends each canonicalized char to a new string).
> should I change those lowercased/uppercased failed tests to match the result from GURL?
If you mean canonicalize_url test then yes, I think it's fine to change them. Also would be nice to document the differences compared to w3lib.
Should I also create a new README in place of the current one? I think it would be nice and clear to document whatever we need in that file first.
Hey @lopuhin , I noticed this failed test: GURL gives http://foo.com/AC%2FDC+rocks%3f/?yeah=1 while it should be http://foo.com/AC%2FDC+rocks%3F/?yeah=1
Do you think it is normal that GURL only lowercases %3f in the path?
> Do you think it is normal that GURL only lowercases %3f in the path?
No, this looks very strange. Btw, both URLs look the same to me.
Yeah, they look almost the same, but the f in %3f is lowercased. If you open the link in Chrome, you get the same result!
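For what it's worth, RFC 3986 (section 6.2.2.1) says percent-encoded octets should be normalized to uppercase hex digits, so %3F is the canonical form. A small normalizer (the helper name here is hypothetical, not part of any of the libraries discussed) could paper over GURL's lowercase output:

```python
import re

def upper_percent_escapes(text):
    # RFC 3986 treats %3f and %3F as equivalent, but names uppercase
    # hex as the canonical form for percent-encoded octets.
    return re.sub(r'%[0-9a-fA-F]{2}', lambda m: m.group(0).upper(), text)

print(upper_percent_escapes("/AC%2FDC+rocks%3f/"))  # /AC%2FDC+rocks%3F/
```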
Hey @lopuhin , after a long time trying to figure out how encoding works in GURL, I suspect that GURL components handle different encodings (such as latin1) in different ways. With encoding latin1, GURL canonicalizes the url http://www.example.com/résumé?q=résumé to http://www.example.com/r%E9sum%E9?q=r%EF%BF%BDsum%EF%BF%BD, while it should be http://www.example.com/r%C3%A9sum%C3%A9?q=r%E9sum%E9. "resume" in the path and "resume" in the query should be canonicalized the same way as in urllib (path and query both use quote() to escape special chars). Therefore, I decided to use GURL's canonicalize_path for url queries for now (which pretty much has the same functionality as quote() in urllib)! What do you think?
Sounds good! I think it's important to make this bit compatible with w3lib and respect passed encoding for query, while using utf8 for path.
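That rule can be sketched with plain urllib.parse.quote (encode_parts is a hypothetical helper, standing in for the real canonicalization code): the path is always percent-encoded as UTF-8, while the query respects the caller's encoding.

```python
from urllib.parse import quote

def encode_parts(path, query_value, encoding='utf-8'):
    # Assumption based on the discussion above: path is always
    # percent-encoded as UTF-8; the query uses the passed encoding.
    return quote(path.encode('utf-8')), quote(query_value.encode(encoding))

path, q = encode_parts('/résumé', 'résumé', encoding='latin1')
print(path)  # /r%C3%A9sum%C3%A9
print(q)     # r%E9sum%E9
```

This reproduces the expected output from the latin1 example above: UTF-8 bytes in the path, latin1 bytes in the query.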
> which pretty much has the same functionality as quote() in urllib
Are there any differences?
Hey @lopuhin , this is the source of canonicalize_path while this is the source of urllib quote().
The only difference I can spot is that quote() lets users pass a safe option, but since w3lib provides _safe_chars:
```python
# https://github.com/scrapy/w3lib/blob/master/w3lib/url.py
RFC3986_GEN_DELIMS = b':/?#[]@'
RFC3986_SUB_DELIMS = b"!$&'()*+,;="
RFC3986_RESERVED = RFC3986_GEN_DELIMS + RFC3986_SUB_DELIMS
RFC3986_UNRESERVED = (string.ascii_letters + string.digits + "-._~").encode('ascii')
EXTRA_SAFE_CHARS = b'|'  # see https://github.com/scrapy/w3lib/pull/25
_safe_chars = RFC3986_RESERVED + RFC3986_UNRESERVED + EXTRA_SAFE_CHARS + b'%'
```
as the safe option for canonicalize_url, I ran the test to see how GURL canonicalizes paths with these chars, and the results from the original w3lib canonicalize_url and from GURL are the same!
This PR aims to use GURL in canonicalize_url