Open kmike opened 8 years ago
Hello,
Concerning URL parsing, why not consider a URI abstraction directly? I imagine a URI class that would encapsulate all the logic for URI/URL encoding/decoding (e.g. str <--> unicode, IDNA), quoting/unquoting the path and query, handling "invalid" relative URLs (#1304), and all the other quirks necessary to successfully parse a URI/URL.
I would be happy to come up with a mockup in the next few days if need be.
EDIT: Inspired by URI.js
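For illustration, a rough sketch of what such a class could look like (the class and method names here are hypothetical, not an existing API, and the stdlib is used only to keep the example short):

```python
# A rough, hypothetical sketch of the proposed URI class; the names are
# illustrative only, and the stdlib does the real work in this example.
from urllib.parse import urlsplit, urlunsplit, urljoin, quote, unquote

class URI:
    """One object that owns parsing, joining and re-encoding of a URL."""

    def __init__(self, url):
        self._parts = urlsplit(url)   # parse once, keep the components around

    @property
    def scheme(self):
        return self._parts.scheme

    @property
    def host(self):
        return self._parts.hostname or ""

    def join(self, reference):
        """Resolve a possibly-relative reference against this URI (cf. #1304)."""
        return URI(urljoin(urlunsplit(self._parts), reference))

    def quoted(self):
        """Return the URL with an ASCII-safe (percent-encoded) path."""
        p = self._parts
        return urlunsplit((p.scheme, p.netloc,
                           quote(unquote(p.path), safe="/"),
                           p.query, p.fragment))

    def __str__(self):
        return urlunsplit(self._parts)


u = URI("http://www.example.com/a path/page.html")
print(u.quoted())            # http://www.example.com/a%20path/page.html
print(u.join("other.html"))  # http://www.example.com/a path/other.html
```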
Hey @Djayb6,
A new Python package for URL/URI parsing sounds good to me. I like the idea of using an actual browser implementation under the hood for this Python package (copy-paste C/C++/Rust? code, create a Cython/CFFI wrapper) instead of starting from scratch. Browsers already handle all the quirks, and their parsing routines are fast.
A chromium/mozilla wrapper with a purl-like API would be a godsend. I already use purl wherever I can, since it makes for readable code that is easy to teach.
Hey @kmike, there is a Yandex internal URL parsing routine: https://github.com/yandex/balancer/tree/master/util/uri It is used and tested in their content system infrastructure. It depends on Ragel (https://en.wikipedia.org/wiki/Ragel) for generating the URL parser and encoder. Ragel is under the GPL, but it is only needed during the build phase.
I haven't looked at it in detail, but the Yandex URL parser looks good - it is nicely separated from the rest of the code in the balancer repo. Ragel licensing shouldn't be an issue.
I analyzed some suitable candidates for wrapping. Below are the times, in seconds, for 10^6 calls to the parsing methods of some URL parser implementations, on a laptop i5 processor.
The URL used is http://www.example.com.
I haven't tested the Mozilla URL parser, but I expect its performance to be similar to GURL's, since both are based on nsURLParsers.cpp.
There is little scope for optimizing the wrapper; it is already as thin as possible. The speed gain of the wrapper over urlparse for other operations ranges between 1x and 2x (I haven't implemented all of them yet).
Issues such as #1304 would be solved by using the wrapper, but we don't see substantial speed gains. Some functions, such as canonicalize_url, can be sped up by implementing them with the wrapper, but this still won't address the bottleneck, because urlsplit, urlparse and urljoin are the foremost bottlenecks (according to my profiling results).
What are your thoughts?
Thank you.
@Preetwinder nice review! Do you have the benchmarking code? Note that Python urlparse uses a cache by default, so if you parse the same URL multiple times it won't actually be parsed again and again. It'd be nice to benchmark parsing of different URLs with paths, query arguments and fragments; http://www.example.com is too easy.
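To illustrate the caching point (a small sketch; Python 3 spelling with urllib.parse, while Python 2's urlparse module caches the same way):

```python
# urlsplit()/urlparse() memoize their results, so re-parsing the same URL
# mostly measures a cache lookup rather than real parsing work. Clearing the
# cache (or benchmarking many distinct URLs) gives a fairer number.
from urllib.parse import urlsplit, clear_cache
from time import process_time

url = "http://www.example.com/path?q=1"

t = process_time()
for _ in range(10**6):
    urlsplit(url)          # served from the internal cache after the first call
print("cached:  ", process_time() - t)

t = process_time()
for _ in range(10**6):
    clear_cache()          # force a real parse on every iteration
    urlsplit(url)
print("uncached:", process_time() - t)
```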
Will this PR discussion also target the link encoding bugs?
@Digenis I don't know, there is no PR yet; I opened this issue for discussion, to get more feedback and highlight a problem.
@Preetwinder This list can be useful for your test: https://github.com/sibiryakov/general-spider/blob/master/seeds_es_dmoz.txt. The problem with your test is that the code you're running will benefit from the CPU cache much more than a production spider would.
The benchmark results in my previous post were not very useful, since they used a single URL, as pointed out by @kmike and @sibiryakov. The test data used in the new results consists of two URL lists (referred to as Test 1 and Test 2 in the tables below).
The code used is -
from time import time

with open("urltesting/urls", "r") as f:
    t = time()
    for line in f:
        pass  # tested code goes here
    print time() - t
The results for some operations are below; the times are averaged over 10 runs.
| Tested Code | Time for Test 1 (seconds) | Time for Test 2 (seconds) |
|---|---|---|
| `urlparse(line)` | 0.380 | 0.103 |
| `urlsplit(line)` | 0.301 | 0.079 |
| `urljoin(line, "something.html")` | 0.760 | 0.196 |
The following results are for the Cython wrapper. `URL` is the class; its constructor performs the parsing, and the components are accessible through individual getter functions. The `getAll` method returns an 8-element namedtuple of the components. The `Resolve` method is similar to `urljoin` (note: `Resolve` is currently imperfectly implemented and can be sped up). A short usage sketch of this wrapper API follows the table below.
| Tested Code | Time for Test 1 (seconds) | Time for Test 2 (seconds) |
|---|---|---|
| `URL(line)` | 0.110 | 0.028 |
| `a = URL(line)`; `a.getAll()` | 0.197 | 0.055 |
| `a = URL(line)`; `a.Resolve("something.html")` | 0.243 | 0.065 |
| `a = URL(line)`; `scheme = a.scheme()`; `username = a.username()`; other components | 0.129 – 0.198, depending on the number of components accessed | 0.032 – 0.053, depending on the number of components accessed |
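As referenced above, the wrapper API would be used roughly like this (a hypothetical sketch; the `pygurl` module name and exact signatures are assumed, only the `URL`, `scheme`, `getAll` and `Resolve` names come from the description):

```python
# Hypothetical usage of the Cython GURL wrapper described above; module name
# and signatures are assumed, not a published API.
from pygurl import URL

u = URL("http://www.example.com/a/b?q=1#frag")  # the constructor parses the URL
parts = u.getAll()                    # 8-element namedtuple of the components
scheme = u.scheme()                   # per-component getter functions
joined = u.Resolve("something.html")  # roughly equivalent to urljoin()
print(parts, scheme, joined)
```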
The `GURL` constructor performs some canonicalization, some filesystem-path-specific changes, and some other checks and modifications by default. An alternative to using `GURL` here is to call the `ParseStandardURL` function directly, which behaves very similarly to `urlparse`. The timing for `ParseStandardURL` is 0.128 seconds (for Test 1) if we return an 8-element namedtuple (slightly lower for some other data structures).
Also, directly returning the URL string from `URL.Resolve` (instead of a URL object) reduces its time to about 0.165 seconds (for Test 1).
Wrapping a custom functional interface to the Chromium URL parser functions, instead of using the GURL interface, would likely yield small speed improvements for other operations as well.
With the results seen so far (3-4x for general operations and possibly more for operations like canonicalize_url), do you think the speed improvements would be worth the switch?
Thank you.
@Digenis, could you please provide links to the issues for the bugs you mention? Thank you.
@Preetwinder,
See https://github.com/scrapy/scrapy/issues/1403#issuecomment-149194449
> ... domain should be encoded using IDNA, path should be encoded to UTF8 and then escaped to ASCII, query should be encoded to response encoding and then escaped to ASCII - this is how browsers work.
I asked here because I think the whole solution is not meant to be put in the link extractor.
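For illustration, the rule quoted above corresponds roughly to the following (a minimal sketch; `browser_style_encode` is a made-up helper, the page encoding is assumed to be known, and the stdlib `idna` codec used here only implements IDNA 2003):

```python
# Sketch of the browser-style link encoding rule quoted above; not Scrapy code.
from urllib.parse import urlsplit, urlunsplit, quote

def browser_style_encode(url, page_encoding="cp1251"):
    parts = urlsplit(url)
    # domain: IDNA-encoded (stdlib codec, i.e. IDNA 2003)
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host + (":%d" % parts.port if parts.port else "")
    # path: UTF-8, then percent-escaped to ASCII
    path = quote(parts.path.encode("utf8"), safe="/%")
    # query: page (response) encoding, then percent-escaped to ASCII
    query = quote(parts.query.encode(page_encoding, "replace"), safe="&=%")
    return urlunsplit((parts.scheme, netloc, path, query, parts.fragment))

print(browser_style_encode("http://пример.испытание/путь?q=значение"))
```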
@Preetwinder nice job, really! I really do value your effort! Here are the problems in your testing code:

- use clock(), it's meant for benchmarking, and test it on linux (not windows),
- prevent influence from file system access delays, it's more reliable to read the file contents in memory, and parse them from memory,
- try if disabling GC affects running times.

Is it possible to use the GURL class directly? Memory allocations for the wrapper class could cost something.
Thanks for your work.
Thank you for the feedback.
> use clock(), it's meant for benchmarking, and test it on linux (not windows),

I am using Linux. I have changed the timing function to clock() in the testing code.
> prevent influence from file system access delays, it's more reliable to read the file contents in memory, and parse them from memory,

In the new testing code, the content of the file is read into memory using readlines().
> try if disabling GC affects running times.

I tried this, but it didn't make any noticeable difference.
Following is the new testing code -
from numpy import median, percentile, mean
from time import clock
import gc

gc.disable()

with open("tests/urls.txt") as f:
    urls = f.readlines()

times = [0] * len(urls)
x = 0
for url in urls:
    t = clock()
    # tested code
    times[x] = clock() - t
    x += 1

print sum(times)
print mean(times)
print median(times)
print percentile(times, 90)
The timings for the new code for the 80,000-URL test case are -
| Tested Code | Total (s) | Mean (s) | Median (s) | 90% (s) |
|---|---|---|---|---|
| `a = urlparse(line)` | 0.40 | 4.9e-06 | 5e-06 | 5.5e-06 |
| `a = urljoin(line, "something")` | 0.80 | 9.9e-06 | 9.5e-06 | 11e-06 |
| `a = URL(line)` | 0.12 | 1.45e-06 | 1.1e-06 | 2e-06 |
| `a = URL(line)`; `a.getAll()` | 0.20 | 2.2e-06 | 2.3e-06 | 3e-06 |
| `a = URL(line)`; `a.Resolve("something.html")` | 0.17 | 2.1e-06 | 2e-06 | 3e-06 |
| `a = ParseStandardURL` | 0.14 | 1.7e-06 | 2e-06 | 2e-06 |
The median and percentile values look discrete because of the limited precision of the clock function; I have averaged the median and percentile values over a few runs.
According to my research Cython wrappers are generally pretty fast, and there isn't much avoidable overhead. If you think other methods such as SWIG, ctypes, or manually writing an extension, would be faster, I can try them out. Thank you.
@Preetwinder Awesome! So we have about a 5x difference between native urlparse and the Chrome URL parser wrapped with Cython? 5 vs 1.1 µs per URL, which isn't bad I think. I can't find the code for the getAll() function, could you post the link?
I would like to get back to my question:

> Is it possible to use GURL class directly? Memory allocations for wrapper class could cost something.
Here https://github.com/Preetwinder/gurl-cython/blob/master/pygurl.pyx#L11 you allocate memory for every new URL wrapping object. Can we use GURL directly to avoid that? That would probably require ctypes or a manual extension?
What API and implementation differences do GURL and urlparse have? I'm trying to imagine possible problems with adopting GURL-based URL parsing in Scrapy.
Could you also try Rust/Yandex libraries?
Thanks for your time again.
Thank you for the feedback.
> So we have about 5x times of difference between native urlparse and Chrome URL parser wrapped with Cython?

Not for all cases. `URL` and `urlparse` are not directly comparable, since `URL` doesn't return anything; that's what `getAll` does. The speed difference between `urljoin` and `Resolve` is roughly 5x, but it is only about 2x between `urlparse` and `getAll`.
> I can't find code for getAll() function, could you post the link?

I hadn't updated the repo before, I have added it now, here.
> I would like to get back to my question

I misunderstood your question before. The allocation on the line you mentioned is of the GURL C++ class, not of a wrapping object; `_thisptr` holds the pointer to a C++ class instance, so the interaction with the C++ class is as direct as possible. Instead of using the GURL class, we can also use some of its constituent functions, like `ParseStandardURL`, which is a bit faster but doesn't perform some canonicalization and other checks.
> What API, implementation differences GURL and urlparse have? I'm trying to imagine possible problems of adoption of GURL-based URL parsing in Scrapy.

urlparse has `urlparse`, which returns a 6-tuple, `urlunparse`, which converts the tuple back to a string, and `urljoin`, which joins two URLs (strings).
The GURL class has a constructor initialized from a string; checker and getter methods for the individual URL components (`hasScheme`, `getScheme`, etc.); a `Resolve` method, which joins a string with the GURL object and returns a URL; and a `ReplaceComponent` method, which accepts the components to be changed and returns the URL with those components replaced (I still need to work on this).
> Could you also try Rust/Yandex libraries?

I couldn't find time for this yet because I have been working on another Scrapy issue. I will try them, and also test GURL with the original spider with which @kmike discovered the bottleneck, after my examination next week.
Thank you.
Hey @Preetwinder here is what I've got using the Yandex URL parser:
count 82257, avg 580.3806971 ns, median 507 ns, 90% 682 ns.
It's collected using the Chromium 80K test set (chosen by you). The major difference is that I wrote a C++ application, to get a feeling for what we lose in bindings.
So, as you see, it's about 3x faster. Though it's not correct to compare it with your numbers, because those were collected through bindings. Now I'm going to take your GURL library and test it with the same app.
count 82257, avg 794.1626245 ns, median 644 ns, 90% 1144 ns
for GURL. So, the Yandex one is 1.25x to 2x faster. Maybe this is connected with a more efficient memory allocation model.
Also, we can draw another conclusion: Python bindings probably cost us about the same number of CPU cycles as the parsing itself.
I've got an idea. Let's create a library supporting batch operations on URL parsing. For Scrapy it should be a common use case. Let me know what you think!
I had made a wrong conclusion earlier about the Yandex parser being 1000 times faster, and have updated the comment.
@sibiryakov , what do you mean by "batch operations on URL parsing"?
A batch of URLs as input, and the response is a vector of results.
Here is the testing code https://github.com/sibiryakov/balancer/blob/urlbench/tools/urlbench/main.cpp
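The batch idea amounts to an interface roughly like this (a sketch; `parse_many` is a hypothetical entry point and the stdlib stands in for the C++ parser):

```python
# Rough sketch of the proposed batch interface; parse_many is hypothetical.
# The point is that the Python<->C boundary would be crossed once per batch
# of URLs rather than once per URL, which is where the binding overhead goes.
from urllib.parse import urlsplit

def parse_many(urls):
    """Take a list of URL strings, return a list of parsed component tuples."""
    # In the real wrapper this loop would live in C++/Cython.
    return [tuple(urlsplit(u)) for u in urls]

results = parse_many([
    "http://www.example.com/a?q=1",
    "http://www.example.com/b#frag",
])
for scheme, netloc, path, query, fragment in results:
    print(netloc, path)
```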
Thanks for testing the Yandex parser. I tried a batched approach for the Chromium parser; the C++ code is here and the Cython code is here. The results are: total = 0.184, mean = 4.5e-05, median = 4.4e-05, 90% = 4.8e-05. This is for a batch size of 20; performance is slightly worse for smaller batch sizes and slightly better for larger ones. As can be seen, the improvement is marginal, about 5-10% (over GURL). We can improve this slightly with a few minor modifications, but nothing big. I believe such a small improvement is not worth moving to a batched approach. Even if we used the Yandex library (assuming similar wrapping overheads), we would only get a 3-4x improvement over urlparse. As we discussed earlier, one way forward would be to implement functions like canonicalization directly in C++ or Cython. I'll investigate the usefulness of this option further.
Hi!
Thanks for this very informative thread :-)
I've just started a performance-focused alternative to urlparse, with the goal of eventually being a drop-in replacement. I also added new benchmarks of a few libraries mentioned here. Check out the results:
https://github.com/commonsearch/urlparse4
It's early, but feedback is very welcome!
As I mentioned in #3134, I have some results regarding the urlparse test on the libraries that @kmike suggested :)
| | Python 2 | Python 3 |
|---|---|---|
| python urlparse | 0.113 seconds | 0.179 seconds |
| purl | 0.549 seconds | 0.512 seconds |
| furl | 6.299 seconds | 4.550 seconds |
| yurl | 0.105 seconds | 0.100 seconds |
And the test code is:

import time

start_time = time.time()
with open('urls.txt', 'r') as f:
    for url in f:
        a = URL(url)  # URL stands for the parsing class of the library under test
print("--- %s seconds ---" % (time.time() - start_time))
where urls.txt contains 23,000 different URLs.
Some urljoin results:
| | Python 2 | Python 3 |
|---|---|---|
| python urlparse urljoin | 0.32 seconds | 0.46 seconds |
| purl urljoin | Could not run | 1.47 seconds |
| furl urljoin | 13.85 seconds | 10.21 seconds |
| yurl urljoin | 0.32 seconds | 0.30 seconds |
The purl library does not have a urljoin-like function, so I used the add_path_segment function, which appends a string to the end of the URL. It is not exactly what we need, but I tested it anyway. In addition, this function could not run on Python 2 because it raises a UnicodeDecodeError. The furl library on Python 2 behaves like the Python urlparse library, which is what was mentioned in #1304. I will keep updating this issue as I test more of the suggested libraries, including ones that are not pure Python :) And I will keep the license issues and library maintenance status in mind. Thank you @lopuhin for your suggestion :)
Thanks @nctl144, interesting results; curious to see results for other libraries. It's better to exclude the time taken to load the test URLs, and also to measure time per URL for easier comparison with other benchmarks. E.g. see this benchmarking code https://github.com/scrapy/scrapy/issues/1306#issuecomment-177201589 above.
> It's better to exclude time taken to load the test URLs, and also measure time per URL for easier comparison with other benchmarks
@lopuhin thank you for pointing that out, I will work on that. Btw, do you have any suggestions for other libraries that I should look at? Right now I'm trying to recreate the tests that were suggested in this issue. Having some specific libraries in mind would help me narrow down the search scope a lot :)
@nctl144 sorry, don't have any specific libraries in mind besides the ones mentioned in this thread. I think it makes sense to start with libraries that are better supported and are easier to install.
@lopuhin I tried the libraries that are mentioned in this thread, and I got some more results. The performance difference between Python 2 and Python 3 is reduced. It's just something that we should keep in mind :)
url parse:
| | Python 2 | Python 3 |
|---|---|---|
| uriparser using Cython, url parse | 0.18 sec | 0.17 sec |
| uriparser using Cython, url join | 0.37 sec | 0.30 sec |
| gurl using Cython, url parse | 1.16 sec | 1.2 sec |
With the help of @lopuhin and @cathalgarvey, I made some profiling tests (the URL file is the Chromium URL test file, which contains about 83k URLs) and here are some results:
| | Python 3 (for pure-Python libs) |
|---|---|
| uriparser | 0.157175 sec |
| urllib | 0.66 sec |
| purl | 1.80 sec |
| yurl | 0.36 sec |
| yarl | 8.21 sec |
I like uriparser a lot and I think it has the potential to increase the speed of Scrapy. According to the benchmarks I ran, canonicalize_url is the function that uses a lot of the time (more information can be seen here) (it seems like vmprof is not working... I will update the comment later if it still does not work). Beyond raw parsing, there are a lot of other URL handling functions that contribute to the run time of a spider, such as parsing URL queries, quoting special characters, and urlunparse.
The table above only shows the speed of the main function, which parses URLs into components. However, based on how quickly it parses URLs, I believe we can assume that other functions such as query parsing, unparsing and URL quoting will also take less time. Thus, if parsing the URLs is faster already, we can speed up canonicalize_url significantly, since we would be replacing parse_qsl, urlunparse and quote as well!
What do you guys think? :)
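For context, a simplified sketch of the call chain a canonicalize_url-style function goes through (not the actual w3lib implementation; it just shows why a faster parser/quoter speeds up every step):

```python
# Simplified sketch of a canonicalize_url-style call chain; every step below
# goes through the stdlib URL machinery.
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode, quote, unquote

def canonicalize_url_sketch(url):
    scheme, netloc, path, params, query, fragment = urlparse(url)        # parse
    query = urlencode(sorted(parse_qsl(query, keep_blank_values=True)))  # sort query args
    path = quote(unquote(path), safe="/%") or "/"                        # re-escape the path
    return urlunparse((scheme, netloc.lower(), path, params, query, ""))  # drop the fragment

print(canonicalize_url_sketch("http://Example.com/a b?z=2&a=1#frag"))
# -> http://example.com/a%20b?a=1&z=2
```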
- `yarl` is obviously about 10 times slower if not cythonized.
- The `idna` package is not fast. In `yarl` we prefer correctness over speed. Check "http://schloß.com": the correct result should be 'http://xn--schlo-pqa.com', while IDNA2003 returns 'http://schloss.com'.
- I suspect other minor differences between the libraries.

A plain speed comparison is not very useful while the compared libs do different work.
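The difference can be seen directly (a small sketch; the stdlib codec implements IDNA 2003, while the third-party idna package used by yarl implements IDNA 2008):

```python
# The stdlib "idna" codec implements IDNA 2003 (which maps ß -> ss), while the
# third-party idna package keeps ß and produces the punycode form instead.
import idna  # third-party package (pip install idna)

host = "schloß.com"
print(host.encode("idna"))   # b'schloss.com'        (IDNA 2003, stdlib codec)
print(idna.encode(host))     # b'xn--schlo-pqa.com'  (IDNA 2008, idna package)
```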
Thanks for the clarification re cythonizing of yarl and idna support, @asvetlov; correctness is important for Scrapy.
@asvetlov Thank you so much for your suggestion. I will check the results again, and I am sorry for any incorrect results!
Hey guys, I made a correctness test repo for some URL parsing libraries: GURL (Chromium), rust-url and uriparser, based on the modern URL test cases mentioned by @kmike (https://raw.githubusercontent.com/web-platform-tests/wpt/master/url/resources/urltestdata.json).
There are 409 test cases in this JSON file. I tested each library's ability to parse the scheme, netloc and path. The following table shows the number of wrong results for each library:
| | Wrong scheme | Wrong netloc | Wrong path |
|---|---|---|---|
| rust | 0 | 0 | 0 |
| uriparser | 0 | 11 | 18 |
| gurl-cython | 0 | 34 | 56 |
The detailed results can be found here.
And the repo to the testing code can be found here. Let me know what you guys think 😃
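For reference, a check against the WPT data roughly takes this shape (a sketch; the field names follow the WPT urltestdata.json format, and the stdlib stands in for the library under test):

```python
# Sketch of a correctness check against the WPT urltestdata.json set; field
# names ("input", "base", "failure", "protocol", "pathname") follow the WPT
# format, and urllib is used as the parser under test here.
import json
from urllib.parse import urljoin, urlsplit

with open("urltestdata.json") as f:
    cases = [c for c in json.load(f)
             if isinstance(c, dict) and not c.get("failure")]

wrong_scheme = wrong_path = 0
for case in cases:
    resolved = urljoin(case.get("base") or "", case["input"])
    parts = urlsplit(resolved)
    if parts.scheme + ":" != case["protocol"]:
        wrong_scheme += 1
    if parts.path != case["pathname"]:
        wrong_path += 1

print("wrong scheme:", wrong_scheme, "wrong path:", wrong_path)
```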
Interesting read: https://daniel.haxx.se/blog/2022/01/10/dont-mix-url-parsers/
I profiled a simple Scrapy spider which just downloads pages and follows links extracted using LinkExtractor; it turns out one of the main bottlenecks is the urlparse module and our related functions like canonicalize_url or request_fingerprint. They take several times more CPU time than e.g. HTML parsing itself.
Also, these functions don't follow modern HTML5 standards - see e.g. https://github.com/scrapy/scrapy/issues/1304
I wonder if we can fix it all at once by wrapping an actual browser implementation. Some options:
Modern tests for URL parsing: https://github.com/w3c/web-platform-tests/tree/master/url