scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Speedup & fix URL parsing #1306

Open kmike opened 8 years ago

kmike commented 8 years ago

I profiled a simple Scrapy spider which just downloads pages and follows links extracted using LinkExtractor; it turns out one of the main bottlenecks is the urlparse module and our related functions like canonicalize_url or request_fingerprint. They take several times more CPU time than e.g. the HTML parsing itself.

Also, these functions don't follow modern HTML5 standards - see e.g. https://github.com/scrapy/scrapy/issues/1304

I wonder if we can fix it all at once by wrapping an actual browser implementation. Some options:

Modern tests for URL parsing: https://github.com/w3c/web-platform-tests/tree/master/url
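
For reference, one way to reproduce this kind of measurement is to run a crawl under cProfile and then look at the cumulative time spent in the URL functions; a minimal sketch (the spider name "follow" is just a placeholder):

# python -m cProfile -o crawl.prof $(which scrapy) crawl follow
import pstats

stats = pstats.Stats("crawl.prof")
stats.sort_stats("cumulative").print_stats("urlparse|canonicalize_url|request_fingerprint")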

Djayb6 commented 8 years ago

Hello,

Concerning URL parsing, why not consider a URI abstraction directly? I imagine a URI class which would encapsulate all the logic for URI/URL encoding/decoding (e.g. str <--> unicode, idna), quoting/unquoting the path and query, handling "invalid" relative URLs (#1304), and all the other quirks necessary to successfully parse a URI/URL.

Advantages

I would be happy to come up with a mockup in the following days if need be.

EDIT: Inspired by URI.js
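
For illustration, a minimal sketch of what such a mockup could look like, built on the stdlib just to show the surface area it would hide (names are hypothetical, not an existing API):

from urllib.parse import urlsplit, urljoin

class URI(object):
    """Hypothetical mockup: one object hiding the encoding/quoting quirks."""

    def __init__(self, raw):
        self.raw = raw
        self._parts = urlsplit(raw)

    @property
    def host(self):
        # a real implementation would IDNA-encode non-ASCII hosts here
        return self._parts.hostname

    def join(self, reference):
        # a real implementation would also handle the "invalid" relative URLs from #1304
        return URI(urljoin(self.raw, reference))

    def __str__(self):
        return self.raw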

kmike commented 8 years ago

Hey @Djayb6,

A new Python package for URL/URI parsing sounds good to me. I like the idea of using an actual browser implementation under the hood for this Python package (copy-pasting C/C++/Rust? code, creating a Cython/CFFI wrapper) instead of starting from scratch. Browsers already handle all the quirks, and their parsing routines are fast.

kmike commented 8 years ago

Regarding the API, there are several existing libraries which try to provide an API better than urlparse: purl, furl, yurl. A simpler API is nice, but it is not the main goal.

barraponto commented 8 years ago

A Chromium/Mozilla wrapper with a purl-like API would be a godsend. I already use purl wherever I can, since it makes for readable code that is easy to teach.
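
To illustrate the readability difference (assuming purl's documented accessors such as host() and query_param()):

from urllib.parse import urlparse, parse_qs
from purl import URL

u = "https://www.example.com/search?q=scrapy"

# stdlib: tuple attributes plus a separate query parser
parts = urlparse(u)
host, q = parts.hostname, parse_qs(parts.query)["q"][0]

# purl: one object with readable accessors
p = URL(u)
host, q = p.host(), p.query_param("q")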

sibiryakov commented 8 years ago

Hey @kmike, there is Yandex's internal URL parsing routine: https://github.com/yandex/balancer/tree/master/util/uri. It is used and tested in their content system infrastructure. It depends on Ragel (https://en.wikipedia.org/wiki/Ragel) to generate the URL parser and encoder; Ragel is GPL-licensed, but it is only needed during the build phase.

kmike commented 8 years ago

I haven't looked in detail, but the Yandex URL parser looks good; it seems nicely separated from the other code in the balancer repo. Ragel licensing shouldn't be an issue.

Preetwinder commented 8 years ago

I analyzed some suitable candidates for wrapping. Below are the timings in seconds for 10^6 calls to the parsing methods of several URL parser implementations, on a laptop i5 processor. The URL used is http://www.example.com.

  1. C++ Network Library: ~27 seconds.
  2. Python urlparse: ~1.5 seconds.
  3. GURL component from Chromium upstream: ~0.5 seconds (increases with URL length).
  4. Cython wrapper around GURL: ~0.8 seconds (1.2 seconds with a longer, 75-character URL).
  5. uriparser: ~0.2 seconds (increases with URL length; ~0.5 seconds for a longer, 75-character URL).

I haven't tested the Mozilla url parser, but I expect its performance to be similar to GURL's, since both are based on nsURLParsers.cpp.

There is very little scope for optimizing the wrapper; it is already as thin as possible. The speed gained by the wrapper over urlparse for other operations ranges between 1x and 2x (I haven't implemented all of them yet). Issues such as #1304 will be solved by using the wrapper, but we don't see substantial speed gains. Some functions, such as canonicalize_url, can be sped up by implementing them with the wrapper, but this still won't address the bottleneck issue, because urlsplit, urlparse and urljoin are the foremost bottlenecks (according to my profiling results). What are your thoughts? Thank you.

kmike commented 8 years ago

@Preetwinder nice review! Do you have the benchmarking code? Note that Python urlparse uses a cache by default, so if you parse the same URL multiple times it won't actually be parsed again and again. It'd be nice to benchmark parsing of different URLs with paths, query arguments and fragments; http://www.example.com is too easy.
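
A quick way to see the effect of that cache (Python 3 spelling; on Python 2 the module is urlparse) is to compare re-parsing one URL with parsing distinct URLs:

import timeit

same = timeit.timeit("urlparse('http://example.com/a?b=c')",
                     setup="from urllib.parse import urlparse", number=100000)
distinct = timeit.timeit("urlparse('http://example.com/a?b=%d' % i); i += 1",
                         setup="from urllib.parse import urlparse; i = 0", number=100000)
print(same, distinct)  # the second figure is noticeably larger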

Digenis commented 8 years ago

Will this PR discussion also target the link encoding bugs?

kmike commented 8 years ago

@Digenis I don't know, there is no PR yet; I opened this issue for discussion, to get more feedback and to highlight the problem.

sibiryakov commented 8 years ago

@Preetwinder This list may be useful for your test: https://github.com/sibiryakov/general-spider/blob/master/seeds_es_dmoz.txt. The problem with your test is that the code you're running benefits from the CPU cache much more than a production spider would.

Preetwinder commented 8 years ago

The benchmark results in my previous post were not very useful, since they used a single URL, as pointed out by kmike and sibiryakov. The test data used for the new results is:

  1. For Test 1, 80,000 different URLs taken from the Chromium URL Parser Performance Test (caution: large page).
  2. For Test 2, the test data provided by sibiryakov in the previous post, containing about 20,000 URLs.

The code used is:

from time import time

with open("urltesting/urls", "r") as f:
    t = time()
    for line in f:
        pass  # tested code goes here
print time() - t

The results for some operations are below; the times are averaged over 10 runs.

Tested Code                        Time for Test 1 (seconds)    Time for Test 2 (seconds)
urlparse(line)                     0.380                        0.103
urlsplit(line)                     0.301                        0.079
urljoin(line, "something.html")    0.760                        0.196

The following results are for the Cython wrapper. URL is the class; its constructor performs the parsing, and the components are accessible through individual getter functions. The getAll method returns an 8-length namedtuple consisting of the components. The Resolve method is similar to urljoin (note: Resolve is currently imperfectly implemented and can be sped up).

Tested Code                                                                        Time for Test 1 (seconds)    Time for Test 2 (seconds)
URL(line)                                                                          0.110                        0.028
a = URL(line); a.getAll()                                                          0.197                        0.055
a = URL(line); a.Resolve("something.html")                                         0.243                        0.065
a = URL(line); scheme = a.scheme(); username = a.username(); # other components    0.129-0.198                  0.032-0.053
(the range in the last row depends on the number of components accessed)
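
For context, the 8 components are presumably GURL's scheme, username, password, host, port, path, query and ref; a rough sketch of how such a namedtuple could be folded back into urlsplit()'s 5 fields (field names assumed, not the wrapper's actual ones):

from collections import namedtuple

GurlParts = namedtuple("GurlParts",
                       "scheme username password host port path query ref")

def to_split_result(p):
    # rebuild the netloc, then return urlsplit()-style fields
    netloc = p.host if not p.port else "%s:%s" % (p.host, p.port)
    if p.username or p.password:
        auth = p.username if not p.password else "%s:%s" % (p.username, p.password)
        netloc = "%s@%s" % (auth, netloc)
    return (p.scheme, netloc, p.path, p.query, p.ref)

print(to_split_result(GurlParts("http", "", "", "example.com", "", "/a", "q=1", "")))
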
Preetwinder commented 8 years ago

The GURL constructor performs some canonicalization, some filesystem-path-specific changes, and some other checks and modifications by default. An alternative to using GURL here is to call the ParseStandardURL function directly, which behaves very similarly to urlparse. The timing for ParseStandardURL is 0.128 seconds (for Test 1) if we return an 8-length namedtuple (slightly lower for some other data structures).

Also, directly returning the URL string from URL.Resolve (instead of a URL object) reduces its time to about 0.165 seconds (for Test 1).

Wrapping a custom functional interface to the Chromium URL parser functions, instead of using the GURL interface, will likely result in small speed improvements for other operations as well.

With the results seen so far (3-4x for general operations, and possibly more for operations like canonicalize_url), do you think the speed improvements would be worth the switch? Thank you.

Preetwinder commented 8 years ago

@Digenis, can you please provide links to the issue pages for the bugs you mention? Thank you.

Digenis commented 8 years ago

@Preetwinder,

#1405, #1403, #998

See https://github.com/scrapy/scrapy/issues/1403#issuecomment-149194449

... domain should be encoded using IDNA, path should be encoded to UTF8 and then escaped to ASCII, query should be encoded to response encoding and then escaped to ASCII - this is how browsers work.

I asked here because I think the whole solution is not meant to be put in the link extractor.
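
For illustration, a rough Python sketch of those rules (a hypothetical helper, not Scrapy's actual implementation; it assumes absolute URLs and ignores ports, userinfo and many edge cases):

from urllib.parse import quote, urlsplit, urlunsplit

def browser_like_url(url, page_encoding="utf-8"):
    parts = urlsplit(url)
    host = parts.hostname.encode("idna").decode("ascii")           # IDNA for the domain
    path = quote(parts.path.encode("utf-8"), safe="/%")            # UTF-8, then escaped to ASCII
    query = quote(parts.query.encode(page_encoding), safe="=&%")   # response encoding, then escaped
    return urlunsplit((parts.scheme, host, path, query, parts.fragment))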

sibiryakov commented 8 years ago

@Preetwinder nice job, really! I really do value your effort! Here are the problems in your testing code:

  1. use clock(), it's meant for benchmarking, and test it on linux (not windows),
  2. prevent influence from file system access delays, it's more reliable to read the file contents in memory, and parse them from memory,
  3. try if disabling GC affects running times.

Is it possible to use the GURL class directly? Memory allocations for a wrapper class could cost something.

Thanks for your work.

Preetwinder commented 8 years ago

Thank you for the feedback.

use clock(), it's meant for benchmarking, and test it on linux (not windows),

I am using Linux. I have changed the timing function to clock() in the testing code.

prevent influence from file system access delays, it's more reliable to read the file contents in memory, and parse them from memory,

In the new testing code, the content of the file is read into memory using readlines().

try if disabling GC affects running times.

I tried this, but it didn't make any noticeable difference.

Following is the new testing code -

from numpy import median, percentile, mean
from time import clock
import gc
gc.disable()

with open("tests/urls.txt") as file:
    urls = file.readlines()
    times = [0]*len(urls)
    x = 0
    for url in urls:
        t = clock()
        # tested code
        times[x] = clock() - t
        x += 1
    print sum(times)
    print mean(times)
    print median(times)
    print percentile(times, 90)

The timings for the new code on the 80,000-URL test case are:

Tested Code                                   Total    Mean       Median    90%
a = urlparse(line)                            0.40     4.9e-06    5e-06     5.5e-06
a = urljoin(line, "something")                0.80     9.9e-06    9.5e-06   11e-06
a = URL(line)                                 0.12     1.45e-06   1.1e-06   2e-06
a = URL(line); a.getAll()                     0.20     2.2e-06    2.3e-06   3e-06
a = URL(line); a.Resolve("something.html")    0.17     2.1e-06    2e-06     3e-06
a = ParseStandardURL                          0.14     1.7e-06    2e-06     2e-06

The overly discrete median and percentile values are due to the precision limits of the clock function. I have averaged the median and percentile values over a few runs.

According to my research, Cython wrappers are generally pretty fast, and there isn't much avoidable overhead. If you think other methods, such as SWIG, ctypes, or manually writing an extension, would be faster, I can try them out. Thank you.

sibiryakov commented 8 years ago

@Preetwinder Awesome! So we have about a 5x difference between the native urlparse and the Chrome URL parser wrapped with Cython? 5 vs 1.1 µs per URL, which isn't bad I think. I can't find the code for the getAll() function, could you post the link?

I would like to get back to my question

Is it possible to use GURL class directly? Memory allocations for wrapper class could cost something.

Here, https://github.com/Preetwinder/gurl-cython/blob/master/pygurl.pyx#L11, you allocate memory for every new URL wrapping object. Can we use GURL directly to avoid that? It would probably require ctypes or a manual extension?

What API and implementation differences do GURL and urlparse have? I'm trying to imagine possible problems with adopting GURL-based URL parsing in Scrapy.

Could you also try Rust/Yandex libraries?

Thanks for your time again.

Preetwinder commented 8 years ago

Thank you for the feedback.

So we have about 5x times of difference between native urlparse and Chrome URL parser wrapped with Cython?

Not in all cases: URL and urlparse are not directly comparable, since the URL constructor doesn't return anything; that's what getAll does. The speed difference for urljoin vs Resolve is roughly 5x; it is only about 2x for urlparse vs getAll.

I can't find code for getAll() function, could you post the link?

I hadn't updated the repo before, I have added it now, here.

I would like to get back to my question

I misunderstood your question before. The allocation that takes place in the line you mentioned is of the GURL C++ class, not a wrapping object. _thisptr holds the pointer to a C++ class instance, so the interaction with the C++ class is as direct as possible. Instead of using the GURL class, we can also use some of its constituent functions, like ParseStandardURL, which is a bit faster but doesn't perform some of the canonicalization and other checks.

What API, implementation differences GURL and urlparse have? I'm trying to imagine possible problems of adoption of GURL-based URL parsing in Scrapy.

urlparse has urlparse, which returns a 6-tuple, urlunparse, which converts the tuple back into a string, and urljoin, which joins two URLs (strings).

The GURL class has a constructor initialized with a string, checker and getter methods for the individual URL components (hasScheme, getScheme, etc.), and a Resolve method, which joins a string with the GURL object and returns a URL. The ReplaceComponent method accepts the components which need to be changed and returns the URL with those replacements applied (I still need to work on this).
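
To make the mapping concrete, a side-by-side sketch using the wrapper names mentioned earlier in this thread (URL, getAll, Resolve); the module and method names are assumptions, not a settled API:

from urllib.parse import urlparse, urljoin
from pygurl import URL  # the Cython wrapper discussed above (assumed import)

base = "http://example.com/a/b?q=1"

# stdlib: free functions returning tuples and strings
parts = urlparse(base)              # 6-tuple
absolute = urljoin(base, "c.html")

# wrapper: parse once, then use cheap accessors
u = URL(base)
components = u.getAll()             # 8-length namedtuple
absolute = u.Resolve("c.html")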

Could you also try Rust/Yandex libraries?

I haven't found time for this yet because I have been working on another Scrapy issue. After my exams next week, I will try these and also test with GURL the original spider with which kmike discovered the bottleneck.

Thank you.

sibiryakov commented 8 years ago

Hey @Preetwinder, here is what I've got using the Yandex URL parser: count 82257, avg 580.3806971 ns, median 507 ns, 90% 682 ns.

It was collected using the Chromium 80K test set (chosen by you). The major difference is that I wrote a C++ application, to get a feeling for what we lose in the bindings.

So, as you see, it's about 3x faster. Though it's not quite fair to compare it to your numbers, because those were collected through the bindings. Now I'm going to take your GURL library and test it with the same app.

sibiryakov commented 8 years ago

count 82257, avg 794.1626245 ns, median 644 ns, 90% 1144 ns for GURL. So the Yandex one is 1.25x to 2x faster. Maybe this is connected with a more efficient memory allocation model.

Also, we can draw another conclusion: the Python bindings probably cost us about the same amount of CPU cycles as the parsing itself.

sibiryakov commented 8 years ago

I've got an idea. Let's create a library supporting batch operations for URL parsing; for Scrapy that should be a common use case. Let me know what you think!

sibiryakov commented 8 years ago

I made a wrong conclusion about the Yandex parser being 1000 times faster, and have updated the comment.

redapple commented 8 years ago

@sibiryakov , what do you mean by "batch operations on URL parsing"?

sibiryakov commented 8 years ago

A batch of URLs as input, and the response is a vector of results.
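
For illustration, a pure-Python stand-in for the proposed batch call (a real implementation would run the loop in C++ behind a single binding call, which is where the savings would come from):

from urllib.parse import urlsplit

def parse_urls(urls):
    # batch of URLs in, vector of parsed results out
    return [urlsplit(u) for u in urls]

results = parse_urls(["http://example.com/a", "http://example.com/b?q=1"])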

sibiryakov commented 8 years ago

Here is the testing code https://github.com/sibiryakov/balancer/blob/urlbench/tools/urlbench/main.cpp

Preetwinder commented 8 years ago

Thanks for testing the Yandex parser. I tried a batched approach for the Chromium parser; the C++ code is here and the Cython code is here. The results are: total = 0.184, mean = 4.5e-05, median = 4.4e-05, 90% = 4.8e-05. This is for a batch size of 20; performance is slightly worse for smaller batch sizes and slightly better for larger ones. As can be seen, the improvement over GURL is marginal, about 5-10%. We can make small improvements to this with a few minor modifications, but nothing big. I believe such a small improvement is not worth moving to a batched approach. Even if we use the Yandex library (assuming similar wrapping overheads), we only get a 3-4x improvement over urlparse. As we discussed earlier, one way forward would be to directly implement functions like canonicalization in C++ or Cython. I'll further investigate the usefulness of this option.

sylvinus commented 7 years ago

Hi!

Thanks for this very informative thread :-)

I've just started a performance-focused alternative to urlparse, with the goal of eventually being a drop-in replacement. I also added new benchmarks of a few libraries mentioned here. Check out the results: https://github.com/commonsearch/urlparse4

It's early, but feedback is very welcome!

malloxpb commented 6 years ago

As I mentioned in #3134, I have some results from the urlparse test on the libraries that @kmike suggested :)

                   Python 2          Python 3
python urlparse    0.113 seconds     0.179 seconds
purl               0.549 seconds     0.512 seconds
furl               6.299 seconds     4.550 seconds
yurl               0.105 seconds     0.100 seconds

And the test code is:

import time

start_time = time.time()

with open('urls.txt', 'r') as f:
    for url in f:
        a = URL(url)  # URL is the parsing class of the library under test

print("--- %s seconds ---" % (time.time() - start_time))

where urls.txt contains 23,000 different urls.

Some urljoin results:

                           Python 2          Python 3
python urlparse urljoin    0.32 seconds      0.46 seconds
purl urljoin               Could not run     1.47 seconds
furl urljoin               13.85 seconds     10.21 seconds
yurl urljoin               0.32 seconds      0.30 seconds

The purl library does not have a urljoin-like function, so I used its add_path_segment function, which appends a string to the end of the URL. It is not exactly what we need, but I tested it anyway. In addition, this function could not run on Python 2, as it raises a UnicodeDecodeError. The furl library on Python 2 acts like the Python urlparse library, which is what was mentioned in #1304. I will continue to update this issue as I test more of the suggested libraries, including ones that are not pure Python :) And I will keep the license issues as well as library maintenance status in mind. Thank you @lopuhin for your suggestion :)

lopuhin commented 6 years ago

Thanks @nctl144, interesting results, curious to see the results for other libraries. It's better to exclude the time taken to load the test URLs, and also to measure time per URL for easier comparison with other benchmarks. E.g. see this benchmarking code https://github.com/scrapy/scrapy/issues/1306#issuecomment-177201589 above.
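
One possible shape for such a benchmark, using urlsplit as a stand-in for whichever library is being measured:

import time
from urllib.parse import urlsplit

with open("urls.txt") as f:
    urls = f.read().splitlines()   # load first, so file I/O is not timed

start = time.time()
for url in urls:
    urlsplit(url)
elapsed = time.time() - start
print("%.3f s total, %.2f us per URL" % (elapsed, 1e6 * elapsed / len(urls)))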

malloxpb commented 6 years ago

It's better to exclude time taken to load the test URLs, and also measure time per URL for easier comparison with other benchmarks

@lopuhin thank you for pointing that out, I will work on that. Btw, do you have any suggestions for other libraries that I should look at? Right now I'm trying to recreate the tests that were suggested in this issue. Having some specific libraries in mind would help me narrow down the search scope a lot :)

lopuhin commented 6 years ago

@nctl144 sorry, don't have any specific libraries in mind besides the ones mentioned in this thread. I think it makes sense to start with libraries that are better supported and are easier to install.

malloxpb commented 6 years ago

@lopuhin I tried the libraries mentioned in this thread and got some more results. The performance difference between py2 and py3 is reduced; it's just something we should keep in mind :)

url parse:

                                     Python 2    Python 3
uriparser using Cython, url parse    0.18 sec    0.17 sec
uriparser using Cython, url join     0.37 sec    0.30 sec
gurl using Cython, url parse         1.16 sec    1.2 sec

malloxpb commented 5 years ago

With the help of @lopuhin and @cathalgarvey, I made some profiling tests (the URL file is the Chromium URL test file, which contains about 83k URLs) and here are some results:

             Python 3 (for pure-Python libs)
uriparser    0.157175 sec
urllib       0.66 sec
purl         1.80 sec
yurl         0.36 sec
yarl         8.21 sec

I like uriparser a lot and I think it has the potential to increase the speed of Scrapy. According to the profiler that I ran, canonicalize_url is the function that uses a lot of the time (more information can be seen here; it seems vmprof is not working, I will update the comment later if it still does not work). So there are a lot of other URL-handling functions that contribute to the run time of a spider, such as parsing URL queries, quoting special characters, and urlunparse.

The table above only shows the speed of the main function, which parses URLs into components. However, based on how fast it parses URLs, I believe we can assume that other functions, such as query parsing, unparsing and URL quoting, will also take less time. Thus, if parsing the URLs is already faster, we can speed up canonicalize_url significantly, since we will be replacing parse_qsl, urlunparse and quote as well!
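
For reference, a much simplified sketch of the kind of work canonicalize_url does (not the real w3lib code), which shows why it multiplies the cost of the underlying stdlib calls:

from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse, quote, unquote

def canonicalize(url):
    # each call below is one of the stdlib functions identified as hot above
    scheme, netloc, path, params, query, fragment = urlparse(url)
    query = urlencode(sorted(parse_qsl(query, keep_blank_values=True)))   # sort query args
    path = quote(unquote(path)) or "/"                                    # normalize escaping
    return urlunparse((scheme, netloc.lower(), path, params, query, ""))  # drop the fragment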

What do you guys think? :)

asvetlov commented 5 years ago
  1. What do you mean by "pure python libs"? yarl is obviously about 10 times slower if not cythonized.
  2. Another source of yarl's bad performance is IDNA2008 support. Python provides IDNA2003 only, and the third-party idna package is not fast. In yarl we prefer correctness over speed.

Check "http://schloß.com". Correct result should be 'http://xn--schlo-pqa.com', IDNA2003 returns 'http://schloss.com' I suspect other minor differences between libraries. Just speed comparison is useless until compared libs do different work.

lopuhin commented 5 years ago

Thanks for the clarification regarding cythonizing of yarl and IDNA support, @asvetlov; correctness is important for Scrapy.

malloxpb commented 5 years ago

@asvetlov Thank you so much for your suggestion. I will check the result again and I am sorry for any incorrect result!

malloxpb commented 5 years ago

Hey guys, I made a correctness-test repo for some URL parsing libraries: GURL (Chromium), rust-url and uriparser (based on the modern URL test cases mentioned by @kmike: https://raw.githubusercontent.com/web-platform-tests/wpt/master/url/resources/urltestdata.json).

There are 409 test cases in this JSON file. I tested each library's ability to parse the scheme, netloc and path. The following table shows the number of wrong results each library produced:

               Wrong scheme    Wrong netloc    Wrong path
rust           0               0               0
uriparser      0               11              18
gurl-cython    0               34              56

The detailed results can be found here.

And the repo with the testing code can be found here. Let me know what you guys think 😃

Gallaecio commented 3 months ago

Interesting read: https://daniel.haxx.se/blog/2022/01/10/dont-mix-url-parsers/