john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License
1.84k stars 210 forks source link

Optimise suffix_index() for URLs with long subdomains #283

Closed elliotwutingfeng closed 1 year ago

elliotwutingfeng commented 1 year ago

Current implementation of suffix_index() searches for longest matching TLDs starting from the left side.

This disadvantages URLs with long subdomains like a.very.long.subdomain.example.co.uk:

By searching from the right side instead, we can reduce the number of steps to:

Since we are now searching from the right, we can optimise suffix_index() even further by moving calls to _decode_punycode() into suffix_index()'s loop; this reduces execution time even for short URLs by converting label to punycode only when necessary.

I also found that str.replace 3 times + str.split is consistently faster than re.compile + re.split. In fact, the performance gap is wider for larger strings with many unicode dots to replace (see the last 2 test cases in the benchmarks).

Benchmarks

Unicode dots are used. On average 10%-40% reduction in execution time. Time savings would be even longer for very long subdomains. The last line is a large string filled with all 3 non-ascii dots.

Helps #175. A trie should be much faster, but it is more complex to implement correctly. Perhaps in a future PR?

Python 3.10, Linux x64, Ryzen 7 5800X

import tldextract

%timeit tldextract.extract("")
%timeit tldextract.extract("com")
%timeit tldextract.extract("example\u3002com")
%timeit tldextract.extract("subdomain\uff0eexample\uff61com")
%timeit tldextract.extract("a\u3002very\uff0elong\uff61subdomain\u3002example\uff0ecom")
%timeit tldextract.extract("an\uff61even\u3002longer\uff0eand\uff61complex\u3002subdomain\uff0eexample\uff61com")
%timeit tldextract.extract("https://a\u3002b\uff0ec\uff61d\u3002e\uff0ef\uff61g\u3002h\uff0ei\uff61j\u3002k\uff0el\uff61m\u3002n\uff0eoo\uff61pp\u3002qqq\uff0errrr\uff61ssssss\u3002tttttttt\uff0euuuuuuuuuuu\uff61vvvvvvvvvvvvvvv\u3002wwwwwwwwwwwwwwwwwwwwww\uff0exxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\uff61yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy\u3002zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz.tw")
%timeit tldextract.extract("\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61")

Before

1.77 µs ± 13.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.92 µs ± 9.67 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
2.62 µs ± 6.85 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
3.26 µs ± 2.66 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
4.99 µs ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.24 µs ± 9.04 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
26.3 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
164 µs ± 369 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

After Changes 1-3

1.76 µs ± 9.35 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.91 µs ± 12.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
2.62 µs ± 9.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
3.1 µs ± 7.18 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
3.39 µs ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
3.64 µs ± 61.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.91 µs ± 64 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
11.4 µs ± 45.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

After Changes 1-4 (re.split replaced with str.replace + str.split)

1.66 µs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.84 µs ± 9.78 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
2.46 µs ± 7.12 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.88 µs ± 3.83 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
3.09 µs ± 31 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
3.21 µs ± 11.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
4.01 µs ± 20.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
5.26 µs ± 25.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Changes

  1. Added test cases for exceptions ! and *., and edge cases for .za (no first level TLD).
  2. Refactored suffix_index() to search for longest matching TLDs starting from right side instead of left side.
  3. Moved _decode_punycode() call into suffix_index()
  4. Replaced re.split with str.replace + str.split
john-kurkowski commented 1 year ago

Also thank you for a very thorough PR description and benchmarks!