john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License
1.81k stars 211 forks source link

Implement Trie for suffix detection #285

Closed elliotwutingfeng closed 1 year ago

elliotwutingfeng commented 1 year ago

suffix_index() now uses a trie, which eliminates some string concatenation overhead in the old implementation for about 10-15% reduction in execution time.

Helps #175.

Python 3.10, Linux x64, Ryzen 7 5800X

import tldextract

%timeit tldextract.extract("")
%timeit tldextract.extract("com")
%timeit tldextract.extract("example\u3002com")
%timeit tldextract.extract("subdomain\uff0eexample\uff61com")
%timeit tldextract.extract("a\u3002very\uff0elong\uff61subdomain\u3002example\uff0ecom")
%timeit tldextract.extract("an\uff61even\u3002longer\uff0eand\uff61complex\u3002subdomain\uff0eexample\uff61com")
%timeit tldextract.extract("https://a\u3002b\uff0ec\uff61d\u3002e\uff0ef\uff61g\u3002h\uff0ei\uff61j\u3002k\uff0el\uff61m\u3002n\uff0eoo\uff61pp\u3002qqq\uff0errrr\uff61ssssss\u3002tttttttt\uff0euuuuuuuuuuu\uff61vvvvvvvvvvvvvvv\u3002wwwwwwwwwwwwwwwwwwwwww\uff0exxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\uff61yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy\u3002zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz.tw")
%timeit tldextract.extract("\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61")
1.44 µs ± 2.17 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.63 µs ± 12 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.95 µs ± 25.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
2.14 µs ± 75.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.31 µs ± 15.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.38 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
3.38 µs ± 22.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.28 µs ± 19.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Changes

john-kurkowski commented 1 year ago

Beautiful. 🙏