The current implementation of `suffix_index()` searches for the longest matching TLD starting from the left side. This disadvantages URLs with long subdomains, such as `a.very.long.subdomain.example.co.uk`:
```
a.very.long.subdomain.example.co.uk
very.long.subdomain.example.co.uk
long.subdomain.example.co.uk
subdomain.example.co.uk
example.co.uk
co.uk  ← match
```
By searching from the right side instead, we can reduce the number of steps to:
```
uk
co.uk  ← match
```
Since we are now searching from the right, we can optimise `suffix_index()` even further by moving the calls to `_decode_punycode()` into `suffix_index()`'s loop; this reduces execution time even for short URLs, because each label is converted to punycode only when necessary.
I also found that calling `str.replace` three times + `str.split` is consistently faster than `re.compile` + `re.split`. In fact, the performance gap widens for larger strings with many unicode dots to replace (see the last two test cases in the benchmarks).
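For illustration, the two splitting strategies look roughly like this. The three non-ASCII dots are the IDNA label separators (U+3002, U+FF0E, U+FF61); the function names here are hypothetical.

```python
import re

# The three non-ASCII "dot" characters that act as label separators,
# plus the regular ASCII ".".
_DOT_RE = re.compile("[.\u3002\uff0e\uff61]")


def split_with_re(netloc: str) -> list[str]:
    """The re.compile + re.split approach being replaced."""
    return _DOT_RE.split(netloc)


def split_with_replace(netloc: str) -> list[str]:
    """Normalise the three unicode dots with str.replace, then split once.

    Consistently faster than the compiled regex, especially on long
    strings containing many unicode dots.
    """
    return (
        netloc.replace("\u3002", ".")
        .replace("\uff0e", ".")
        .replace("\uff61", ".")
        .split(".")
    )


netloc = "forums\u3002example\uff0eco\uff61uk"
print(split_with_replace(netloc))  # → ['forums', 'example', 'co', 'uk']
```

Both functions produce identical output; only the normalisation step differs.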
Benchmarks
Unicode dots are used in all test cases. On average, a 10%-40% reduction in execution time; the savings would be even greater for very long subdomains. The last line is a large string filled with all 3 non-ASCII dots.
Helps #175. A trie should be much faster, but it is more complex to implement correctly. Perhaps in a future PR?
Python 3.10, Linux x64, Ryzen 7 5800X
Before
After Changes 1-3
After Changes 1-4 (`re.split` replaced with `str.replace` + `str.split`)
Changes
Tests cover `!` and `*.` rules, and edge cases for `.za` (no first level TLD).