pogzyb / asyncwhois

Python WHOIS and RDAP utility for querying and parsing information about Domains, IPv4s, IPv6s, and AS numbers
MIT License
63 stars 18 forks

Benchmarking asyncwhois against registrant name #62

Closed baderdean closed 5 months ago

baderdean commented 1 year ago

Parsing WHOIS data is hard, especially because the format differs depending on the TLD. I have a specific issue with the registrant value, so I decided to test multiple Python whois libraries (pythonwhoisalt, asyncwhois, whoisit, whoisdomain) against the registrant field using Google's domain dataset, and to check their speed too. Initially, I was using whoisdomain, so here is the initial post on its GitHub: https://github.com/mboot-github/WhoisDomain/issues/21

Here is the script I wrote: https://gist.github.com/baderdean/cc4643ecd95d3ccde31dee80ebdbea28

Asyncwhois was the best in terms of quality, yet far slower than whoisdomain. Could someone reproduce this and tell whether it's specific to my case or generic?

And here the results:

{'asyncwhois': {'count': 49,
                'duration': 285.84409061399947,
                'percentage': 26},
 'whoisdomain': {'count': 44,
                 'duration': 195.54051797400007,
                 'percentage': 24},
 'whoisit': {'count': 6,
             'duration': 27.91238160300054,
             'percentage': 3},
 'pythonwhoisalt': {'count': 7,
                    'duration': 1055.0711162629996,
                    'percentage': '4%'}}

PS: I've created similar issues in other projects as well.

pogzyb commented 1 year ago

This is really cool. I appreciate empirically grounded approaches to decision making, so thanks for putting this together.

I honestly haven't looked into the other projects' methods for query submission (network i/o), but I'd assume they're probably very similar performance-wise (besides the asyncio stuff here). I'd bet that the majority of the slowness has to do with the text parsing. For instance, asyncwhois uses tldextract to parse every URL or domain, doesn't compile any regexes, and sometimes runs some funky for loops for the "odd" TLD query responses, but the trade-off is (for the most part) better-quality output.
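As an aside, here's a minimal sketch of what precompiling a regex buys on a hot path. The pattern and field name are illustrative only, not asyncwhois's actual parsing code (Python also caches compiled patterns internally, so the gap is modest but real):

```python
import re
import timeit

# A canned blob standing in for a WHOIS response.
TEXT = "Registrant Name: Example Corp\nRegistrar: Example Registrar\n" * 50

def parse_uncompiled(text):
    # The pattern string is handed to re.search on every call.
    m = re.search(r"Registrant Name:\s*(.+)", text)
    return m.group(1) if m else None

# Compiled once at import time, reused on every call.
REGISTRANT_RE = re.compile(r"Registrant Name:\s*(.+)")

def parse_compiled(text):
    m = REGISTRANT_RE.search(text)
    return m.group(1) if m else None

if __name__ == "__main__":
    assert parse_uncompiled(TEXT) == parse_compiled(TEXT) == "Example Corp"
    t_uncompiled = timeit.timeit(lambda: parse_uncompiled(TEXT), number=10_000)
    t_compiled = timeit.timeit(lambda: parse_compiled(TEXT), number=10_000)
    print(f"uncompiled: {t_uncompiled:.3f}s  compiled: {t_compiled:.3f}s")
```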

But with all that said, I'll definitely see if I can dive a little deeper into your benchmark script, try to identify what's causing the majority of the slowness, and get back to you.

baderdean commented 1 year ago

I've retried after modifying the script to use asyncwhois's async methods by default, yet it's still slower than whoisdomain: https://gist.github.com/baderdean/cc4643ecd95d3ccde31dee80ebdbea28#file-whoisdomain-benchmark-py

The result is a bit different (a bit faster), but the weird thing is that the result is different every time I rerun it:

❯ ./whoisdomain-benchmark.py asyncwhois whoisdomain
Parsing with parsers: ['asyncwhois', 'whoisdomain']
{'asyncwhois': {'count': 47,
                'duration': 221.74783188999936,
                'percentage': '25%'},
 'whoisdomain': {'count': 42,
                 'duration': 166.05002470899854,
                 'percentage': '22%'}}
pogzyb commented 1 year ago

The main reason you're going to get different speeds each time is network i/o. Either the whois server takes slightly longer to respond because it's handling other requests, your IP is rate-limited by the server, or some other network-related jitter skews your results. (Sometimes a country's whois server will go completely offline without warning for maintenance or some other government event.)

All of these libraries have to deal with that. The only advantage asyncwhois gives you is that it supports asyncio, meaning it can take advantage of cooperative multitasking: it signals to the rest of your program that it's about to do some i/o, pauses its execution, and lets other parts of your program run their instructions. However, your script performs each query sequentially, so there's really not much of a difference between the asyncio and normal methods here.
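Here's a minimal sketch of the sequential-vs-concurrent difference, using `asyncio.sleep` as a stand-in for the network i/o of a WHOIS query (`fake_whois_query` is a hypothetical helper, not asyncwhois's API):

```python
import asyncio
import time

async def fake_whois_query(domain: str) -> str:
    # Stand-in for a WHOIS lookup: asyncio.sleep simulates ~100 ms of
    # network i/o, during which the event loop can run other tasks.
    await asyncio.sleep(0.1)
    return f"raw whois text for {domain}"

DOMAINS = [f"example{i}.com" for i in range(10)]

async def sequential():
    # One query at a time, like the benchmark script: total ~= 10 * 0.1s.
    return [await fake_whois_query(d) for d in DOMAINS]

async def concurrent():
    # All queries in flight at once: total ~= 0.1s.
    return await asyncio.gather(*(fake_whois_query(d) for d in DOMAINS))

if __name__ == "__main__":
    start = time.perf_counter()
    asyncio.run(sequential())
    print(f"sequential: {time.perf_counter() - start:.2f}s")  # ~1.0s

    start = time.perf_counter()
    asyncio.run(concurrent())
    print(f"concurrent: {time.perf_counter() - start:.2f}s")  # ~0.1s
```

Awaiting each query in a loop gets none of asyncio's benefit; the speedup only shows up once queries are scheduled together with something like `asyncio.gather`.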

Given this information, your benchmark script is still solid; you'll either need to run it multiple times (like 10-100) to get more robust, averaged results, or shift your focus to the memory usage, speed, and quality of each library's parsing.

This library decouples the "query" from the "parsing". That is, the logic for doing the network stuff and the logic for parsing the big text blob from the server are separated. Check out parse_tld.py; in there you'll find DomainParser, which takes a text blob and top-level domain as input, initializes a "parser", and then parses the text into a dictionary. If you shift your script's focus to parsing speeds, that may help.
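A parsing-only benchmark with averaged repeats might look like this sketch. `parse_registrant` is a hypothetical stand-in for a library parser such as DomainParser (swap in the real call to benchmark it), and the WHOIS blob is canned so no network i/o lands in the timed region:

```python
import re
import statistics
import time

# A canned WHOIS response, so only parsing speed is measured.
RAW_WHOIS = (
    "Domain Name: EXAMPLE.COM\n"
    "Registrar: Example Registrar, Inc.\n"
    "Registrant Name: Example Corp\n"
    "Creation Date: 1995-08-14T04:00:00Z\n"
)

def parse_registrant(blob):
    # Hypothetical stand-in for a library's parser.
    m = re.search(r"Registrant Name:\s*(.+)", blob)
    return m.group(1) if m else None

def bench(parse, blob, runs=10, iters=10_000):
    # Time `iters` parses per run, then average across runs so a
    # single noisy run doesn't skew the result.
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        for _ in range(iters):
            parse(blob)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

if __name__ == "__main__":
    mean, stdev = bench(parse_registrant, RAW_WHOIS)
    print(f"mean {mean:.4f}s +/- {stdev:.4f}s per 10k parses")
```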

baderdean commented 1 year ago

Hello,

Thanks for your serious and solid answer 👍

Yes, the script I wrote has several limitations that prevent drawing solid conclusions, but there are some trends. Plus, the differences between runs are themselves surprising.

One mitigation I could suggest for asyncwhois is to add the ability (as an option, like RDAP) to use the Unix whois tool instead of Python networking, for multiple reasons: native caching, speed, and less chance of being blocked by fingerprinting since it's seen as a legitimate tool.
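A sketch of what that option could look like, shelling out to whois(1) via subprocess (`system_whois` is a hypothetical helper, not part of asyncwhois today):

```python
import shutil
import subprocess

def system_whois(domain: str, binary: str = "whois", timeout: int = 15) -> str:
    """Query WHOIS via the system's whois(1) binary instead of
    Python sockets. Hypothetical helper, not asyncwhois's API."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary!r} not found on PATH")
    result = subprocess.run(
        [binary, domain],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # whois(1) exits non-zero for some registries even on partial
    # success, so return whatever output came back.
    return result.stdout

if __name__ == "__main__":
    # Only attempt a live query if the binary is actually installed.
    if shutil.which("whois"):
        print(system_whois("example.com")[:200])
```

The trade-off: you inherit the host's whois configuration and any referral handling it does, but you also gain a hard dependency on the binary being installed, plus per-platform output quirks.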

Thanks for your work!

pogzyb commented 8 months ago

@baderdean I thought this might interest you. I used an LLM to parse WHOIS data, and it does pretty well. It has some drawbacks, like speed, but overall LLMs seem like great candidates for simplifying regex-related problems like WHOIS parsing.

source: https://github.com/pogzyb/port43?tab=readme-ov-file#basic-example-whois

baderdean commented 8 months ago

Thanks for the study!

IMHO, an LLM is more useful here for writing the parsing code or regexes than for parsing on the fly, for performance and cost reasons.