pogzyb / asyncwhois

Python WHOIS and RDAP utility for querying and parsing information about Domains, IPv4s, IPv6s, and AS numbers
MIT License

Server detection bug + Updated BR parser #98

Closed antalgu closed 1 week ago

antalgu commented 1 week ago

Hello again, we've been running an extensive set of tests with the library and noticed a few things, the first of which is an important bug. When doing WHOIS queries by domain, the function aio_run passes the search_term (which is the full domain) both to the base aio_run and to _get_server_name (when no server is given); however, _get_server_name expects the TLD, and given the full domain it can't find the proper server. https://github.com/pogzyb/asyncwhois/blob/7cb3527959b6960299f72df8a08a20daa298d1fa/asyncwhois/query.py#L213-L216 This is easily solved by passing both values along and giving each function the one it expects. The same problem exists in the non-async version.
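
For illustration, here is a minimal sketch of the fix (the function and helper names mirror the discussion above and are hypothetical; the real signatures in query.py differ):

    # Sketch only: resolve the server from the TLD, then query with the full domain.
    async def aio_lookup_domain(search_term: str, server=None):
        tld = search_term.rsplit(".", 1)[-1]       # "correoparaguayo.top" -> "top"
        if server is None:
            server = _get_server_name(tld)         # the lookup expects the TLD...
        return await aio_run(search_term, server)  # ...the query expects the domain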

A second thing: in _aio_do_query, each time you do a query you store the output and parse the response for a referred WHOIS server name: https://github.com/pogzyb/asyncwhois/blob/7cb3527959b6960299f72df8a08a20daa298d1fa/asyncwhois/query.py#L159-L186 When no server is known, or aio_run is called without one, the chain + regex approach comes in handy because it finds a WHOIS server to query. Even when a server is already known, the response sometimes redirects you to another one (which may have better information, though I don't know if that's the case). That happens, for example, with domain "seanho.com" and whois_server "whois.porkbun.com".

However, we found some examples where the first call would return good information, the code would find a referred server, make a second call that didn't respond, and then crash. This is the case with domain "correoparaguayo.top" and whois_server "www.gname.com"; the query_output from the first call was:

"Domain Name: correoparaguayo.top\r\nRegistry Domain ID: D20241104G10001G_33606946-top\r\nRegistrar WHOIS Server: www.gname.com\r\nRegistrar URL: http://www.gname.com\r\nUpdated Date: 2024-11-04T09:10:03Z\r\nCreation Date: 2024-11-04T09:09:31Z\r\nRegistry Expiry Date: 2025-11-04T09:09:31Z\r\nRegistrar: Gname.com Pte. Ltd.\r\nRegistrar IANA ID: 1923\r\nRegistrar Abuse Contact Email: complaint@gname.com\r\nRegistrar Abuse Contact Phone: +65.6531989391\r\n...

We noticed that commenting out the recursive call:

        '''
        # parse response for the referred WHOIS server name
        whois_server = self._find_match(regex, query_output)
        whois_server = whois_server.lower()
        if (
            whois_server
            and whois_server != server
            and not whois_server.startswith("http")
        ):
            # recursive call to find the authoritative server
            chain = await self._aio_do_query(
                whois_server, data, self.whois_server_regex, chain
            )
        # return the WHOIS query chain
        '''

Would not only solve this problem, but also make the library blazingly fast (it was already very fast, but it improved quite a bit). It also didn't seem to affect data integrity much (take this with a grain of salt; I will test it more thoroughly tomorrow morning). "correoparaguayo.top" seems like a corner case, but it isn't that rare: we found a few more examples (I can't recall them, we ran too many tests). It would be great if there was a way, first of all, to set a max depth (for us, for example, max_depth = 0 would seemingly work, though I'll have to verify that tomorrow), and also to have stricter control over these recursive calls, so that if a referral call would crash, the error is handled and the data already gathered is returned. You would need sub-timeouts or something; right now the whole query has a timeout of 10 seconds, I think, so maybe the first call could have 5 seconds and then a maximum of 2 recursive calls with 2 seconds each? I don't know, that's something to decide. A rough sketch of the idea follows.
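
To make the idea concrete, here is a rough, self-contained sketch of a depth-limited referral chain with per-hop timeouts. send_whois and find_referred_server are hypothetical helpers, and the default values just echo the numbers floated above; none of this is the library's current API:

    import asyncio
    from typing import List

    async def do_query_chain(
        server: str,
        data: str,
        chain: List[str],
        depth: int = 0,
        max_depth: int = 2,
        hop_timeout: float = 2.0,
    ) -> List[str]:
        # Each hop gets its own timeout, so one dead referral server cannot
        # consume the whole query budget or crash the lookup.
        try:
            output = await asyncio.wait_for(send_whois(server, data), hop_timeout)
        except (asyncio.TimeoutError, OSError):
            # The referral did not answer: return the data we already have.
            return chain
        chain.append(output)
        referred = find_referred_server(output)  # regex over the response text
        if referred and referred != server and depth < max_depth:
            chain = await do_query_chain(
                referred, data, chain, depth + 1, max_depth, hop_timeout
            )
        return chain

With max_depth = 0, the first query still runs but no referral is ever followed, which matches the behavior of commenting out the recursive call.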

And lastly, we found several problems when dealing with .br datetimes; these got solved by adding this:

    TLDBaseKeys.CREATED: r"created: *([^#\n]*)",
    TLDBaseKeys.UPDATED: r"changed: *(.+)",

To:

https://github.com/pogzyb/asyncwhois/blob/7cb3527959b6960299f72df8a08a20daa298d1fa/asyncwhois/tldparsers.py#L181-L189
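
For a quick sanity check of those two patterns: registro.br appends a ticket number after a "#" on the created line, which is why the capture group excludes "#". The sample response below is made up but follows the registro.br format:

    import re

    sample = "created:     19990505 #154018\nchanged:     20230101\n"
    created = re.search(r"created: *([^#\n]*)", sample).group(1).strip()
    changed = re.search(r"changed: *(.+)", sample).group(1).strip()
    print(created, changed)  # 19990505 20230101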

If anything is unclear, or you'd like more examples or code, feel free to ask; I've written quite a wall of text, and everything I've explained might not come across clearly.

pogzyb commented 1 week ago

Hi,

Thanks again for the clear explanation and for providing examples! I'm happy that this library is useful and I appreciate your help with making it better.


When doing WHOIS queries by domain, the function aio_run passes the search_term (which is the full domain) both to the base aio_run and to _get_server_name (when no server is given); however, _get_server_name expects the TLD, and given the full domain it can't find the proper server.

Oh wow! Good observation. This will always add an unnecessary call to IANA. I'll get a PR together right away to fix this.


When no server is known, or aio_run is called without one, the chain + regex approach comes in handy because it finds a WHOIS server to query. Even when a server is already known, the response sometimes redirects you to another one (which may have better information, though I don't know if that's the case), ...

Yes, this is exactly right. Ultimately, we want the "authoritative" response, which will have the most accurate information. In most cases, we can find the server information for the TLD in server/domains.py; otherwise, the code defaults to starting with a call to IANA (which is currently always happening because of the bug you pointed out).
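
As a simplified illustration of that resolution order (the dictionary entries are examples only, not the actual contents of that module, and query_iana_for_server is a hypothetical helper):

    # Known TLDs skip the extra round trip to IANA entirely.
    KNOWN_TLD_SERVERS = {
        "com": "whois.verisign-grs.com",
        "top": "whois.nic.top",
    }

    def resolve_whois_server(tld: str) -> str:
        server = KNOWN_TLD_SERVERS.get(tld)
        if server:
            return server
        # Unknown TLD: ask IANA, whose response names the registry's WHOIS server.
        return query_iana_for_server(tld)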

... Would not only solve this problem, but also make the library blazingly fast (it was already very fast, but it improved quite a bit). It also didn't seem to affect data integrity much (take this with a grain of salt; I will test it more thoroughly tomorrow morning).

I see how doing this would make things much faster, but yes, the trade-off is data integrity. Stopping before reaching the authoritative source will likely mean responses that are less accurate or missing certain items like registrant information. As you pointed out, there are certainly instances where non-authoritative responses are perfectly OK, but it's hard to judge. If during your testing you find that "stopping early" yields good results in the majority of cases, then I would be interested in implementing a flag or switch for this.

"correoparaguayo.top" seems like a corner case, but it isn't that rare: we found a few more examples ... it would be great if there was a way, first of all, to set a max depth.

I agree that "correoparaguayo.top" falls more into the edge-case category. Typically, I would stick to handling these via string comparisons, like so:

        if (
            whois_server
            and whois_server != server
            and not whois_server.startswith("http")  # edge-case
            and not whois_server.startswith("www.")  # edge-case
        ):
            # recursive call to find the authoritative server
            chain = await self._aio_do_query(
                whois_server, data, self.whois_server_regex, chain
            )

Again, happy to change things though if you feel there's a better solution. I do something very similar to your "max depth" idea in the whodap project for RDAP queries:


def _get_authoritative_response(
    self, href: str, seen: List[str], depth: int = 0
) -> Optional[httpx.Response]:
    """
    Makes HTTP calls to RDAP servers until it finds
    the authoritative source.

    :param href: href containing the location of an RDAP server
    :param seen: hrefs already visited (guards against referral loops)
    :param depth: recursion counter
    :return: `httpx` response object
    """
    resp = self._get_request(href)
    try:
        self._check_status_code(resp.status_code)
    except NotFoundError:
        # Occasionally, gTLD RDAP servers are wrong or not fully implemented.
        # If we've succeeded previously, but now the href returns a 404,
        # return None, so the last OK response is returned.
        if depth != 0:
            return None
        else:
            raise
    # check for more authoritative source
    try:
        # If for some reason the response is invalid json, then just return None.
        # This may happen if we request an authoritative href that is not actually
        # rdap+json, but instead a webpage/html.
        rdap_json = resp.json()
    except JSONDecodeError:
        return None
    links = rdap_json.get("links")
    if links:
        next_href = self._check_next_href(href, links)
        if next_href and next_href not in seen:
            seen.append(next_href)
            resp = (
                self._get_authoritative_response(next_href, seen, depth + 1) or resp
            )
    # return authoritative response
    return resp

And lastly, we found several problems when dealing with .br datetimes, these got solved by adding ...

Thanks so much! That should be an easy fix.

antalgu commented 1 week ago

Hello again and thanks for the fast responses and changes!

I've been testing the data-integrity difference between the two methods (getting the response from the first server and parsing it vs. recursively trying to find the authoritative response) and found that, at least for our case, an option to set a max depth, or some other way to avoid entering the recursive calls, would be very beneficial.

Even though the recursive calls gave more information overall when they worked, we found that a ton of calls either returned as if there were no information for the domain (when the first response did have information) or ended up crashing (and therefore returned nothing). Since we're actually only interested in the created and updated dates, working with the first query is nearly optimal for us.

It's quite possible that this doesn't happen as often in your use case, since it depends quite a bit on the type of domains being checked, but in our case it was pretty relevant (the first result set is without the recursive calling, the second is the current library version): [comparison screenshots: modified vs. base]

I was going to fork the repo and propose changes for this directly, but since you already have a similar system for RDAP, I wasn't sure whether you already had an idea in mind.

Thanks for everything!

pogzyb commented 1 week ago

Wow, that is interesting. In that case, adding this as a feature makes sense. Honestly, I think most of the developers who use this library are primarily concerned with retrieving the created/updated/expires dates, so this could be a nice addition for everyone.

I was going to fork the repo and propose changes for this directly, but since you already have a similar system for RDAP, I wasn't sure whether you already had an idea in mind.

Sounds good - please go ahead and open a PR with your changes. I can review or modify if needed.

Thanks!

antalgu commented 1 week ago

Created the pull request. These are the tests when using no max_depth and when using 0:

[screenshot: Captura de pantalla 2024-11-05 162718]

And this is before the changes (the same as using no max_depth):

[screenshot: Captura de pantalla 2024-11-05 164819]

If you can, test them, and if they're okay, squash them (in the second commit I only changed spacing).