richardpenman / whois

MIT License
346 stars 180 forks

Philosophical changes and future development #164

Open mzpqnxow opened 1 year ago

mzpqnxow commented 1 year ago

Thanks for your work on this project; it's come a long way from just a transliteration/functional Python fork of the C whois implementation and I use it in quite a bit of automation. I appreciate your openness to PRs and your handling of issues

What follows is not a typical issue. It's some opinion, some analysis, and an offer that I don't necessarily expect you to accept

Proposed Change

First, an obvious observation: much of the activity in Issues/PRs is "add support for xyz TLD/gTLD". You of course know that this usually requires one of the following:

There are some things in the current implementation that can be done differently that provide some benefit- either technical changes, or changes to the higher level approach

I couldn't find the note I thought I had posted here a few years ago- it can be summarized as "are you open to refactoring?", to which you responded (iirc) with an enthusiastic "yes". If that's no longer the case (or was imagined) then you can ignore the rest of this πŸ˜€

The local fork I have makes a foundational change to the way individual TLDs are handled in the code. Rather than requiring a large number of imports or a very large file for the classes, I used a pluggable approach; each of the TLD classes lives as a class in a module under a whois.tlds package. So, along the lines of whois.tlds.com::WhoisCom

Each class still inherits from a common one for the regex and such, but has an extra attribute:

tld: str

To identify which class to use while also avoiding hundreds of imports, I added a module that uses importlib to iterate over all of the modules at run-time, building a global mapping of TLD -> class by referencing the new tld attribute on each

I'm not sure if I did this in the Pythonic way, but it's pretty small, only a few hundred lines or so. And it's mostly hidden behind a simple interface, along the lines of:

    # gets WhoisCom
    cls = get_tld_class("com")
    w = cls(data)

This means much of the long if/elif chains can go away. Additionally, all of the imports can go away. It's also (imo) simpler to make changes or additions to, and of course easier to read, even if only because there's less to look at
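To illustrate the mechanism, here's a minimal sketch (not the fork's exact code; build_tld_registry is an invented name, while get_tld_class matches the interface shown above):

```python
import importlib
import pkgutil
from typing import Dict


def build_tld_registry(package_name: str) -> Dict[str, type]:
    """Walk every module in the given package (e.g. whois.tlds) and map
    each class's `tld` attribute to the class itself, so call sites need
    neither hundreds of imports nor an if/elif chain."""
    registry: Dict[str, type] = {}
    package = importlib.import_module(package_name)
    for info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{info.name}")
        for obj in vars(module).values():
            tld = getattr(obj, "tld", None)
            if isinstance(obj, type) and isinstance(tld, str) and tld:
                registry[tld] = obj
    return registry


def get_tld_class(tld: str, registry: Dict[str, type]) -> type:
    # e.g. get_tld_class("com", registry) -> WhoisCom
    return registry[tld]
```

The registry would normally be built once at import time and hidden behind a module-level function.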

State of the fork

What I have is functional and used in daily automation, but along the way I made a few invasive changes that may make it unappealing or difficult to accept as a single merge:

  1. Removed the non-domain lookup functionality
  2. Added filesystem-based caching of responses (very helpful during development and when used at scale in an automated fashion)
  3. Lots of minor stuff that touches every file (style, aesthetics, trying to adhere to PEP-8 and whatever "black" prefers)
  4. "Breaking" changes; specifically PEP-484 type annotations which I use somewhat religiously with mypy. These annotations break Python2

For item 1, I felt it was no longer useful to support non-domain lookups, because RDAP infrastructure for other objects is in place now, and RDAP is superior to WHOIS

For item 2, what I added is relatively simple and uses cache_to_disk for the caching. It's just a decorator and slightly special exception handling which (incidentally) I contributed specifically for this; you'll notice that the README example for that project is whois-themed :)
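For reference, a stdlib-only sketch of the idea (the fork actually uses the cache_to_disk package, so none of these names are its real API):

```python
import functools
import hashlib
import json
import os
import tempfile

# Hypothetical cache location; cache_to_disk manages its own storage
CACHE_DIR = os.path.join(tempfile.gettempdir(), "whois_cache_demo")


def disk_cache(func):
    """Cache each args -> result pair as a JSON file keyed by a hash of
    the arguments; repeated lookups of the same domain then hit the
    filesystem instead of the WHOIS server."""
    @functools.wraps(func)
    def wrapper(*args):
        os.makedirs(CACHE_DIR, exist_ok=True)
        key = hashlib.sha256(repr(args).encode()).hexdigest()
        path = os.path.join(CACHE_DIR, key + ".json")
        if os.path.exists(path):
            with open(path) as fh:
                return json.load(fh)
        # Exceptions (e.g. network failures) propagate before the write
        # below, so failed lookups are never cached
        result = func(*args)
        with open(path, "w") as fh:
            json.dump(result, fh)
        return result
    return wrapper
```
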

Are you still/were you ever interested in entertaining a somewhat ambitious refactor and/or adding the caching or typing? If so, what sort of time and interest might you have to work on it if I was to give you the functional code I have?

I don't expect you to take this on; I had fully intended to see this all the way through with my own time, via a proper public fork and PRs for individual changes, but I've had a lot of time commitments since then, so it's remained a private fork. It works great for what I do but I wouldn't expect any reasonable person to accept it as a PR; it's more invasive than is necessary and as I mentioned, some features were completely removed (including Python2 support)

I'm at the point where it seems like development of the fork will die in my own private (non-Github) git repo, as it's "good enough" for me as it is now

Parsing Patterns: Per-WHOIS server rather than per-domain

This is only a thought. I didn't implement it...

I noticed while working on the fork and (particularly) while using it to look up domains at a relatively large scale (~10k domains, across many hundreds of TLDs/gTLDs) that the format of the WHOIS response doesn't necessarily correlate with the TLD. More correctly, it correlates with the WHOIS server implementation/name. One of the more common servers, for example, is the Donuts WHOIS server, which is the authority for tons of TLDs- either directly or through a referral. The response data is uniform across many different TLDs/gTLDs

Seeing it so many times across different TLDs made me think it might be wiser to rewrite with the objective being having server-specific parsing classes, rather than TLD-specific ones. This way you end up with much fewer parsing classes, they may end up more reliable, and they'll cleanly support parsing responses for domains that aren't explicitly supported, without requiring use of a fallback class/pattern. Realistically many of the TLD-specific parser classes might need to stick around, but it's possible that most could go away. The logic could check first for a server-specific class, then a TLD-specific one

Functionally, this may not be a huge improvement- the fallback in the current project is pretty darned good and it's relatively rare that the response isn't parsed well enough; but it could be a good opportunity to eliminate a lot of boilerplate/repetitive code that has really added up over time, and to reduce the trickle of Issues and PRs along the lines of "add support for .bla TLD", when the enhancement is typically a matter of copy/paste/rename

Finally, the end

Any thoughts? I can push what I have locally into a GitHub fork if you'd like to see exactly how much effort is involved. Or, if you're not interested and/or don't have time, I can leave it to rest. I just don't have time to do this as I intended, which originally included cleaning it up more, restoring the few features I removed, and making it merge as cleanly as possible

Thanks again and kudos on this project. 5 years ago I would have said that WHOIS was nearing its death as a result of RDAP adoption, but from what I've read, we're far from that point when it comes to TLDs. RDAP adoption for domains has completely stalled for a lot of TLDs/gTLDs, unfortunately

I'll stop taking up your time now πŸ˜€

mzpqnxow commented 7 months ago

I have not had time to finish (or even clean up) what I have, unfortunately. I do still intend to get around to it, I just can't see any time for it in the next 6 months. It does fully work for my use-case as-is (it's not a "proof of concept") but there are lots of things preventing it from being suitable for a PR

If someone is willing to work on it, it requires a bit of time. The more familiar you are with the existing codebase, the quicker the work will be, though

If anyone is interested in doing any of these things, I can put up what I have as a temporary separate repo

I'd rather not go through the trouble of doing that if nobody currently has the time or interest to take it on yet, because like everyone, I'm short on time. But if anyone speaks up I will absolutely do it and link to it here

The main work required to make this even close to suitable for a PR follows... sorry for the large dump of items

  1. Remove (or clean up and make more tunable) the caching I added. Not sure if anyone wants caching? I added it because my use-case involves large batches done on a regular basis, and doing thousands of queries daily occasionally results in blocks. Also, it makes dev/testing faster (instant responses). @richardpenman, any thoughts on having the caching? It's applied around the outermost network request for a given TLD. It doesn't cache failed responses (such as network failures) but it may cache "negative" responses (I don't recall with certainty)
  2. Restore the IP whois feature(s). I stripped out the support for doing lookups on IPs because at this point, RDAP is implemented by all of the RIRs for IP addresses and it's superior. @richardpenman I assume you want the IP functionality to remain, correct? I don't want it, but it's understandable to want to keep it intact
  3. Review the many, many existing classes and their respective regexes and consolidate them into as few classes as possible
  4. Analyze the classes resulting from step 3 against the WHOIS server implementation (rather than ABC, DEF, and GHI classes all having the same patterns). This is a step towards shifting from TLD-specific classes to WHOIS server implementation classes
  5. Lots of minor and major refactors need to be cleaned up in general
  6. Remove/replace/improve logging I added during testing. In theory it can at least be silenced easily if it isn't desired as I used the Python logging package. But maybe it should be kept, in which case it needs cleanup and an interface to control it via the CLI
  7. Restore the CLI/entry-point. I only implemented a very simple/limited CLI interface for what I needed, which is invoking a method with argv[1] and nothing more
  8. Documentation, specifically for developers needing to add additional parsing. Though in theory, this really ought to become an even rarer thing

Regarding the third and fourth items...

This could be done in stages. One way to approach it would be to transition to the importlib-based approach of getting the parser for the response as the first step

It's a big task (big change to the code) but it's mostly done. Part of that task involves splitting out each existing TLD class into its own file, in a new subpackage/directory. So you have a structure like this:

whois/
  tld/
    __init__.py
    adult.py
    ba.py
    ...
    com.py
    ...
  ...

That was a painful task but I already have it done so it's easy to copy

The other parts are the module with a few functions that scan the TLD subpackage directory to build a simple registry (a dict of tld -> class) and simple logic to choose the class from said registry at runtime, so it can be instantiated for parsing

Sorry for the empty (so far!) promises and the very long post 😞

richardpenman commented 3 months ago

Sorry for delay replying. Busy with babies...

Yes am very open to refactoring, particularly if it improves test coverage. I have been accepting most PRs and haven't pushed back for tests.

Instead of iterating over all the modules at runtime to build the TLD map, could importlib be used to just import the module matching the given domain TLD?
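e.g. something roughly like this (whois.tlds and the Whois<Tld> class naming are just the layout suggested above; untested):

```python
import importlib


class FallbackParser:
    """Stand-in for the project's generic fallback parser."""


def load_parser(tld: str) -> type:
    """Import only whois.tlds.<tld> on demand instead of scanning the
    whole package up front; a missing module falls back to the default."""
    try:
        module = importlib.import_module(f"whois.tlds.{tld}")
    except ModuleNotFoundError:
        return FallbackParser
    return getattr(module, f"Whois{tld.capitalize()}", FallbackParser)
```
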

I'm not familiar with RDAP, but often people in the issues have use cases for IP lookups, so I would be hesitant about removing this. Does maintaining support for IP lookups complicate the implementation?

Caching is a great idea during development. I personally use the pdict module for this use case.

Black / PEP-8 fixes would be welcomed. Also a github CI PEP-8 check to make sure this is maintained. I'm also open to removing Python2 support to simplify things. I've received a number of issues for Python3 but none for Python2 lately, so I think it's time to drop this.

Analyze the classes resulting from step 3 against the WHOIS server implementation (rather than ABC, DEF, and GHI classes all having the same patterns). This is a step towards shifting from TLD-specific classes to WHOIS server implementation classes

That is a great idea, though would it be harder to maintain in future when someone comes along with a missing TLD?

It would be ideal to get any of these changes in separate PRs though if you're still motivated, which given my delay in responding I don't expect!

richardpenman commented 3 months ago

Made some changes today to migrate codebase to pep-8 with black, removed python2 support, and added CI checks for flake8 and unit tests.

mzpqnxow commented 3 months ago

Sorry for delay replying. Busy with babies...

EDIT: Congrats?

That's ok, we'll deduct it from all the pay you receive for your thankless dedication to this project πŸ˜‰

Yes am very open to refactoring, particularly if it improves test coverage. I have been accepting most PRs and haven't pushed back for tests.

πŸ‘

Instead of iterating over all the modules at runtime to build the TLD map, could importlib be used to just import the module matching the given domain TLD?

Good question. I vaguely recall a couple of reasons I chose walking the directory and disregarding the filename:

Neither of these make your approach objectively unsuitable. I don't have much of a technical argument behind what I chose, nor do I feel too strongly about it as it's a reasonably simple implementation detail to change

What is your interest in the approach you mentioned? Efficiency, from not needing to open all the files? Or reduced code? Maybe both. Any others?

I'm not familiar with RDAP

You should look into it, the gist is that RDAP is effectively WHOIS but using HTTP with consistently structured JSON responses

It's really quite nice and allows (among other things) throwing away all of the pattern-matching logic and raw socket code. And for the users, it works with HTTP proxies (this is important for my use-case)
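A minimal sketch of what an RDAP lookup looks like (rdap.org is a public redirector; individual registries publish their own base URLs; error handling omitted):

```python
import json
import urllib.request

RDAP_BASE = "https://rdap.org/domain/"


def rdap_url(domain: str) -> str:
    """RDAP responses are plain JSON over HTTP(S), so the 'parsing'
    is just json.load -- no per-server regexes needed."""
    return RDAP_BASE + domain


def rdap_lookup(domain: str) -> dict:
    # Unlike port-43 WHOIS, this works through standard HTTP proxies
    with urllib.request.urlopen(rdap_url(domain), timeout=10) as resp:
        return json.load(resp)
```
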

It's really a shame it hasn't been adopted as planned for TLDs... 5 years ago I honestly expected I would no longer have a use for this project - did not work out that way at all!

but often people in the issues have use cases for IP lookups, so I would be hesitant about removing this.

Fair enough

Does maintaining support for IP lookups complicate the implementation?

I don't recall how much it complicated things- it's possible I just made a hasty judgement call. It always feels nice to chop out a bunch of conditionals. I doubt it's a substantial hindrance

Caching is a great idea during development. I personally use the pdict module for this use case.

πŸ‘

Black / PEP-8 fixes would be welcomed. Also a github CI PEP-8 check to make sure this is maintained

Are you familiar with pre-commit? It's very nice for helping to enforce this.. well.. pre-commit. You'll still want CI checks, but pre-commit reduces the number of reactive code changes (you may already have it in the repository, I'm on mobile and didn't check)

I'm also open to removing Python2 support to simplify things. I've received a number of issues for Python3 but none for Python2 lately, so I think it's time to drop this.

I see you already started on some of this πŸ‘

Analyze the classes resulting from step 3 against the WHOIS server implementation (rather than ABC, DEF, and GHI classes all having the same patterns). This is a step towards shifting from TLD-specific classes to WHOIS server implementation classes

That is a great idea, though would it be harder to maintain in future when someone comes along with a missing TLD?

It could be, but it doesn't necessarily have to be. It would have to be kept in mind, because it would be a big point against any implementation that makes this task harder

It would be ideal to get any of these changes in separate PRs though

Yeah, makes sense. Would you do it directly to master or some new branch? I guess most users aren't impacted by changes to master unless it's published as a package, so maybe it doesn't matter, but a branch may be the best way to go to avoid confusing users installing from master (a terrible practice, but I'm sure someone is doing it...)

if you're still motivated, which given my delay in responding I don't expect!

I was the one with the longest delay, I think. I'm still motivated, but my time is still pretty limited unfortunately

I'll keep following this and hopefully can carve at least a little time out to lend a hand

mzpqnxow commented 2 months ago

More idle commentary to follow πŸ˜€

I don't think this necessarily requires a response, you can consider it food for thought, otherwise I'll consider it a note for myself. Could be you've already considered it

First, I was recently looking through my Github updates and noticed another project I had followed a long, long time ago, which is a whois parsing implementation in Ruby

https://github.com/weppos/whois-parser/tree/main/lib/whois/parsers

Because of our recent discussion, I went to look at how that project approaches it, to see if it gave me any new ideas or perspectives.

You may be interested in checking it out, or may already be familiar with it. My quick observation from a couple of glances: I don't know that there's anything particularly novel there, though it seems it may leverage classes and language features more extensively. The result might be more elegant (subjective), but each subclass is also a lot bulkier than with your current approach. πŸ€·β€β™‚οΈ

That code made me think of another point, which is the lowest level detail of the parsing. The current implementation uses a single regex string. Beautiful, as it's simple by many measures. But also ugly in some cases, because occasionally it gets ridiculous constructing a single regex for the more exceptionally unruly formats and records

Have you considered support in the base class for optional field-specific regexes? This would be for really exceptional cases where constructing a single regex is either extraordinarily difficult or seemingly impossible

I have a few specific examples I'm going to dig up now. I'm proficient with regex but don't consider myself an expert. Maybe you can actually come up with a regex for them and I'll learn something new

Note: it's fair to say one can always extend the class further, as is the point of classes. True. I suppose it's a matter of taste/preference as to whether changes should be left to each subclass, or supported via some scaffolding in the base class. There's a threshold imo, when more than N subclasses need to implement something, it's possible it should be natively supported by a parent class

Thanks!

EDIT: I'm reviewing the code now and I don't know exactly what I was thinking - the regex is already field-specific - clearly it's been way too long since I looked at this. I need to catch up more on both your code and mine before I continue spamming the issue :)

mzpqnxow commented 2 months ago

Regardless of my mistakes/failing memory, I had this in my notes as well as a "potentially helpful reference" with regards to parsing - I think seeing this and one other project taking the server-specific parsing route is probably what made me decide it wasn't an insane idea and might actually be worth raising :smile:

https://github.com/Whoisdoma/WhoisParser/blob/master/src/Config/whois.ini

(Lines 731 on have the parser info)

mzpqnxow commented 2 months ago

I gisted one specific piece of the changes on my side if you're curious about some of the larger/structural modifications. Here's the entry.py file in its entirety:

https://gist.github.com/mzpqnxow/28909f8803a99b2cf4f435d3db7e86b8

Some dubious choices for sure - like making it a subclass of UserDict. But I figured better to at least show you SOMETHING so you get an idea of what it looks like

The class "registry" file (registry.py) can be found here:

https://gist.github.com/mzpqnxow/2d3dc445a2033ef893417c525b718186

After your suggestion/question "can they just be loaded based on the name of the tld?" - I realize this does look needlessly complex. I'm still not 100% sure if I encountered an issue with the approach you described or if I just chose this method because it seemed better to me at the time :smile:

This note, however, is curious, and may answer that question (I have to review more closely to be certain):

                    # We don't break here; thinking of a case where there might be two matches, e.g. for
                    # something like .co.za and .za. Currently, this shouldn't happen with the current
                    # implementation, but future-proofing just in case. We just assert() that there's only
                    # one match at the end. It doesn't hurt much

This case doesn't necessarily preclude what you described, though. And it would be irrelevant if you decide to switch to per-server parser classes

richardpenman commented 2 months ago

What is your interest in the approach you mentioned? Efficiency, from not needing to open all the files?

Yeah was thinking of efficiency - do you have an idea how long the imports would take? Apparently this can be measured with -X importtime: https://stackoverflow.com/a/52090274/105066
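e.g. (using json as a stand-in; substitute the package under test):

```shell
# Per-module import cost in microseconds (written to stderr);
# compare importing the whole package vs. a single TLD module
python3 -X importtime -c "import json" 2>&1 | tail -n 3
```
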

Would you do it directly to master or some new branch? I guess most users aren't impacted by changed to master unless it's published as a package, so maybe it doesn't matter

Master is fine and hopefully no production users are installing from the repo!

https://github.com/weppos/whois-parser/tree/main/lib/whois/parsers

Hadn't come across this. A pity we are re-inventing the wheel in each language!

the format of the WHOIS response doesn't necessarily correlate with the TLD. Most correctly, it correlates with the WHOIS server implementation/name.

Do you have an example? Actually I didn't realize this. If this is true then the server approach makes sense, but would be a huge change.

mzpqnxow commented 2 months ago

What is your interest in the approach you mentioned? Efficiency, from not needing to open all the files?

Yeah was thinking of efficiency - do you have an idea how long the imports would take? Apparently this can be measured with -X importtime: https://stackoverflow.com/a/52090274/105066

Should have done that, but alas, I did not :disappointed:

Would you do it directly to master or some new branch? I guess most users aren't impacted by changed to master unless it's published as a package, so maybe it doesn't matter

Master is fine and hopefully no production users are installing from the repo!

:+1:

https://github.com/weppos/whois-parser/tree/main/lib/whois/parsers

Hadn't come across this. A pity we are re-inventing the wheel in each language!

Indeed, same here

Ideally someone would/would have created some YAML files or some other language-agnostic content to describe the parsing, so that we could all use it by writing a small interpreter for that description in the language of our choice - and let The Community :tm: maintain the parsing data files. I actually started on something like that a long while ago, but it quickly got ugly and I wasn't willing to make the decisions about the tradeoffs required. At some point you have a few choices to make:

  1. Accept that a LOT of redundant/boilerplate content will be required, for servers that have overlapping parsing rules
  2. Accept that the YAML will become slightly more complex if you want to avoid that first problem - you need to effect inheritance within YAML. It can at least be done in a lightweight fashion for the most part, if only one level of inheritance is required and no multiple inheritance - by having a "special" key/value like inherit: base, instructing the engine to load another YAML file first
  3. Defeat the purpose of the entire thing and support the use of custom deserialization methods (terrible idea)
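To illustrate option 2, a sketch of single-level inherit resolution, using plain dicts to stand in for the parsed YAML (the rule names and patterns are invented):

```python
from typing import Dict

# In practice each entry would come from yaml.safe_load() on a file
# in the shared, language-agnostic repository
RULES = {
    "base": {
        "registrant_postal_code": r"Registrant Postal Code: *(.+)",
        "status": r"Status: *(.+)",
    },
    "donuts": {
        "inherit": "base",
        "status": r"Domain Status: *(.+)",  # override one field
    },
}


def resolve(name: str, rules: Dict[str, dict]) -> dict:
    """Resolve a single level of `inherit: <parent>`: start from the
    parent's fields, then apply this rule set's own overrides."""
    spec = dict(rules[name])
    parent = spec.pop("inherit", None)
    if parent is None:
        return spec
    merged = dict(rules[parent])
    merged.pop("inherit", None)
    merged.update(spec)
    return merged
```
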

Someone with good judgement, development skills and some time to dedicate could probably still do this and balance/minimize problems 1 and 2 while avoiding 3, but it's not going to be current-me, sadly

the format of the WHOIS response doesn't necessarily correlate with the TLD. Most correctly, it correlates with the WHOIS server implementation/name.

Do you have an example? Actually I didn't realize this. If this is true then the server approach makes sense, but would be a huge change.

Actually, the best example handy can be found in the aforementioned project: rather than parsing .com and .net with a class specific to each, they just check the final WHOIS server (in case of referrals) and then look up the parser based on that WHOIS server name:

https://github.com/weppos/whois-parser/blob/main/lib/whois/parsers/ccwhois.verisign-grs.com.rb

Or did you mean examples of different TLDs that go back to a common WHOIS server?

richardpenman commented 2 months ago

Or did you mean examples of different TLDs that go back to a common WHOIS server?

Did you mean that 2 domains with the same TLD use different WHOIS servers, and so have different formats? That's what I understood, but maybe you meant something else. If not then I don't think is worth the large restructuring and the .net parser could just inherit from the .com parser.

mzpqnxow commented 1 month ago

Or did you mean examples of different TLDs that go back to a common WHOIS server?

Did you mean that 2 domains with the same TLD use different WHOIS servers, and so have different formats? That's what I understood, but maybe you meant something else. If not then I don't think is worth the large restructuring and the .net parser could just inherit from the .com parser.

That's one example, but the most common one may be with the so-called Generic TLDs (gTLDs)

I don't know if you're familiar with how the gTLDs are different from traditional TLDs- if not, here's all you need to know:

The way gTLDs work from a WHOIS perspective is how all classic TLDs should have worked as well: the owner of the gTLD MUST maintain an A or CNAME record for whois.nic.<gTLD>

For example, the WHOIS server for the .adult domain is guaranteed to be whois.nic.adult. It was a smart requirement and IANA appears to have strictly enforced it, probably by helping to make it easy
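In code, that convention reduces server discovery for these gTLDs to string construction (a sketch; real code would still need to follow referrals):

```python
def gtld_whois_server(tld: str) -> str:
    """New-style gTLD registries must serve WHOIS at whois.nic.<tld>,
    so the server name can be derived instead of kept in a table."""
    return f"whois.nic.{tld.lstrip('.').lower()}"
```
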

Anyway, what this means in practice is that for owners that either:

A. Have multiple gTLDs all going to one WHOIS server

Or..

B. Farm out the actual WHOIS service to a third party

And

C. Accomplish this via a WHOIS protocol level referral (rather than allowing DNS to make such a referral unnecessary, by having an A record or CNAME send a WHOIS client directly to the authoritative source on the first shot)

... have multiple TLDs that ultimately get data from the same WHOIS server, which is known to the WHOIS client, because of the referral

There are also cases where a country owns a gTLD and falls into the above configuration, where the gTLD WHOIS server has a referral to the country's standard WHOIS server

Anyway, it's still a relatively small number of the total. Less than 5% is my estimate based on some data I looked at a few weeks ago. But it wasn't comprehensive

The upside, compared to any real or perceived complexity associated with implementing it, would be potentially handling TLD parsing perfectly, even when there's no explicit support for the TLD (or gTLD)

This is probably not an either/or decision - could be like this:

  1. Check if the TLD has an explicit parser, if so, use that
  2. If not, check if the final "leaf" WHOIS server has an explicit parser, if so, use that
  3. If neither, fall back to the default parsing class (as you do today)
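A sketch of that selection chain (the registry dicts and class names are hypothetical):

```python
from typing import Dict, Optional


class BaseParser:
    """Stand-in for the project's generic fallback parser."""


# Explicit parsers, registered elsewhere
TLD_PARSERS: Dict[str, type] = {}
SERVER_PARSERS: Dict[str, type] = {}


def choose_parser(tld: str, whois_server: Optional[str]) -> type:
    """1. explicit TLD parser, 2. parser for the final ("leaf") WHOIS
    server, 3. the generic fallback."""
    if tld in TLD_PARSERS:
        return TLD_PARSERS[tld]
    if whois_server and whois_server in SERVER_PARSERS:
        return SERVER_PARSERS[whois_server]
    return BaseParser
```
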

Regardless of whether you think the server-based parsing is desirable, it's probably a good idea to invest a little more into the fallback behavior. Maybe make it more comprehensive/robust, by trying several fallback classes/pattern-sets, returning results from the one with the best results, as measured by the most parsed fields

This approach may not be precisely this simple, and there would be a tradeoff with efficiency- for example, how many different pattern sets should be tested? But, the general idea is not crazy or infeasible, I think
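A sketch of the "best results" idea (hypothetical, ignoring the efficiency tradeoff of how many sets to try):

```python
import re
from typing import Dict, List


def best_parse(text: str, pattern_sets: List[Dict[str, str]]) -> Dict[str, str]:
    """Try each candidate pattern set against the response and keep the
    one that extracts the most fields -- a rough 'best effort' score."""
    best: Dict[str, str] = {}
    for patterns in pattern_sets:
        parsed = {}
        for field, pattern in patterns.items():
            m = re.search(pattern, text)
            if m:
                parsed[field] = m.group(1).strip()
        if len(parsed) > len(best):
            best = parsed
    return best
```
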

Either way, I don't think any of these approaches bloat the code too much. Could be wrong about that of course

Last thing, because it just came to mind. I don't necessarily feel too strongly about this

There is currently an insane number of classes that are either exactly the same as others (but in their own file) or exactly the same except for a tiny modification that makes the class either inferior or superior to the one from which it was copied, despite the underlying data actually being the same. It's effectively just drift from the original class that got copied many times, where the original later had some enhancement

It would be great to prevent this from happening - if the data is the same for TLD x, y and z, and the parser for x is improved, they should all be improved

It's related somewhat to the per-server class idea, except it happens without referrals- think of all the TLDs that go directly to GoDaddy, MarkMonitor or Verisign, and that have the exact same underlying data format. I believe this is a LOT, and you can tell by comparing the parser classes (it's a lot simpler once you split them out into files, where you can list by size, and diff from the shell)

I don't know how this would be fixed and prevented- a gate in CI, via tests, a one-time semi-automated review, careful manual inspection of PRs, etc. All of the above? (Assuming you think it's worth tackling - it may not be a substantial enough problem to be worth time, though I'm happy to do the one-time review if you're interested)
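As a starting point for the one-time review, identical pattern sets could be grouped mechanically (a sketch; the sample data is invented):

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_parsers(parsers: Dict[str, Dict[str, str]]) -> List[List[str]]:
    """Group TLDs whose field->regex dicts are identical; each group of
    two or more is a candidate for consolidation into one shared class."""
    groups = defaultdict(list)
    for tld, patterns in parsers.items():
        groups[tuple(sorted(patterns.items()))].append(tld)
    return [sorted(tlds) for tlds in groups.values() if len(tlds) > 1]
```
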

mzpqnxow commented 1 month ago

Also, sorry for the jumbled up mess of a comment- I wanted to get back to you and wrote that up pretty hastily

mzpqnxow commented 1 month ago

I owe you lots more data, but I want to put a few notes here; they don't require responses. I will edit this update in the future, rather than create more noise with new comments :)

I don't suggest any of these are high-priority - they're really just notes for future reference

Not found vs rate-limited

From WhoisCz:

    not_found = "(% No entries found.|Your connection limit exceeded)"

Many registrars have a distinct "you are rate limited" message - it would be nice to have a separate exception for that for all registrars (or at least as many as possible)
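A sketch of what I mean, splitting the WhoisCz pattern above into two cases (the exception names are invented):

```python
import re


class WhoisNotFound(Exception):
    pass


class WhoisRateLimited(Exception):
    pass


# The two alternatives currently fused into WhoisCz's not_found pattern
NOT_FOUND_RE = re.compile(r"% No entries found\.")
RATE_LIMIT_RE = re.compile(r"Your connection limit exceeded")


def check_response(text: str) -> str:
    """Distinguish 'domain does not exist' from 'you are being throttled'
    so callers can back off and retry instead of treating both as absent."""
    if RATE_LIMIT_RE.search(text):
        raise WhoisRateLimited(text)
    if NOT_FOUND_RE.search(text):
        raise WhoisNotFound(text)
    return text
```
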

Observations on consistency / inconsistencies in regex for a given field

In the majority of cases, the value to the left of the colon is what's dynamic - the capturing part is typically the same, with probably some exceptions. But because of all of the incremental development over time, from various parties, there's a lot of reinvention of wheels, with some of said wheels not being quite so round

Example Case: Postal Code / Zip Code

Here's a quick summary (and a tally on the left, by how many times they occur) of how postal code/zip code are being parsed across the various classes in master as of 6/2024:

      1 (?:Administrativ contact\n.*:.*\n.*\n.*\n.*\n.*\n.*\n.*: )(.+)
      1 Administrative Contact Postal Code: *(.+)
      1 (?<=Administrative Contacts:)[\s\W\w]*?postal-code-loc:(.*)
      1 AdminPostal Code\.*: *(\d+)
      1 Admin ZipCode: *(.+)
      1 adm-zipcode: *(.+)
      1 Billing Contact Postal Code: *(.+)
      1 BillingPostal Code\.*: *(\d+)
      1 (?<=Contact)[\s\S]*Postal Code:(.*)
      1 org-zipcode:*(.+)
      1 (?:Owner Contact\n.*:.*\n.*\n.*\n.*\n.*\n.*\n.*: )(.+)
      1 Owner ZipCode: *(.+)
      1 postal code: *(.+)
      1 PostalCode: *(.+)
      1 registrant_contact_postalcode:\s*([^\n\r]+)
      1 RegistrantPostal Code\.*: *(\d+)
      1 Registrant\s*(?:.*\n){4}\s*Postalcode: *(.+)
      1 (?<=Registrant:)[\s\W\w]*?postal-code-loc:(.*)
      1 Registrant Zip Code\s*: *(.+)
      1 Registrar:\s*(?:.*\n){2}\s*(\S*)\s(?:.*)
      1 (?<=Registrar)[\s\S]*?Postal Code:(.*)
      1 (?<=Registrar:)[\s\W\w]*?abuse-postal:\s+(.*)\n
      1 Sponsoring Registrar Postal Code:(.+)
      1 (?:Technical contact\n.*:.*\n.*\n.*\n.*\n.*\n.*\n.*: )(.+)
      1 Technical Contact Postal Code: *(.+)
      1 TechPostal Code\.*: *(\d+)
      1 tec-zipcode: *(.+)
      2 Billing Postal Code: (.*)
      2 Billing Postal Code: *(.+)
      2 Tech Postal Code:(.+)
      3 Admin Postal Code: (.*)
      3 Registrant Postal Code: (.*)
      3 Tech Postal Code: (.*)
      4 Admin Postal Code:(.+)
      5 Admin Postal Code: *(.+)
      5 Registrant Postal Code:(.+)
      5 Tech Postal Code: *(.+)
     12 Registrant Postal Code: *(.+)

Or, reducing that down, accepting that we'll lose a couple of the really weird ones that have multiple colons:

β€Ί grep -Ei '(postal|zip)' parser.py | cut -d '"' -f 4 | sed -e 's|?:|?_|' | cut -d ':' -f 2- | sed -e 's|.*\?(|(|g' | sort | uniq -c | sort -n
      1 (?:.*)            <-- This one may be legit, .nl uses the annoying multiline style
      1 (.*)\n           <-- This does not appear to be necessary at first glance (for .ua)
      1 ([^\n\r]+)    <-- Also probably legit, from .eu using the multi-line style
      4 (\d+)
     15 (.*)
     51 (.+)

Presumably, the intention was for all of them to use (.+), except for the three I called out as weird multi-line ones

I wonder if it might make sense to have, and encourage/require the use of, a canned value for as many of the fields as possible, even when they're as simple as (.+). This would be consistent with how you currently provide EMAIL_REGEX, it makes it easier for contributors to do it right than to do it "wrong", and it makes everything behave the same. As it stands, non-US-style postal codes (which may contain letters) and even the long-form US zip codes (\d+-\d+) will fail to parse for the few outliers using \d+
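As a rough sketch of what those canned values could look like (the names here are hypothetical, not the library's current API - only EMAIL_REGEX exists today):

```python
import re

# Hypothetical shared capture patterns, analogous to the existing EMAIL_REGEX;
# contributors would compose field patterns from these instead of hand-rolling them
CAPTURE = r"(.+)"                  # the common single-line case
CAPTURE_MULTILINE = r"([^\n\r]+)"  # for the multi-line record styles (.eu etc.)

def field(label: str, capture: str = CAPTURE) -> str:
    """Build a 'Label: value' pattern from a label and a canned capture group."""
    return label + r": *" + capture

regex = {
    "registrant_postal_code": field("Registrant Postal Code"),
    "admin_postal_code": field("Admin Postal Code"),
}

# A UK-style postal code fails against the \d+ outliers, but parses fine here
match = re.search(regex["registrant_postal_code"], "Registrant Postal Code: SW1A 1AA\n")
print(match.group(1))  # SW1A 1AA
```

With something like this, a contributor adding a new TLD only picks a label string, and the capture behavior stays uniform across every class.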

...

mzpqnxow commented 1 month ago

Here's a quick stab at a proposed change for the classes

There are some small details that change, it's still largely the scheme you implemented

This is separate from the idea about having a lookup to choose the parser class

What I like most about making the classes look more like this is that it lets new TLD classes skip having to implement any methods - even __init__(). You only need to define the attributes (plus a docstring, if you like) and you're good to go

Here's roughly what I had in mind:

import re
from collections import UserDict
from typing import Any, Dict, Optional

class PywhoisError(Exception):
    def __init__(self, message, text: Optional[str] = None, cls: Any = None):
        super().__init__(message)
        self.text = text
        # Could receive the class, if you want to pull more info out I suppose
        self.cls = cls

class WhoisBase(UserDict):
    domain: str
    text: str
    not_found: str
    regex: Dict[str, str]

    def __init__(self, domain: str, text: str):
        UserDict.__init__(self)
        self.domain = domain
        self.text = text
        if re.search(self.not_found, text):
            raise PywhoisError("Not found!", text=text, cls=type(self))
        self._parse()

    def _parse(self) -> None:
        """Common parser function"""
        # The parsing logic, roughly:
        # for field, pattern in self.regex.items():
        #     match = re.search(pattern, self.text)
        #     self[field] = match.group(1).strip() if match else None

class TLDBase(WhoisBase):
    """Not to be instantiated directly; extend for each TLD"""

    # Only needed for later, to implement lookup via the tld field if we decide to do that
    tld: str

    # Parser defaults; note a subclass regex dict shadows this one, so _parse
    # (or __init_subclass__) would need to merge the two for per-field overrides
    not_found = r"Default not found pattern"
    regex = {
        "domain_name": r"Domain name: *(.+)",
        # ...
    }

# A TLD class can be a little simpler/briefer now
# No code should be needed for most cases, in theory - not even __init__()
class WhoisCom(TLDBase):
    tld = "com"
    # not_found = r"Not found"
    regex = {
        # Same as before, entries can still override the common defaults ...
        "domain_name": r"Some special pattern that overrides the common one",
        "field1": "patt1",
        "field2": "patt2",
        # ...
    }

Some caveats ...

- This might require some minor changes elsewhere
- There is probably a better way to do what I'm trying to accomplish, from a simplicity or flexibility perspective
- I chose what is probably a hacky approach, to avoid some of the issues with multiple inheritance (since the base class inherits from UserDict)

I'm not a classically trained CS person, and I only play a Python developer part-time, so pardon any glaring stupidity :)

mzpqnxow commented 1 month ago

@richardpenman I spent a few hours this morning on a pretty large refactor (or, it felt large - I guess maybe not)

The main focus was simplifying the sub-classes - removing the need for __init__ in each one

The summary of what I did is at the top of the README here: https://github.com/mzpqnxow/whois_/tree/boilerplate-reduction-refactor

I'm not sending a PR - the README will explain why. While it does "work", it's not complete, and I made things worse in some places, especially depending on your taste and preferences. That's OK, it's a work in progress

Happy to collaborate in whatever way works for you, on whatever timetable

I actually did something! :>