meeb / whoisit

A Python library to RDAP WHOIS-like services for internet resources such as ASNs, IPs, CIDRs and domains
BSD 3-Clause "New" or "Revised" License

domain entity: handle requirement #5

Closed bpereto closed 2 years ago

bpereto commented 2 years ago

Hi,

I saw your library today and was amazed. Cool work, also for the possibility to override the bootstrap (nic.ch is also not yet submitted to IANA ;-) ).

I see a problem with the requirement of a handle in the domain parsing: https://github.com/meeb/whoisit/blob/80dc56346d99565d4656a240b253d381db5be48b/whoisit/parser.py#L167

The rdap response profile defines this for domains: https://www.icann.org/en/system/files/files/rdap-response-profile-15feb19-en.pdf

Section 3.2 - for registries

Contacts (Admin, Technical) - The RDAP response SHOULD contain at least two entities, with the administrative and technical roles respectively within the entity with the registrar role. The entities with the administrative and technical roles MUST contain valid fn, tel, email members, and MAY contain a handle and a valid adr element

So entities MAY contain a handle for the admin and technical roles (it's a MUST for the registrar, but not for these two). Can we remove the enforcement of a handle in extract_entities?

meeb commented 2 years ago

Thanks for the comments!

Regarding the requirement for a handle on entities, this was for a number of reasons when I was testing the library with various RDAP servers:

  1. A surprising number of RDAP servers are not fully compliant with the specification and return weird formats
  2. The handle is most commonly used to look up the registrant to find details on the entity. For example, you look up an allocation which returns entities; to get any information on the registrant you ideally then query the registrant entity, and this requires the handle for the API call, as entity RDAP queries are all referenced by handle

That limitation can be removed, however, results were "more useful" with it in place as almost all of the records that were caught by the "do you have a handle" check were junk or pointless (empty fields, results with just notes in, placeholders etc.). It's not totally accurate as per the spec as you've noticed, but it seemed to be more real world useful when attempting to actually query objects.

I could make it a config flag / opt-in param as well if that would be suitable. Can you give a use case where this check might filter out a record that shouldn't be filtered out? You can always just use raw=True as well and bypass the whoisit parser entirely.
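A minimal sketch of what such an opt-in check could look like. This is not the whoisit parser: the function name `group_entities_by_role`, the `require_handle` parameter and the sample entities (including the handle `EXAMPLE-1`) are all hypothetical and only illustrate the idea.

```python
# Hypothetical illustration only; not part of whoisit.
def group_entities_by_role(raw_entities, require_handle=False):
    """Group raw RDAP entity dicts by role, optionally skipping entities with no handle."""
    grouped = {}
    for entity in raw_entities:
        handle = str(entity.get('handle', '')).strip()
        if require_handle and not handle:
            # Strict mode: drop entities that could not be queried by handle later
            continue
        for role in entity.get('roles', []):
            grouped.setdefault(role, []).append(entity)
    return grouped


# Made-up sample data: one entity without a handle, one with a handle
sample_entities = [
    {'objectClassName': 'entity', 'roles': ['registrar']},
    {'objectClassName': 'entity', 'roles': ['technical'], 'handle': 'EXAMPLE-1'},
]

strict = group_entities_by_role(sample_entities, require_handle=True)
lenient = group_entities_by_role(sample_entities)
print(sorted(strict))
print(sorted(lenient))
```

With the strict flag only the entity carrying a handle survives; without it both roles are kept, which is the behaviour the spec allows for roles where the handle is optional.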

Feel free to submit a PR to add the .ch RDAP endpoint, the format is pretty straightforward in https://github.com/meeb/whoisit/blob/main/whoisit/overrides.py

bpereto commented 2 years ago

I experienced the same: not all RDAP providers are compliant with the minimal rdap_level_0, and they do not follow the spec.

True, the parsing with raw=True is a workaround but misses the point of a handy summary of the roles.

I'm currently re-reading the specification and I come to the conclusion that the RDAP from TLD-CH does not quite follow the spec, as the registrar role contact entity does not contain a handle.

In response to registrar queries, the returned RDAP response MUST be an entity with registrar role, with a handle and valid elements fn, adr, tel, email

With the current code you get, in the best case, a registrar, tech and admin contact entity or, in the worst case, nothing :)

Here is an example:

>>> import whoisit
>>> whoisit.overrides.iana_overrides['domain'].update({'ch': ['https://rdap.nic.ch/']})
>>> whoisit.bootstrap(overrides=True)
True
>>> response = whoisit.domain('test.ch')
>>> import pprint
>>> pprint.pprint(response)
{'copyright_notice': '',
 'description': [],
 'entities': {},
 'expiration_date': None,
 'handle': 'TEST.CH',
 'last_changed_date': None,
 'name': 'test.ch',
 'nameservers': ['ns1.cyon.ch', 'ns2.cyon.ch'],
 'parent_handle': '',
 'registration_date': datetime.datetime(1996, 11, 7, 0, 0),
 'rir': '',
 'status': ['active'],
 'terms_of_service_url': '',
 'type': 'domain',
 'url': '',
 'whois_server': ''}
>>> response = whoisit.domain('test.ch', raw=True)
>>> pprint.pprint(response)
{'entities': [{'objectClassName': 'entity',
               'roles': ['registrar'],
               'url': 'https://www.kreativmedia.ch',
               'vcardArray': ['vcard',
                              [['version', {}, 'text', '4.0'],
                               ['org', {}, 'text', 'Kreativ Media GmbH'],
                               ['adr',
                                {},
                                'text',
                                ['',
                                 '',
                                 'Höschgasse 45',
                                 'Zürich',
                                 '',
                                 '8008',
                                 'CH']],
                               ['kind', {}, 'text', 'group']]]}],
 'events': [{'eventAction': 'registration', 'eventDate': '1996-11-07'}],
 'handle': 'test.ch',
 'ldhName': 'test.ch',
 'nameservers': [{'ipAddresses': {'v4': ['194.126.200.5'],
                                  'v6': ['2a01:ab20::2']},
                  'ldhName': 'ns1.cyon.ch',
                  'objectClassName': 'nameserver'},
                 {'ipAddresses': {'v4': ['91.206.24.2'],
                                  'v6': ['2001:67c:234::2']},
                  'ldhName': 'ns2.cyon.ch',
                  'objectClassName': 'nameserver'}],
 'notices': [{'description': ['This information is subject to an Acceptable '
                              'Use Policy.'],
              'links': [{'href': 'https://www.nic.ch/terms/aup/',
                         'rel': 'alternate',
                         'type': 'text/html'}],
              'title': 'Acceptable Use Policy (AUP)'}],
 'objectClassName': 'domain',
 'rdapConformance': ['rdap_level_0'],
 'secureDNS': {'delegationSigned': False},
 'status': ['active'],
 'switch_name': 'test.ch'}

In this example you see 'entities': {}: entities is empty, although the raw response contains an entity with the role registrar. This is due to Swiss law: since 1 January 2021, personal data associated with registered domain names is no longer disclosed. Information about holders of domain names can only be obtained in exceptional cases.

Thanks for the discussion. I will probably stick to parsing the raw data and ping the TLD about including the handle :)
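Parsing the raw data into a summary of roles could look roughly like the sketch below. It assumes the jCard layout seen in the raw response above (`['vcard', [[name, params, type, value], ...]]`); the helper names `vcard_to_dict` and `summarise_roles` are made up for illustration and are not part of whoisit. The sample dict is a trimmed copy of the raw test.ch response.

```python
def vcard_to_dict(vcard_array):
    """Flatten a jCard ['vcard', [[name, params, type, value], ...]] into a dict."""
    fields = {}
    if not isinstance(vcard_array, list) or len(vcard_array) < 2:
        return fields
    for prop in vcard_array[1]:
        if len(prop) < 4:
            continue  # Skip malformed properties
        name, value = prop[0], prop[3]
        if name == 'adr' and isinstance(value, list):
            # Join the non-empty address components into one string
            value = ', '.join(str(part) for part in value if part)
        fields[name] = value
    return fields


def summarise_roles(raw):
    """Group entities by role without requiring a handle."""
    summary = {}
    for entity in raw.get('entities', []):
        card = vcard_to_dict(entity.get('vcardArray'))
        details = {
            'handle': entity.get('handle', ''),
            'name': card.get('fn') or card.get('org', ''),
            'address': card.get('adr', ''),
            'url': entity.get('url', ''),
        }
        for role in entity.get('roles', []):
            summary.setdefault(role, []).append(details)
    return summary


# Trimmed copy of the raw response shown above
raw = {
    'entities': [{
        'objectClassName': 'entity',
        'roles': ['registrar'],
        'url': 'https://www.kreativmedia.ch',
        'vcardArray': ['vcard', [
            ['version', {}, 'text', '4.0'],
            ['org', {}, 'text', 'Kreativ Media GmbH'],
            ['adr', {}, 'text', ['', '', 'Höschgasse 45', 'Zürich', '', '8008', 'CH']],
            ['kind', {}, 'text', 'group'],
        ]],
    }],
}

summary = summarise_roles(raw)
print(summary)
```

The name falls back from fn to org because the registrar entity in this response carries only an org property.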

meeb commented 2 years ago

Thanks for the example. I've just released v2.4.2 which you can upgrade to now. This includes the following commits:

https://github.com/meeb/whoisit/compare/v2.4.1...v2.4.2

The behaviour now is:

>>> from pprint import pprint
>>> import whoisit
>>> whoisit.bootstrap(overrides=True)
True
>>> results = whoisit.domain('test.ch')
>>> pprint(results)
{'copyright_notice': '',
 'description': [],
 'entities': {'registrar': [{'name': 'Kreativ Media GmbH',
                             'type': 'entity',
                             'url': 'https://www.kreativmedia.ch'}]},
 'expiration_date': None,
 'handle': 'TEST.CH',
 'last_changed_date': None,
 'name': 'test.ch',
 'nameservers': ['ns1.cyon.ch', 'ns2.cyon.ch'],
 'parent_handle': '',
 'registration_date': datetime.datetime(1996, 11, 7, 0, 0),
 'rir': '',
 'status': ['active'],
 'terms_of_service_url': '',
 'type': 'domain',
 'url': '',
 'whois_server': ''}

I re-ran some checks and the handle check was OK to remove. It was added quite early on, and other checks to remove junk results were added afterwards, so it wasn't doing a great deal other than filtering out non-spec-compliant entities.

bpereto commented 2 years ago

Thank you.

Just to clarify what I learned from my research:

I conclude that all gTLDs should have a well-defined RDAP response profile conforming to icann_rdap_response_profile_0.

All other TLDs, in most cases the ccTLD implementations, have no requirements on what they must or should return, only on which properties/members are available and can be used.
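One way to act on this distinction, as a rough sketch: check whether a raw response declares the ICANN profile in its rdapConformance list before expecting handles on entities. The helper name `claims_icann_profile` is hypothetical, and the gTLD-style sample below is made up; only the rdap_level_0 value comes from the test.ch response above.

```python
ICANN_PROFILE = 'icann_rdap_response_profile_0'


def claims_icann_profile(raw):
    """Return True if the raw RDAP response declares the ICANN response profile."""
    conformance = raw.get('rdapConformance') or []
    return ICANN_PROFILE in conformance


# Conformance list as returned for test.ch above
cctld_response = {'rdapConformance': ['rdap_level_0']}
# Made-up example of a response that declares the ICANN profile
gtld_response = {'rdapConformance': ['rdap_level_0', ICANN_PROFILE]}

print(claims_icann_profile(cctld_response))
print(claims_icann_profile(gtld_response))
```

A declared profile is only a claim by the server, so a parser would still need to tolerate missing fields either way.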

meeb commented 2 years ago

Thanks, that's generally what I'd discovered as well. The whoisit parser likely won't ever be fully compliant given it has to handle some potentially invalid upstream responses. Feel free to report any other issues if you find any with data being over or under extracted.