meeb / whoisit

A Python library to RDAP WHOIS-like services for internet resources such as ASNs, IPs, CIDRs and domains
BSD 3-Clause "New" or "Revised" License

IP Contact address field isn't parsed. #43

Closed jwahsnakupaku closed 1 week ago

jwahsnakupaku commented 1 week ago

Hi,

IP Address contacts don't appear to be getting parsed. Looks like they aren't in the expected fields? Not sure if this is common to all IP records or just the 10 or so I've looked at.

e.g. for 8.8.8.8 - https://rdap.arin.net/registry/entity/ABUSE5250-ARIN

>>> import whoisit
>>> whoisit.version
'3.0.4'
>>> whoisit.bootstrap()
>>> r = whoisit.get('8.8.8.8')
>>> r.get('entities').get('abuse')[0].get('address')
{'po_box': '', 'ext_address': '', 'street_address': '', 'locality': '', 'region': '', 'postal_code': '', 'country': ''}

>>> raw = whoisit.get('8.8.8.8', raw=True)
>>> raw.get('entities')[0].get('entities')[0].get('roles')
['abuse']
>>> raw.get('entities')[0].get('entities')[0].get('vcardArray')[1][1]
['adr', {'label': '1600 Amphitheatre Parkway\nMountain View\nCA\n94043\nUnited States'}, 'text', ['', '', '', '', '', '', '']]

One option might be to check whether the address built from the expected field is empty and, if so, fall back to parsing the label in entry_data, like so:

# Split the free-form label into lines; fields are filled from the end
splits = entry_data.get('label', '').split('\n')
# Left-pad to 7 parts so the negative indexes below are always valid
splits = [''] * (7 - len(splits)) + splits
v_card_array_data_dict['address'] = VCardArrayAddressDataDict(
    po_box=clean_address(splits[-7]),
    ext_address=clean_address(splits[-6]),
    street_address=clean_address(splits[-5]),
    locality=clean_address(splits[-4]),
    region=clean_address(splits[-3]),
    postal_code=clean_address(splits[-2]),
    country=clean_address(splits[-1])
)

Could be a bit dodgy as \n separated data might vary?

meeb commented 1 week ago

Yeah, this is because in your example the address is just stuffed into a label string, which is pretty much impossible to parse reliably, rather than being split out properly into the vcard fields. Different localities have different address formats. For example, these are the adr element labels for 1.1.1.1 (all of these are returned from whoisit.ip('1.1.1.1', raw=True)):

'6 Cordelia St South Brisbane QLD 4101'
'6 Cordelia St'
'PO Box 3646\nSouth Brisbane, QLD 4101\nAustralia'

Basically, trying to parse these into a sensible street, locality, postcode etc. is near impossible, so the library doesn't bother attempting it.
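
To make the point concrete, here is a quick sketch (not part of whoisit) that splits the labels quoted in this thread on newlines. The variable names are made up for illustration. The number of parts differs per label, so a fixed position-to-field mapping cannot work for all of them.

```python
# adr labels quoted in this thread (three for 1.1.1.1, one for 8.8.8.8)
labels = [
    "6 Cordelia St South Brisbane QLD 4101",
    "6 Cordelia St",
    "PO Box 3646\nSouth Brisbane, QLD 4101\nAustralia",
    "1600 Amphitheatre Parkway\nMountain View\nCA\n94043\nUnited States",
]

# Count how many newline-separated parts each label contains
part_counts = [len(label.split("\n")) for label in labels]

for label, count in zip(labels, part_counts):
    print(count, label.split("\n"))
```

The first label packs street, locality, region and postcode into a single line, while the last spreads them over five, so the same index means a different field in each.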

With IP lookups if you need the address your best bet is probably to just use raw=True and look for adr labels yourself manually unfortunately.
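
A minimal sketch of that manual approach, assuming the raw response has the shape shown earlier in this thread (nested "entities", each with "roles" and a "vcardArray" whose adr property carries a "label" parameter). The helper name collect_adr_labels and the sample data are hypothetical; in particular the outer "registrant" role is only a placeholder, and real responses may differ.

```python
def collect_adr_labels(node, found=None):
    """Walk a raw RDAP response and group 'adr' labels by entity role."""
    if found is None:
        found = {}
    for entity in node.get("entities", []):
        vcard = entity.get("vcardArray", [])
        # A jCard looks like ["vcard", [property, ...]]; each property is
        # [name, parameters, value_type, value].
        properties = vcard[1] if len(vcard) > 1 else []
        for prop in properties:
            if len(prop) >= 2 and prop[0] == "adr" and isinstance(prop[1], dict):
                label = prop[1].get("label")
                if label:
                    for role in entity.get("roles", ["unknown"]):
                        found.setdefault(role, []).append(label)
        # Entities can be nested, so recurse into child entities too.
        collect_adr_labels(entity, found)
    return found


# Hand-made stand-in for whoisit.get('8.8.8.8', raw=True), trimmed down
sample_raw = {
    "entities": [
        {
            "roles": ["registrant"],  # placeholder role for the outer entity
            "vcardArray": ["vcard", [["version", {}, "text", "4.0"]]],
            "entities": [
                {
                    "roles": ["abuse"],
                    "vcardArray": [
                        "vcard",
                        [
                            ["version", {}, "text", "4.0"],
                            [
                                "adr",
                                {"label": "1600 Amphitheatre Parkway\nMountain View\nCA\n94043\nUnited States"},
                                "text",
                                ["", "", "", "", "", "", ""],
                            ],
                        ],
                    ],
                }
            ],
        }
    ]
}

labels = collect_adr_labels(sample_raw)
print(labels)
```

With a live lookup, the same function could be called on the result of whoisit.ip(ip, raw=True) instead of sample_raw.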

insignia96 commented 1 week ago

Would it be difficult to include the raw, unparsed label alongside the structured data?

meeb commented 1 week ago

No, but that would go against what the parsed output was meant to do in the first place and what raw=True is for. If you need the raw data use that.

insignia96 commented 1 week ago

Makes sense to me. It's unfortunate that the RIR implementation is so lacking here, I have to imagine they could at least try to map some of these fields properly into the vcard.

meeb commented 1 week ago

Yeah, it would be nice if the data was all parsed and segregated correctly with the correct labels. My irritation with the weird formats used by different RDAP endpoints is why the parser module in whoisit is as expansive as it is already.

Are you OK if I close this or would you like to raise anything else?

jwahsnakupaku commented 1 week ago

Nah, close it off. I'll just grab the raw data, parse it manually and shove that address string in somewhere.

ip='8.8.8.8'
raw = whoisit.ip(ip, raw=True)
praw = whoisit.parse(whoisit._bootstrap, 'ip', ip, raw)

Cheers for your help.
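
The "shove that address string in somewhere" step could look roughly like the sketch below. It assumes the parsed output keeps contacts under entities[role] as in the 8.8.8.8 example earlier in this thread, and that labels have already been pulled out of the raw response into a role-to-labels dict. The function name, the "address_label" key and the matching of contacts to labels by position are all assumptions for illustration, not whoisit behaviour.

```python
def attach_address_labels(parsed, labels_by_role, key="address_label"):
    """Copy free-form address labels onto parsed contacts, matched by role and position."""
    for role, contacts in parsed.get("entities", {}).items():
        labels = labels_by_role.get(role, [])
        # zip() stops at the shorter list, so contacts without a label are left untouched.
        for contact, label in zip(contacts, labels):
            contact[key] = label
    return parsed


# Hand-made stand-ins shaped like the outputs shown earlier in this thread
parsed = {
    "entities": {
        "abuse": [
            {
                "address": {
                    "po_box": "", "ext_address": "", "street_address": "",
                    "locality": "", "region": "", "postal_code": "", "country": "",
                }
            }
        ]
    }
}
labels_by_role = {
    "abuse": ["1600 Amphitheatre Parkway\nMountain View\nCA\n94043\nUnited States"]
}

result = attach_address_labels(parsed, labels_by_role)
print(result["entities"]["abuse"][0]["address_label"])
```

Keeping the label under its own key leaves the structured address fields exactly as the parser produced them.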

meeb commented 1 week ago

No problem. If RDAP endpoints ever do start returning addresses in a sane format it'll get added to the whoisit parser.