google / libaddressinput

Google’s postal address library, powering Android and Chromium
Apache License 2.0
580 stars 104 forks source link

Bahrain (BH) postcode regex incorrect #207

Open BenWalters opened 3 years ago

BenWalters commented 3 years ago

Hi,

I believe the postcode regex for Bahrain to be incorrect. https://www.gstatic.com/chrome/autofill/libaddressinput/chromium-i18n/ssl-aggregate-address/data/BH

It is currently (?:\d|1[0-2])\d{2} however I've found that postcodes are:

Valid post code numbers are 101 to 1216 with gaps in the range. Known as block number (Arabic: رقم المجمع‎) formally. The first digit in NNN format and the first two digits in NNNN format refer to one of the 12 municipalities of the country. PO Box address doesn't need a block number or city name, just the PO Box number followed by the name of the country, Bahrain.

And https://en.youbianku.com/Bahrain

roelofr commented 3 years ago

I made a quick script to find all numbers on the page, the gist is here.

The table on https://en.youbianku.com/Bahrain does seem to match this 1-12 range, so the following regexp would suffice:

(?:\d|1[0-2])\d{2}
BenWalters commented 3 years ago

@roelofr When I initially looked at the RegEx I too thought it would be okay but I had a play around and it seems to be ineffective for postcodes greater than 999: https://regex101.com/r/39cZU1/1/

I believe ^(1[0-1][0-9]{2}|12[0-1][0-6]|[1-9][0-9]{2}|10[1-9])$ might be a better solution: https://regex101.com/r/9nASya/1

roelofr commented 3 years ago

I believe ^(1[0-1][0-9]{2}|12[0-1][0-6]|[1-9][0-9]{2}|10[1-9])$ might be a better solution

This is very expressive for a regex, but mostly, it seems the Google API doesn't return start or end markers, possibly to allow finding postal codes in longer strings.

The pattern I posted matches 2103 matches 103, and ignores the 2. How could I miss that, thanks for the correction :+1:

So, to fix this, we probably need to check for word markers or EOLs. Since the last two digits are always required, I tried keeping them out of the alternate match.

So, my v2 would be:

(?:^|\b)(?:1[0-2]|[1-9])\d{2}(?:$|\b)

I ran it again your pattern, and added a 012, which should be invalid too. Results are here.

Small edit: made all groups non-capturing groups.

BenWalters commented 3 years ago

(?:^|\b)(?:1[0-2]|[1-9])\d{2}(?:$|\b)

Looks good to me! I'd be more than happy with that being implemented.

BenWalters commented 3 years ago

@sockix @sebsg are you able to assist with this change?

sockix commented 3 years ago

Hi Ben. Yep, I will take a look and see if I'm able to make the change.