ideal-postcodes / postcode

UK Postcode parsing and helper methods
https://postcodejs.ideal-postcodes.dev/
MIT License

Enhancing fix() #488

Open anirudhgangwal opened 1 year ago

anirudhgangwal commented 1 year ago

I am implementing a Python version of the library for my own use case: https://github.com/anirudhgangwal/ukpostcodes. The library mimics the functionality available here, including lookup against the ONS database (but instead of a DB or an API such as postcodes.io, I just keep a set of ~1.8M postcodes).

We parse postcodes from OCR output, and "O"/"0" and "I"/"1" confusions account for almost all of our errors. The fix implemented here helped reduce our error rate significantly. However, I want to understand whether there was a reason not to expand this auto-correct further.

Let's take the example of a 3-character outcode. A postcode with one can take the following forms: A9A 9AA, A99 9AA, AA9 9AA.

Since the second and third characters can each be either a letter or a number, this library currently only coerces the first character, using the pattern "L??".
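
To make the current behaviour concrete, here is a minimal sketch of pattern-based coercion. It assumes "L" forces a letter, "N" forces a number, and "?" leaves the character untouched; the character maps and the function body are assumptions for illustration, not the library's actual source.

```python
# Assumed maps for the common OCR confusions; the real tables may differ.
TO_LETTER = {"0": "O", "1": "I"}
TO_NUMBER = {"O": "0", "I": "1"}


def coerce(pattern: str, text: str) -> str:
    """Coerce each character of text according to the matching pattern character.

    'L' forces a letter, 'N' forces a number, anything else leaves it alone.
    Assumes pattern and text have the same length.
    """
    out = []
    for kind, ch in zip(pattern, text):
        if kind == "L":
            out.append(TO_LETTER.get(ch, ch))
        elif kind == "N":
            out.append(TO_NUMBER.get(ch, ch))
        else:
            out.append(ch)
    return "".join(out)


unchanged = coerce("L??", "OOO")  # only the first character is constrained
forced = coerce("LNN", "OOO")     # second and third characters forced to numbers
```

With "L??" the misread "OOO" comes back unchanged, whereas a fully specified pattern such as "LNN" turns it into "O00".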

I think it would be possible to add a new function, or a parameter to the existing function, that returns a list of candidates. E.g.

fix("OOO 4SS") => ["O00 4SS", "OO0 4SS", "O0O 4SS"] # try LNN, LLN, and LNL

A quick Python implementation looked like this:

import re
from typing import List

def fix_with_options(s: str) -> List[str]:
    """Attempts to fix a given postcode, covering all options.

    Args:
        s (str): The postcode to fix
    Returns:
        List[str]: The candidate fixed postcodes
    """
    if not FIXABLE_REGEX.match(s):
        # Not fixable: return the input unchanged, but still as a list
        return [s]
    # str.replace() does not accept a regex, so strip whitespace with re.sub()
    s = re.sub(r"\s+", "", s.upper())
    inward = s[-3:]
    outward = s[:-3]
    outcode_options = coerce_outcode_with_options(outward)
    return [
        f"{coerce_outcode(option)} {coerce_incode(inward)}"
        for option in outcode_options
    ]

def coerce_outcode_with_options(i: str) -> List[str]:
    """Coerce outcode, but cover all possibilities"""
    if len(i) == 2:
        # Two-character outcodes have a single form: letter, number
        return [coerce("LN", i)]
    elif len(i) == 3:
        outcodes = []
        if is_valid_outcode(outcode := coerce("LNL", i)):
            outcodes.append(outcode)
        if is_valid_outcode(outcode := coerce("LNN", i)):
            outcodes.append(outcode)
        if is_valid_outcode(outcode := coerce("LLN", i)):
            outcodes.append(outcode)
        # Drop duplicates while keeping a deterministic order
        return list(dict.fromkeys(outcodes))
    elif len(i) == 4:
        outcodes = []
        if is_valid_outcode(outcode := coerce("LLNL", i)):
            outcodes.append(outcode)
        if is_valid_outcode(outcode := coerce("LLNN", i)):
            outcodes.append(outcode)
        return list(dict.fromkeys(outcodes))
    else:
        # Unexpected length: leave the outcode untouched
        return [i]

This reduced our error rate further (significantly, as most of the remaining errors came from misreading 0). Note that this made sense for our use case because, after checking the candidates against the ONS directory, there were negligible false positives.
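
The directory check described above can be sketched roughly as follows. The helper name `filter_known` and the toy `known` set are hypothetical, introduced only for illustration; the real set would hold the ~1.8M postcodes, and the entry below is a stand-in, not a claim about real directory contents.

```python
from typing import Iterable, List, Set


def filter_known(candidates: Iterable[str], known_postcodes: Set[str]) -> List[str]:
    """Keep only candidates found in known_postcodes, preserving order.

    known_postcodes is assumed to store postcodes upper-cased without spaces.
    """
    seen = set()
    kept = []
    for candidate in candidates:
        key = candidate.replace(" ", "").upper()
        if key in known_postcodes and key not in seen:
            seen.add(key)
            kept.append(candidate)
    return kept


# Toy stand-in for the full postcode set (illustrative entry only).
known = {"O004SS"}
matches = filter_known(["O00 4SS", "OO0 4SS", "O0O 4SS"], known)
```

If exactly one candidate survives the lookup, it can be accepted as the fix; zero or several survivors would need separate handling.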

cblanc commented 1 year ago

Thanks, we'll take a look. CC'ing @mfilip

mfilip commented 1 year ago

Hey @anirudhgangwal, it is a nice approach, but to implement it in our lib we would need to break our interface pattern to return an array of possible fixes, which is not the intent of this simple lib. We see possible use cases for an array, but this lib is intended to just fix numeric mistakes and return a generally valid postcode.

A9A 9AA A99 9AA AA9 9AA

All of those are valid postcodes in their construction. Our lib only tries to fix inputs that do not match these forms, so the pattern L?? is sufficient to cover all of them. If your intent is to check the result against a DB afterwards, your version will give you fewer errors and additional possible fixes, which is great!
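
As a rough illustration of why a format check alone cannot choose between the candidates, the sketch below runs the three suggested fixes through a simplified structural regex. The regex and the helper `looks_valid` are assumptions made for this example; they are not the library's validation rules and ignore restrictions on which letters may appear in each position.

```python
import re

# Simplified structural check, assumed for illustration only.
STRUCTURE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$")


def looks_valid(postcode: str) -> bool:
    """Return True if postcode matches the simplified structure above."""
    return STRUCTURE.match(postcode) is not None


candidates = ["O00 4SS", "OO0 4SS", "O0O 4SS"]
structurally_valid = [c for c in candidates if looks_valid(c)]
```

All three candidates pass this check while the raw "OOO 4SS" does not, so picking one of them needs an external source such as a postcode database.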