uk-fci / nhs-number

Python package to provide utilities for NHS Numbers, including validity checks, normalisation, and generation.
https://nhs-number.uk-fci.tech/
MIT License
4 stars 6 forks source link

New features #10

Closed pacharanero closed 1 year ago

pacharanero commented 1 year ago

Closes #7 NHS Number Generation Closes #5 'Geovalidation' Closes #6 Explanation of why a given NHS number is invalid

This PR implements:

andylaw commented 1 year ago

Quick suggestion relating to the regions, since England-Wales-IOM have multiple Blocks as valid ranges.

Instinctively I would refactor the Region class to have a list of 1 or more blocks, then put the check for a given number being valid by range for a given region into the Region class rather than in the is_valid function.

for block in blocks:
    if supplied_number >= block.start and supplied_number <= block.end:
        return True
return False
andylaw commented 1 year ago

Plus the England-Wales-IOM Region needs a tag of 'england'

pacharanero commented 1 year ago

Plus the England-Wales-IOM Region needs a tag of 'england'

Done this just now

Instinctively I would refactor the Region class to have a list of 1 or more blocks, then put the check for a given number being valid by range for a given region into the Region class rather than in the is_valid function.

I'll consider this for a future addition, you're right it's a pain and unintuitive to have to do two separate checks to ensure your NHS number is valid in England, Wales, and IOM (ie you would have to run it through both REGIONs) It would be better if the complexity of having two ranges was abstracted away from the user.

andylaw commented 1 year ago

Hmm. Not sure that I like the generate() function.

I'd argue that the default usage of generate() should be to produce a randomised set of numbers I'd also argue that there could be a perfectly valid use case whereby a user specified a sub-range of valid numbers i.e. specified a Region and a start and end, but that's not possible with this implementation If one function argument overrides other suppled function arguments then we should warn()

What is your use case for generating sequential numbers?

eatyourpeas commented 1 year ago

I don't feel so strongly about the sequential numbers thing, and doubt people will care. If they are valid, they are valid.

My suggestion originally had a single function generating one random number, then a second function which called the first where the user prescribed how many numbers they wanted. Within this I imagine you could prescribe ranges and tag to regions along the lines suggested above.

The other thing I found running timeit was that it was quicker to generate the first 9 random digits then use these to calculate the checksum. Randomly generating 10 digit numbers and hoping they would pass the validate() method before adding to the list was slower.

Also, I found that once generated adding each number to a Set (rather than a list) I did not have to check for uniqueness which could be converted easily to a list before passing back. This also speeded things up a lot otherwise you had to have a way of checking for duplicates. I think in the end I was generating 10,000 numbers in 1 or 2 seconds on my 2020 i7 Mac.

Not sure how to take this forward. Maybe @andylaw if you want this to be a particular way, perhaps you could refactor and I could review (with Marcus' permission)? Points raised to me are all good - it feels to me that we need a function that out of the box is simple and generates lots of numbers quickly, but has the ability to prescribe regions/randomness for the specialised user: that looks to me the intent you have built this with @pacharanero and otherwise to me looks pretty concise and readable.

andylaw commented 1 year ago

The other thing I found running timeit was that it was quicker to generate the first 9 random digits then use these to calculate the checksum. Randomly generating 10 digit numbers and hoping they would pass the validate() method before adding to the list was slower.

I did not spot that but you are obviously correct. One in eleven 9-digit random numbers will fail to generate a valid checksum but ten out of eleven 10-digit numbers will be invalid. The method needs to be changed to match your algorithm

andylaw commented 1 year ago

I've spent the evening hacking on what was there to make it more how my head works. The fruits of that possibly pointless exercise are in the branch feature/refactor-generate. It's a big lump because I needed to add some extra bits before I could implement some refactoring.

I created a Range object, and refactored the Regions to contain a list of ranges. That allows us to treat England, Wales and IOM as a single Region and have multiple Ranges inside. Equally it condenses the Unallocated ranges into a single pseudo Region. I refactored the generate() function to not have start and finish and to always generate random numbers in the range that we specify. I used the algorithm that @eatyourpeas suggested with making the 9 digit number and then calculating the checksum. When invalid numbers are requested, I simply use the correct checksum as a foil for selecting a random digit from the range [0..9]. Each time around, I pick a random Range from the supplied region. That's not quite random since it relies on all the Ranges being of equal size if we were to be truly representative across the full set of possibilities. We can tweak that later if we decide this is a route worth following.

I have thrown away the sequential numbers code and the start and end limits to generate. If you have an actual use case for them, then I apologise and we can put them back but absent a use case that is unnecessary code.

Now I'm going to bed.

andylaw commented 1 year ago

There is an off-by-one error, possibly off-by-two, in the generate function I think (on reflection). I'll leave that to fester overnight

eatyourpeas commented 1 year ago

I have not tested this but your problem may be here:

candidate = randint(this_range.start, this_range.end) // 10
candidate_str = str(candidate).zfill(9)

I think maybe candidate only needs the // 10 if it is more than 9 digits long, to remove the final digit. If it is exactly 9 digits, it does not need zfill either. If your candidate without removing digits is less than 9, that is when you need the zfill to top it up to 9.

So more like:

candidate = randint(this_range.start, this_range.end)
candidate_length = len(str(candidate))
if candidate_length>9:
    candidate_str = str(candidate // 10) # though could it be more than 10 digits?
elif candidate_length < 9:
    candidate_str = str(candidate).zfill(9)
else:
     candidate_str=str(candidate)
    ....calculate checksum from candidate
andylaw commented 1 year ago

Now I've thought about it a bit more, I think it's only an off-by-one-tenth error. I always get tripped up by the Python range not including the end value.

My logic with the generate function is that the number can be anywhere in the range from start to end which are 10-digit numbers because the definition of an NHS number says that they are. So we generate a number in that range. However, the last digit of the 10 is a checksum. Therefore the true range of the core NHS number embedded within the 10-digit is one tenth of the 10-digit number range, and 90% of the generated numbers will be invalid because their checksum digit is wrong.

So, throwing away the last digit by doing an integer division by 10 should give us an even spread across the potential core NHS number range for a given region-range. Then calculate the checksum for the now 9-digit number and there you have a valid NHS number. That should be a lot more efficient than using the 10-digit number and throwing 90% of them away, as you pointed out @eatyourpeas

pacharanero commented 1 year ago

There is an off-by-one error, possibly off-by-two, in the generate function I think (on reflection).

The tests are all passing for me and a quick usage of the generate() function in interactive Python also seems to work. Would you clarify where you think the error is?

Other than that I'm happy to just merge the changes - the use case for the generate() function is really only for testing and as long as we can generate some NHS numbers which are valid and invalid for a given region, which we can here, then I'm happy. The sequential numbers are not really needed unless someone asks for them in the future and gives a compelling use-case.

eatyourpeas commented 1 year ago

Sounds good to me. One other question I had actually. The params for the generate function insist that the user create a Region object if they want numbers for a specific region - might we not make it easier for them if we ask them only to supply the name of the region they want and use use the Region and Range classes in the background to generate the numbers? Along the lines of:

def generate(
    valid: bool = True,
    for_region: Literal[
        "UNRESERVED_1",
        "SCOTLAND",
        "UNRESERVED_2",
        "NORTHERN_IRELAND",
        "ENGLAND_WALES_IOM_1",
        "NOT_ISSUED",
        "ENGLAND_WALES_IOM_2",
        "EIRE",
        "UNRESERVED_3",
        "NOT_ISSUED_SYNTHETIC",
    ] = None,
    quantity: int = 1,
) -> list:

and then...

if for_region:
        # if isinstance(for_region, Region):
        if for_region in REGIONS:
            selected_region: Region = REGIONS[for_region]
            # if a region is supplied, use the start and end ranges for that region
            # and ignore any start and end ranges supplied in the function call
            while len(nhs_numbers) < quantity:
                nhs_numbers.add(generate_nhs_number(selected_region=selected_region))

        else:
            raise TypeError("The for_region argument must be a valid region")

Just a suggestion.

pacharanero commented 1 year ago

insist that the user create a Region object if they want numbers for a specific region

When passing in a Region object they get this Region from the constants file, they don't need to create a Region object, they need to use the ones we've provided:

>>> from nhs_number import generate
>>> from nhs_number import REGIONS # use the existing REGIONS
>>> g = generate(for_region=REGIONS["ENGLAND_WALES_IOM"])
['6078301586']

I do think perhaps we could abstract this away though, and give the user some text options for region name, which we match to the REGION name under the hood.

andylaw commented 1 year ago

They don't even need to do that. REGION_ENGLAND, REGION_WALES, REGION_SCOTLAND are all declared and exported

eatyourpeas commented 1 year ago

Great thank you.

pacharanero commented 1 year ago

1) Is there an argument that we don't need a label in both the Range objects and the Region objects? I'd say just label the Regions.

2) Everything else being equal are we OK to merge this PR and then @andylaw's further refactor and call it done for now?

andylaw commented 1 year ago
  1. Is there an argument that we don't need a label in both the Range objects and the Region objects? I'd say just label the Regions.

I put them in both on the grounds that I figured we might want to report that number 'xyz' was valid for Region "England, Wales and IOM" and was from Range "Lower from 4000... to 49999...". It gives additional granularity but if we're not using it then there's a decent argument that we should remove it until such time as we need it. Instinct tells me that we need it for the full details() reporting, but instinct also tells me that the details() functionality is probably sparkles rather than requirement.

andylaw commented 1 year ago

2. Everything else being equal are we OK to merge this PR and then @andylaw's further refactor and call it done for now?

Or bin this and create a new, all-encompassing PR. Again, I have no firm opinion other than that we should do one or the other :)

pacharanero commented 1 year ago

One thought: Did the off-by-1/10th error get fixed? @andylaw

eatyourpeas commented 1 year ago

I am not sure whether it was a real error or not. To my mind it would probably be tidier if @andylaw were able to incorporate all the ideas from this thread into his recent code and resubmit as a fresh offer. Should then be a formality to nod it through and get it published.

Thank you @pacharanero for driving it forward, and thank you @andylaw for the code and ideas.

andylaw commented 1 year ago

One thought: Did the off-by-1/10th error get fixed? @andylaw

Not worth fixing. It means that the last number in a range has a 10% less chance than any other number in being represented and it may be that for some of the ranges, that number is invalid anyway. I was confusing myself with the range(), integer arithmetic and other jiggery-pokery, compounded by late night brain fog.

Had I done the integer arithmetic before the random.choice() then it would have been a real off-by-one. I'm happy to live with it as is.

andylaw commented 1 year ago

Do we need to update the documentation in the README? It's certainly incorrect re the normalise() function, which is now called normalise_number() I think.

pacharanero commented 1 year ago

If we can merge these new features I'm happy to sort documentation tomorrow. README will need updating with usage examples for all the features.