uk-fci / nhs-number

Python package to provide utilities for NHS Numbers, including validity checks, normalisation, and generation.
https://nhs-number.uk-fci.tech/
MIT License
4 stars 6 forks source link

generate() should default to providing numbers from the unallocated development / test range #24

Open mattstibbs opened 1 year ago

mattstibbs commented 1 year ago

Should generate() default to generating numbers that are from the dedicated test range? I.e. 900000000n - 999999999n

It would still create valid NHS numbers, but would remove the risk of using a live NHS number as a test number.

If we do feel there is a need for generating live numbers from specific regional ranges, then that should only happen by specifying a region parameter

pacharanero commented 1 year ago

It's a fair point, although it means that the default behaviour of generate() might be unintuitive to some people. Worth considering the views of the community - let's see what people think.

mattstibbs commented 1 year ago

I'd have thought it would be clear enough if the usage documentation clearly stated that by default numbers are generated in the synthetic range, unless explicitly specified.

andylaw commented 1 year ago

I disagree. There's a perfectly usable for_region parameter that you can use to restrict the range of numbers returned. The absence of a for_region means there are no range restrictions.

mattstibbs commented 1 year ago

My proposal is about helping to protect users from generating random live NHS numbers for testing unless they explicitly need to. The parameter is only as good as the people who use it. If it is easier not to use it, users will generally not.

It's perfectly pythonic to apply a default keyword parameter in absence of a user-provided one. In this case that would be a default of region=synthetic.

I have experienced several incidents over the years of people using live NHS numbers for testing, and for various reasons those then finding their ways into places they shouldn't and causing confusion / data issues.

What is the use case for generating a random number from all regions that couldn't be met with an explicit region=all parameter for instance?

andylaw commented 1 year ago

My proposal is about helping to protect users from generating random live NHS numbers for testing unless they explicitly need to. The parameter is only as good as the people who use it. If it is easier not to use it, users will generally not.

Then why not just declare and export SAFE_NHS_NUMBER = "9990000018" and be done with it.

It's perfectly pythonic to apply a default keyword parameter in absence of a user-provided one. In this case that would be a default of region=synthetic.

In my experience, the supplied arguments that are defaulted tend to be numbers and Booleans, not more complex objects

I have experienced several incidents over the years of people using live NHS numbers for testing, and for various reasons those then finding their ways into places they shouldn't and causing confusion / data issues.

Your problem there is people testing on live systems, not code that generates NHS Numbers.

What is the use case for generating a random number from all regions that couldn't be met with an explicit region=all parameter for instance?

Or simply create a function called generate_test_numbers() to do what you are looking for. More Pythonic in that it is explicit about what it does, rather than through magic

One could invert your logic and ask "What is the use case for generating a random number from all regions that couldn't be met with an explicit region=REGION_TEST parameter for instance?".

Changing the current behaviour would also be a breaking change, in my opinion, so there's another reason for keeping it as it is and extending if required

mattstibbs commented 1 year ago

Then why not just declare and export SAFE_NHS_NUMBER = "9990000018" and be done with it.

I don't really understand this point I'm afraid - but I infer this wasn't a serious suggestion.

In my experience, the supplied arguments that are defaulted tend to be numbers and Booleans, not more complex objects

I agree they tend to be static values - I think constants are fine personally - but I can't question your personal experience.

Your problem there is people testing on live systems, not code that generates NHS Numbers.

Yes - the problem is generally and very often how people use things, not the things themselves. Generally, this is not an argument for not helping users do sensible things. We cannot predict how people might use this library to generate numbers, in particular simply as a tool to generate test numbers in large quantities etc.

Or simply create a function called generate_test_numbers() to do what you are looking for. More Pythonic in that it is explicit about what it does, rather than through magic

That's a good alternative suggestion. It does add complexity in a different way, but could be documented clearly and give people an easy way of generating numbers that they know can never be live.

One could invert your logic and ask "What is the use case for generating a random number from all regions that couldn't be met with an explicit region=REGION_TEST parameter for instance?".

Sorry I don't understand this.

Changing the current behaviour would also be a breaking change, in my opinion, so there's another reason for keeping it as it is and extending if required

I agree with the general principle about avoiding breaking changes. The package was only published today - I expected the usage to be very low on that basis.

andylaw commented 1 year ago

Then why not just declare and export SAFE_NHS_NUMBER = "9990000018" and be done with it.

I don't really understand this point I'm afraid - but I infer this wasn't a serious suggestion.

If you're concerned that users will generate NHS numbers that are potentially live numbers, then why not mandate that they just use a single pre-defined value to test the systems that you're responsible for. Why do they need more than one? Why do they need anything that might be a live number?

It's a slightly tongue-in-cheek suggestion, yes, but I'm gently highlighting how far you could go with the "won't someone think of the poor defenceless naive coders" argument

andylaw commented 1 year ago

Changing the current behaviour would also be a breaking change, in my opinion, so there's another reason for keeping it as it is and extending if required

I agree with the general principle about avoiding breaking changes. The package was only published today - I expected the usage to be very low on that basis.

A breaking change is a breaking change

mattstibbs commented 1 year ago

A breaking change is a breaking change

Again - I'm sorry, but I'm not clear as to your point here. If there was ever a time to make breaking changes with less impact, it's before anyone is using the package - surely. This is a distraction from the original question.

If you're concerned that users will generate NHS numbers that are potentially live numbers, then why not mandate that they just use a single pre-defined value to test the systems that you're responsible for. Why do they need more than one? Why do they need anything that might be a live number?

There are surely imaginable scenarios for using multiple validly-formatted NHS numbers for testing purposes. That's not to say that many testing use cases couldn't be met with a single allocated number.

It's a slightly tongue-in-cheek suggestion, yes, but I'm gently highlighting how far you could go with the "won't someone think of the poor defenceless naive coders" argument

Generally, this is not a satisfactory argument for not addressing any potential harms. This is ultimately a risk mitigation argument.

mattstibbs commented 1 year ago

@andylaw I feel strongly that software in healthcare should help people do good and sensible things. I do not feel strongly about the implementation detail.

I'll leave this with you now, as it would appear you do feel incredibly strongly both about the implementation details and about the idea of compromising the code for a "non-issue".

andylaw commented 1 year ago

One could invert your logic and ask "What is the use case for generating a random number from all regions that couldn't be met with an explicit region=REGION_TEST parameter for instance?".

Sorry I don't understand this.

It was a clumsy, and logically wrong as I have written it, attempt to point out that your request to generate NHS numbers in a particular block could equally be met by specifying a range. Which it could be. What we're differing on is way that we instinctively interpret the "natural" output of the function.

I've been working on systems that span the entire UK. I haven't used the generator, but I have use cases for sets of NHS numbers that trigger different actions depending on the region that they come from. In my mind, specifying a region restricts the numbers generated. Thus, not specifying a region should produce numbers from the entire spectrum of possibilities

andyatterton commented 3 months ago

Sorry I feel like I'm taking over your issues section. Hope you don't mind me contributing to this conversation as well.

The current behaviour of this is that it creates an NHS number from the full range, meaning it could very well generate one from the synthetic range anyway, or it might not. My feeling is, regardless of generating live numbers, being explicit about the region you would like to test is preferable.

In terms of the argument for_region, the default is currently None and there is no need to change it. Instead, in the generate function you just need to change

ranges = [FULL_RANGE]

to

ranges = [RANGE_NOT_ISSUED_SYNTHETIC]

If it was me, I would be perfectly okay knowing that by default any number generated wasn't going to get confused for a live number and would see this as a security improvement. I have also experience live test numbers getting into places they shouldn't be, and it's a pain. Making a user supply a specific region just means they are forced to make a conscious decision to dice with potentially live numbers. You aren't preventing that behaviour, but you are making them consider their actions.

Seeing as you aren't changing the function call and only slightly changing the ambiguity of the output, I wouldn't think this qualifies as a breaking change.

In terms of why you might want to create more than one NHS number for testing, there are plenty of good use cases. We have systems that prevent users inputting an existing NHS number, we model family connections, and for load testing. I'm sure there are a host of other reasons.

My personal feelings aside, to implement the CHI number checks (issue 25) and have the valid argument of this function behave correctly, you have to lock the Scottish range down to only valid dates. Letting it randomly draw numbers from the Scottish range will mean it produces invalid CHI numbers. My suggestion would be to check the region, if it's Scotland build a random date, build the rest of the number, get the correct check sum, and return that. This means the existing behaviour for generating an invalid number would remain untouched.

If this is agreeable, I can make a PR that deals with both issues. I don't think it's possible/easy to have one behaviour without the other.

pacharanero commented 3 months ago

If this is agreeable, I can make a PR that deals with both issues. I don't think it's possible/easy to have one behaviour without the other.

A PR that fixes both issues would be most welcome.

I agree that number generation should by default come from the unallocated range to avoid accidentally using live numbers in code, tests or documentation, which could be seen as a breach of privacy for the individual whose number it was, even though it was inadvertent, unintentional, and didn't leak any PID. It's avoidable and so we should avoid it.

Having fully functional CHI number validation and generation would be a major boost to the library so thank you for offering to add this. I have colleagues in SCIMP and other Scottish government healthcare IT service who would be glad we have solved this.