uk-fci / nhs-number

Python package to provide utilities for NHS Numbers, including validity checks, normalisation, and generation.
https://nhs-number.uk-fci.tech/
MIT License
4 stars 6 forks source link

CHI number validation #25

Open mattstibbs opened 1 year ago

mattstibbs commented 1 year ago

CHI numbers conform to the standard NHS validation format, but have an additional requirement that the first 6 digits must be a valid date of birth (DDMMYY).

The validate functions should identify that a provided number falls into the Scotland range, and then apply the additional validation checks.

https://en.wikipedia.org/wiki/National_Health_Service_Central_Register#Community_Health_Index

Patients are identified using a ten-digit number known as the CHI Number. This number is normally formed using the patient's date of birth (as DDMMYY), followed by four digits: two digits randomly generated, the third digit identifying gender (odd for men, even for women) and a check digit.

This new validation should then be applied to the generate() function when a user passes in for_region=nhs_number.REGION_SCOTLAND so that only a valid CHI is generated.

pacharanero commented 1 year ago

Thanks @mattstibbs we would definitely like to add this additional validation, we are open to a PR :smile:

mattstibbs commented 1 year ago

The validate functions should identify that a provided number falls into the Scotland range, and then apply the additional validation checks.

Question: is this the right behaviour?

i.e. is_valid called on a number that falls within the CHI range will automatically apply the additional CHI validation in order to decide whether the number is valid.

Alternatively, should CHI validation be an additional validation applied somehow, leaving is_valid to validate only that the number falls into the correct format with a valid checksum?

pacharanero commented 1 year ago

I think perhaps if/when we get around to adding strict CHI number validation, this could be something like

>>> import nhs_number

>>> nhs_number.is_valid("1406230002",
                     for_region=nhs-numberREGION_SCOTLAND,
                     sex="Male"
                     strict_chi=True)
True

so that the strict_chi=True flag adds: strict CHI validity checking with DOB as plausible DDMMYY date and that the sex accords with the odd/even sex digit.

for_region continues to purely check the region/range of numbers.

We would just document clearly that this is what you would need to do in order to check a CHI number properly. For additional syntactic sugar we could create a convenience function is_valid_chi_strict(chi_number: str) function which wrapped the above is_valid syntax and added the CHI related parameters.

This os just one of many ways to implement, and I am happy for the implementer to add their own flavour.

andylaw commented 1 year ago

It would be very easy to overthink this around the phrase "plausible" date of birth

For example - 2904240012.

Not valid if the date of birth is 1924, but valid if it is 2024, but that's in the future so...??

Lots of glorious rabbit-holes to disappear down :)

mattstibbs commented 1 year ago

@pacharanero Just so I'm clear, would you suggest that we keep the default behaviour of is_valid to ignore the CHI-specific DOB validation, even for numbers that fall within the CHI range?

And then add the option to apply strict CHI validation to is_valid and add a new method is_valid_chi_number as a wrap around is_valid which applies any additional CHI validation logic we want to specify

pacharanero commented 1 year ago

Yes, that's what I'm suggesting, however I'm open to views, especially from anyone actively validating CHI numbers.

I am given to understand that Python prefers 'explicit' over 'implicit' so having explicit flags for strict CHI checking seems more Pythonic? They don't like 'magic' do they, Python people :wink:. Rubyists like me, we love magic.

I have some colleagues in GP in Scotland and I will ask them to have a look and give their views. In general I'm keener to see working code than a lot of worrying about how to implement. It'll be fine as long as it works and is documented.

andylaw commented 1 year ago

They don't like 'magic' do they, Python people 😉.

Hate it. Mind you, I used to be (still am, under the hood) a Perl person ("There's more than one way to do it")

pacharanero commented 1 year ago

(still am, under the hood) a Perl person ("There's more than one way to do it")

You're probably more of a Rubyist than you think then! Ruby gives you all the ways to do it, and all the footguns - you as the user need to know where to point the gun :-)

andylaw commented 1 year ago

You're probably more of a Rubyist than you think then!

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-221

mattstibbs commented 1 year ago

I think the key is probably agreeing on what we mean by a 'valid' number. i.e. is a valid number just conformant with the general format and checksum, or is it only valid if it could feasibly be issued etc.

There is probably an alternative argument to say that is_valid should reflect the rules we've documented in the library and the rules that people will encounter in real life, and that someone could opt to loosen the validation i.e. with an ignore_chi_validation flag.

i.e. is this simple a format validation library, or is it a helper library for the way things are in reality

VJ911 commented 1 year ago

Practically, you know whether it is a CHI number or H&C number based upon the range, so you shouldn't need additional arguments.

010 100 0000 to 311 299 9999 (used for CHI numbers in Scotland) 320 000 0001 to 399 999 9999 (allocated to the Northern Ireland H&C) 400 000 0000 onwards for England & Wales

Logic as follows: if cast to int of left 6 digits < 320,000 then check that if datetime.strptime throws an error.

andyatterton commented 3 months ago

Hey guys. Been looking into your package as a replacement for our own in house solution, and I was curious if this conversation was resolved.

If you don't mind me making a suggestion, I think I would take a binary approach and the number is either valid or it is not. If the date section isn't possible it's not valid and as such I would add the check to the is_valid function using a check for region, not give CHI numbers their own is valid wrapper.

We are currently taking the approach mentioned by @VJ911 where we check for an error from strptime which I believe takes care of any leap year oddness. I would be of the opinion that to be a valid CHI number it must be in the range of CHI numbers and the first 6 digits must make up a valid date.

Something along the lines of

if for_region == 'RANGE_SCOTLAND':
            try:
            # Should start with ddmmyy date
                datetime.strptime(nhs_number[:6], "%d%m%y")
            except ValueError:
                return False

You could have is_valid() return a NhsNumber object (details.py) that had some extra attributes to cover the information supplied with CHI numbers. Valid is a current attribute, but you could include an in_range attribute so that CHI number in range but with impossible dates would be {valid: False, in_range: True}

Could also return the sex rather than asking a user to specify it when calling is_valid so the user can run their own checks against that. On top of that, you could include a faliure_message to provide some feed back about the impossible date being the issue and any other issues that you might encounter during validation.

{valid: False, in_range: True, sex: 'male', faliure_message: 'In CHI number range but date section impossible' }

The major draw back would be that this would all constitute breaking changes if applied to is_valid() so it might need a detailed_is_valid() function or something 🤷‍♂️

pacharanero commented 3 months ago

Thanks @andyatterton for this, very valid points made and I think we should finish off the CHI implementation so it works for real-life use-cases such as yours. I don't think we have a huge user base of the library at this stage, so a breaking change that makes is_valid more accurate for real-world usage of CHI numbers is unlikely to upset anyone.

If anyone is up for making the update and creating a PR, I think we would be amenable to accepting that. If not, I will get around to this when I get around to it.

andyatterton commented 2 months ago

I opened a PR but aimed it at main instead of dev and didn't get round to mentioning it here. I haven't had chance to check back until now so apologise for that. I think you can remove #37 and hopefully #38 will answer this issue.