uk-fci / nhs-number

Python package to provide utilities for NHS Numbers, including validity checks, normalisation, and generation.
https://nhs-number.uk-fci.tech/
MIT License
4 stars 6 forks source link

NHS Number Generation #7

Closed pacharanero closed 1 year ago

pacharanero commented 1 year ago

It is suggested we add a feature to generate valid and invalid NHS numbers, for testing.

Generation of valid NHS Numbers that may be “live”.

@andylaw comment: In the thread referenced elsewhere about the “999” range, some were expressing concern that they might accidentally look up someone’s health records. Personally, I reckon that’s always going to be down to the idiocy of the person requesting the randomly-generated numbers – one should never test random data against a production system. Also, it’s impossible to generate a valid CHI number without it possibly being a live number.

I agree and I think we should make it clear in doco that firstly all generated numbers COULD be a real number and must never be looked up in a live system, and secondly that users should NOT be obtaining any 'live' NHS numbers from our package since they are all assigned by a central assignment body.

Generation of INVALID NHS Numbers to test other systems.

However, if you’ve included our package to check the validity and then ask us to generate a list of invalid numbers then I would expect your code to spot that they are invalid numbers.

eatyourpeas commented 1 year ago

So my suggestion is that any generative function should accept some flags as parameters. These might include: only_chi_numbers only_nhs_numbers only_hsc_numbers only_development only_invalid It would also allow generation of a batch of unique numbers. Disclaimers could be added to the docstrings in the function of the type already mentioned.

andylaw commented 1 year ago

@eatyourpeas - see Issues #5 and #6. I think the best way to address those (and this) is to have a set of "constants" defining regions/countries (we can alias them to cover any political sensibilities) and a separate set of constants to define (in)validity states.

i.e.

REGION_ENGLAND.= 'england'
REGION_WALES = 'wales'
REGION_SCOTLAND = 'scotland'
REGION_NORTHERN_IRELAND = 'northern ireland'

and

VALID = 'valid'
INVALID = 'invalid'
INVALID_CHECKSUM = 'invalid checksum'
INVALID_DATE_OF_BIRTH = 'invalid date of birth'
INVALID_RANGE = 'invalid range'

We could do this using ENUMs. That would pin us to Python 3.4 and above.

We'll need to decide if we provide signature(s) that allow numbers for multiple regions to be generated. My instinct is to make it a single region per function call.

pacharanero commented 1 year ago

I agree @andylaw - I have actually been working on something similar for the regions - a constant REGIONS which contains the start and end number ranges as well as a plain text representation of the Region Name, and then a tag section which can be used to identify the region using several equally valid aliases, as you say to help ameliorate any political sensitivities.

Something like:

REGIONS = [
    { "start": 10_000_000, "end": 311_299_999, "label": "Scotland CHI numbers", "tags": ["scotland", "chi"] },
    ... ,
    ... ,
    ]

This constant can then be imported wherever needed, and would serve as the 'source of truth', and in the documentation we can provide references for the facts asserted.

I have also got some clarity from my Irish contacts on the ROI situation - I will link to it in another Issue so we can put all Eire/Ireland related info in that thread and link back here.

andylaw commented 1 year ago

"start" should be 01_000_000 for CHI numbers unless you want to disenfranchise everyone born before the 10th of any given month :)

I would also suggest a design that actually names the regions individually. All those strings are 'magic' as you've coded it

pacharanero commented 1 year ago

"start" should be 01_000_000 for CHI numbers unless you want to disenfranchise everyone born before the 10th of any given month :)

From our README information the number range for CHI is 100000000n-319999999n, is this incorrect? The underscores as 000's separators works in Python for integers but doesn't seem to like having leading zeros, at least, the syntax highlighting/linter doesn't like it.

We could simply use strings everywhere, that might be simpler.

I would also suggest a design that actually names the regions individually. All those strings are 'magic' as you've coded it

I'm not sure what you mean by 'magic'.

Do you mean having a key:value of the range name:

REGIONS = [
    "scotland": { "start": 10_000_000, "end": 311_299_999, "label": "Scotland CHI numbers", "tags": ["scotland", "chi"] },
    "wales": { ... } , etc
    ... ,
    ]
andylaw commented 1 year ago

From our README information the number range for CHI is 100000000n-319999999n, is this incorrect?

It is incorrect. The first six digits of a CHI number are the holder's date of birth in DDMMYY format (see https://www.datadictionary.nhs.uk/attributes/community_health_index_number.html). Thus, the first digit will be a zero in roughly 30% of cases.

The lowest number that's valid for a CHI number (excluding checksum) is therefore 010100000 (female born on 1st January 2000 - more likely than 1900)

The underscores as 000's separators works in Python for integers but doesn't seem to like having leading zeros, at least, the syntax highlighting/linter doesn't like it.

We could simply use strings everywhere, that might be simpler.

Numbers is fine, but they can and do start with a zero

andylaw commented 1 year ago

I would also suggest a design that actually names the regions individually. All those strings are 'magic' as you've coded it

I'm not sure what you mean by 'magic'.

Do you mean having a key:value of the range name:

REGIONS = [
    "scotland": { "start": 10_000_000, "end": 311_299_999, "label": "Scotland CHI numbers", "tags": ["scotland", "chi"] },
    "wales": { ... } , etc
    ... ,
    ]

No. I mean...

REGION_SCOTLAND = {
    "start": 1_000_000,
    "end": 311_299_999
    "label": "Scotland CHI numbers"
}
REGION_ENGLAND = {
...
}

REGIONS = [
    REGION_SCOTLAND,
    REGION_ENGLAND,
    ...,
]

...and then I can write code like...

for region in REGIONS:
 ...

...and

valid = nhs_number.is_valid('1234567890', for_region=REGION_SCOTLAND)

Bonus points awarded for coding the Region as a Class and creating one named instance per region at the top of the code which lets syntax checkers look inside and make sure that we don't hit typos for start and end when we're using them