rOpenGov / sweidnumbr

R package for Structural handling of identity numbers used in the swedish administration such as personal identity numbers (personnummer) and organizational identity numbers (organisationsnummer).
https://ropengov.github.io/sweidnumbr/
Other
8 stars 3 forks source link

Fix default proportion of males and pin checksums in rpin() #116

Closed pitkant closed 4 years ago

MansMeg commented 4 years ago

Hi!

Thank you! The documentation is not correct so you should build using oxygen to get the docs up to date.

Also, the checksum problem is not caught by any test suite as I can recall? Can you elaborate on that problem so I understand? Im hesitant to add such a change without being sure and having unit tests capturing the problem.

pitkant commented 4 years ago

Hi,

Thank you for your reply. I will update my documentation.

As for the checksum, the function initially generated random pins, for example "199503029688" and "197909135927". These randomly generated pins used the long form YYYYMMDDNNN to count the checksum, resulting the abovementioned 8 and 7. Using the shorter form YYMMDDNNN produces different results, 9 and 8 respectively, as the sum of digits lacks 1 2 + 9 1 (or 2 2 + 1 0 in case of people born in 2000s) from the beginning.

MansMeg commented 4 years ago

Ok? So what is the correct way? Do you have any reference where this is stated?

pitkant commented 4 years ago

Here is an illustration from SCB's website:

sa-beraknas-kontrollsiffran (source: https://www.scb.se/hitta-statistik/artiklar/2017/Personnumret-fyller-70-ar/)

MansMeg commented 4 years ago

Ah! Ok, then it is a bug.

Ideally we should have a test vase that fails with the vurrent implementation. I guess generating a large number of pins and then test the control number for those should work (and fail with the old implementation) but pass with the new?

If you could add that it would be great! If you dont have time I totally understand.

pitkant commented 4 years ago

I guess using some real world IDs would the most ideal case, to test both pin_ctrl and rpin functions at the same time. pin_ctrl function actually used a similar multiplier as the change I proposed (c(0, 0, 2, 1, 2, 1, 2, 1, 2, 1, 2, 0), with the last 0 reserved for the checksum marker) so it is no wonder a similar multiplier in rpin produces ID's that pass this test. At least ID numbers taken from skatteverkets API worked fine with pin_ctrl function

Here is the printout from my console:

sweidnumbr_test.txt

MansMeg commented 4 years ago

The mainbproblem is not pin_ctrl(), as you say. From your earliest example it seems like the current implementation of rpin() might simulate incorrect pins. Hence, using Skatteverkets test-pin will not help since the problem is in rpin().

Although, if you are correct, we would get pins that are incorrect if we just simulate enough pins? Hence we can add a test that simulate, say 100 000 pin and check that the control number is correct for all these pins. This should fail if rpin is incorrect? But work with your patch? Does this make sense?

pitkant commented 4 years ago

Since the control number produced by rpin() is a result of a rather straightforward algorithm that always produces a valid control number and the same algorithm is used to check the control number in pin_ctrl(), it is almost like a circular reference. Using 100, 1000 or 100000 PINs should make no difference as the algorithm should never produce invalid results or rare outliers that could be found by simulating very large numbers. Using an external source for PINs was just my way of trying to control whether the algorithm in pin_ctrl is correct or not.

I however tried the functions with 100 000 generated PINs and got the same results as earlier (rpin2 being the function with the small corrections proposed in this pull request and rpin being the original function):

`pins_new <- rpin2(100000) '#Given a set of logical vectors, are all of the values true? 'all(sweidnumbr::pin_ctrl(pins_new)) '# [1] TRUE

'pins_old <- rpin(100000) '# Given a set of logical vectors, is at least one of the values true? 'any(sweidnumbr::pin_ctrl(pins_old)) '# [1] FALSE`

MansMeg commented 4 years ago

Thanks!

Hmmm... Im not sure I fully understand. I thought that rpin() produced incorrect control numbers since the first two digits where not 0? The pin_ctrl() checks the control number correctly (without the two first digits). Hence, if this is the problem rpin should produce incorrect control numbers for the pins? Otherwise, it would not be a bug? I get the following to work:

set.seed(4711)
pins_old <- sweidnumbr::rpin(1000)
# Given a set of logical vectors, is at least one of the values true?
all(sweidnumbr::pin_ctrl(pins_old))
# [1] FALSE

Hence, the rpin() generates pins with incorrect control number. If we just add this as a test case this should now fail. Then we add your correction and the test case should pass. Does this make more sense?

pitkant commented 4 years ago

By test case you mean that the code snippet above would be added to the documentation as an example?

MansMeg commented 4 years ago

Hi!

No, add it to the test suite of rpin(). You can find it here: https://github.com/rOpenGov/sweidnumbr/blob/master/tests/testthat/test-rin.R I think it is evident how to inglude it, otherwise you can try and push it to the PR and we can discuss it. If you use R-Studio and setup this repo as a project you can run the test suites under the build tab.

Is this more clear? If you find it difficult, I can fix it.

MansMeg commented 4 years ago

Great! There is still some errors in the docs:

Codoc mismatches from documentation object 'rpin':
rpin
  Code: function(n, start_date = "1900-01-01", end_date = Sys.Date(),
                 p.male = 0.5, p.coordn = 0.1)
  Docs: function(n, start_date = "1900-01-01", end_date = Sys.Date(),
                 p.male = 0.1, p.coordn = 0.1)
  Mismatches in argument default values:
    Name: 'p.male' Code: 0.5 Docs: 0.1

I think you should run Roxygen once in R-studio and then update the documentation for the rpin() function. Then the tests should probably pass.

pitkant commented 4 years ago

I could not replicate this error on my end; checking for code/documentation mismatches passed smoothly on all cases.

Which one ultimately is the desired default for p.male, 0.1 or 0.5?

MansMeg commented 4 years ago

If you run check package in R-Studio you should probably get the same error if not, you should have already produced a new rin.rd doc that you can commit and push to the PR to fix this error.

coveralls commented 4 years ago

Coverage Status

Coverage remained the same at 95.254% when pulling 11c085fe5846869edfd43a1d640c981bdc257a3c on sweidbranch into 0ebea64b4bf731a20079e6f30c29a4ad0ed0bb51 on master.

MansMeg commented 4 years ago

Nice work! I will now merge!