steno-aarhus / osdc

Open-Source Diabetes Classifier: an R package to classify diabetes status in Danish registers
https://steno-aarhus.github.io/osdc/

Create a fake dataset to test that the functions work #4

Closed lwjohnst86 closed 6 months ago

lwjohnst86 commented 1 year ago

When eventually converting this into a formal package, it's good to add unit tests to confirm the code is doing what we expect. For that you'd need to create a fake dataset with all the variables used along with a dataset/column with the right answer.

Aastedet commented 11 months ago

I've added the first part of a script to generate some fake data covering the edge cases I could think of (@198906f). Only medication data is covered at the moment. Diagnosis data is going to be a bit tricky to do, but health insurance data and laboratory data should be a good place to start. Maybe something for @signekb to look at if she gets time to look at it before I do.

Lab data should basically be a data frame with four columns:

- pnr: Patient ID
- SAMPLINGDATE: Date of sample
- ANALYSISCODE: NPU code of the analysis type
- VALUE: The numerical result of the test

Content-wise, it should contain something like this:

- "pnr": ID variable. A character variable consisting of random values from 001 to 100.
- "SAMPLINGDATE": A date variable of random dates from 1995 to 2015.
- "ANALYSISCODE": A character variable. Half of all values should be 'NPU27300' or 'NPU03835'. The rest of the values should be 'NPU' followed by five random digits.
- "VALUE": A double variable. Contains random values between 0.1 and 99.9.
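As a starting point, the lab data description above could be sketched in base R roughly like this. This is only a sketch, not the code that ended up in the package: the row count `n`, the seed, the object name `fake_lab`, and the helper `random_npu()` are all made up for illustration.

```r
# Sketch of fake lab data following the description above.
# Assumptions (not from the thread): n, the seed, and all object names.
set.seed(123)
n <- 1000

# Hypothetical helper: "NPU" followed by five random digits.
random_npu <- function(n) {
  paste0("NPU", formatC(sample(0:99999, n, replace = TRUE), width = 5, flag = "0"))
}

# Half of the rows get one of the two codes of interest, the rest a random code.
is_target <- rep(c(TRUE, FALSE), length.out = n)
analysiscode <- ifelse(
  is_target,
  sample(c("NPU27300", "NPU03835"), n, replace = TRUE),
  random_npu(n)
)

all_dates <- seq(as.Date("1995-01-01"), as.Date("2015-12-31"), by = "day")

fake_lab <- data.frame(
  # Zero-padded character IDs from "001" to "100".
  pnr = formatC(sample(1:100, n, replace = TRUE), width = 3, flag = "0"),
  SAMPLINGDATE = sample(all_dates, n, replace = TRUE),
  # Shuffle so the codes of interest are not in every other row.
  ANALYSISCODE = sample(analysiscode),
  VALUE = runif(n, min = 0.1, max = 99.9),
  stringsAsFactors = FALSE
)

head(fake_lab)
```

A random code can by chance equal 'NPU27300' or 'NPU03835', so strictly speaking at least half of the values are codes of interest.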

Health insurance data is also a four-column data frame, but it might be a little bit trickier since one variable is a weirdly formatted date:

- pnr: Patient ID
- BARNMAK: Was the service performed for the patient's child or not
- SPECIALE: The service code
- HONUGE: Year/week of the service being billed

Content-wise:

- "pnr": ID variable. A character variable consisting of random values from 001 to 100.
- "BARNMAK": A binary variable. Is almost always 0, there are only a few 1's.
- "SPECIALE": A 6-digit integer variable. Half should be random samples between 100000 and 600000, and the other half random samples from 540000 to 549999.
- "HONUGE": A 4-digit character variable. The first and second digits should be random numbers between 01 and 52. The third and fourth digits should be random numbers between 00 and 99.
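The health insurance description above could be sketched in the same way. Again, this is just a sketch under assumptions: `n`, the seed, the object name `fake_hi`, and the 5% share of 1's in BARNMAK are made up here (the description only says there are "a few" 1's). HONUGE is built exactly as described, with the 01 to 52 part first and the 00 to 99 part last.

```r
# Sketch of fake health insurance data following the description above.
# Assumptions (not from the thread): n, the seed, object names, and the
# 5% probability used for BARNMAK.
set.seed(456)
n <- 1000

# Half of the service codes from the broad range, half from 540000-549999.
speciale <- c(
  sample(100000:600000, n / 2, replace = TRUE),
  sample(540000:549999, n / 2, replace = TRUE)
)

# Two zero-padded parts pasted together into a 4-character string.
week_part <- formatC(sample(1:52, n, replace = TRUE), width = 2, flag = "0")
year_part <- formatC(sample(0:99, n, replace = TRUE), width = 2, flag = "0")

fake_hi <- data.frame(
  # Zero-padded character IDs from "001" to "100".
  pnr = formatC(sample(1:100, n, replace = TRUE), width = 3, flag = "0"),
  # Almost always 0, with a few 1's.
  BARNMAK = rbinom(n, size = 1, prob = 0.05),
  # Shuffle so the two halves are mixed across rows.
  SPECIALE = sample(speciale),
  HONUGE = paste0(week_part, year_part),
  stringsAsFactors = FALSE
)

head(fake_hi)
```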

Aastedet commented 11 months ago

If you paste these descriptions into ChatGPT or similar and ask for an R data frame, you might get a head start (I did for medication)

lwjohnst86 commented 11 months ago

Mmm, looking through the code, I'm not sure what ChatGPT is doing to be honest... I've refactored the code in #19 to make more sense.

lwjohnst86 commented 9 months ago

Create this dataset with usethis::use_data(..., internal = TRUE)