Handling dataset with different time periods

ybkamaleri commented 2 years ago

Some datasets have categories that cover different time periods. For example, the UFORE data runs from 2000 to 2021, and the variable YTELSE has three categories: varig, aap and samlet. Two of these categories (aap and samlet) start in 2011, while varig starts in 2000. What is the best way to handle categories that don't start in the same year?

Adding empty rows for samlet and aap from 2000 to 2010 would mean expanding the dataset, i.e. every combination of the other variables and their categories (e.g. KJONN, ALDER) would also need a row with value 0.

Why do we need to do this?

jorgenRM commented 2 years ago

The dataset UFORE contains the dimensions GEO, AAR, KJONN, ALDER, UTDANN, LANDBAK, INNVKAT and YTELSE. Not all combinations of these dimensions are present in the dataset: if, for a given combination, there were zero disabled persons (uføre), i.e. TELLER=ANTUFORE=0, this combination is not present in the dataset. So, in principle, an absent combination should be interpreted (by R-løypa) as TELLER=ANTUFORE=0, that is, an "implicit zero". This piece of information is passed on to R-løypa through a parameter specified in KHELSA.

The problem is that, in the dataset UFORE, not all absent combinations should be interpreted as ANTUFORE=0. In fact, all absent combinations with AAR<2011 and YTELSE=aap or YTELSE=samlet should be interpreted as ANTUFORE=NA (or something of the sort).

These two different interpretations of absent combinations cannot both be specified (in KHELSA) for the same dataset.
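
To make the two interpretations concrete, here is a minimal Python sketch of the rule described above, reduced to the two dimensions AAR and YTELSE. The function name, the `FIRST_YEAR` table and the example counts are hypothetical illustrations; they are not part of orgdata, KHELSA or R-løypa.

```python
# Hypothetical sketch: names and data are illustrative only.
# First year in which each YTELSE category was collected.
FIRST_YEAR = {"varig": 2000, "aap": 2011, "samlet": 2011}


def antufore(counts, aar, ytelse):
    """Return ANTUFORE for one (AAR, YTELSE) combination.

    counts maps (AAR, YTELSE) -> ANTUFORE for the rows that are
    explicitly present in the dataset.
    """
    key = (aar, ytelse)
    if key in counts:
        return counts[key]      # explicit row
    if aar < FIRST_YEAR[ytelse]:
        return None             # category not collected yet: NA
    return 0                    # absent but collected: implicit zero


counts = {(2005, "varig"): 12, (2015, "aap"): 7}
print(antufore(counts, 2006, "varig"))  # absent, collected
print(antufore(counts, 2005, "aap"))    # absent, not collected
```

A single dataset-wide "absent means zero" switch cannot express the second branch, which is the conflict described here.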

SOLUTIONS:

1. For every absent combination with AAR<2011 and YTELSE=aap or YTELSE=samlet, add an explicit row and set ANTUFORE=NA. Problem: this will add an estimated 250 million rows to the dataset, blowing it up to 8 times its present size and making it hard to process further.
2. Produce two "kuber" (cubes) instead of one: one for YTELSE=aap and YTELSE=samlet with STARTAAR=2011, and one for YTELSE=varig with STARTAAR=2000. After these two kuber have been through R-løypa, stack them into a single kube.

Solution 2 could be done "manually", or we (Yusman) could try to develop a wrapper to be used for this type of dataset (there are more of them, e.g. SYSVAK). The wrapper would loop over the dataset, sending each subset through R-løypa and creating one kube for each subset of the data. After all of them have been through R-løypa, the same wrapper would stack the resulting kubes into a single kube.
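
The split, process and stack idea can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `run_pipeline` is a hypothetical stand-in for R-løypa (here it just drops rows before the start year and tags the rest), and `build_kube` and the row layout are invented for the example.

```python
# Hypothetical sketch: run_pipeline stands in for R-løypa.
def run_pipeline(rows, start_aar):
    """Process one subset; keeps rows from start_aar onwards."""
    return [dict(r, STARTAAR=start_aar) for r in rows if r["AAR"] >= start_aar]


def build_kube(rows, start_by_ytelse):
    """Split rows by start year, process each subset, stack the results."""
    # Group the YTELSE categories that share a start year.
    groups = {}
    for ytelse, start in start_by_ytelse.items():
        groups.setdefault(start, set()).add(ytelse)

    kube = []
    for start in sorted(groups):
        subset = [r for r in rows if r["YTELSE"] in groups[start]]
        kube.extend(run_pipeline(subset, start))  # stack onto the result
    return kube


rows = [
    {"AAR": 2005, "YTELSE": "varig", "ANTUFORE": 12},
    {"AAR": 2012, "YTELSE": "aap", "ANTUFORE": 7},
    {"AAR": 2012, "YTELSE": "samlet", "ANTUFORE": 19},
]
kube = build_kube(rows, {"varig": 2000, "aap": 2011, "samlet": 2011})
```

Grouping by start year rather than by category keeps the number of pipeline runs at two for UFORE, matching the two kuber proposed in solution 2.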

Other solutions may be possible. What do you think?

ybkamaleri commented 2 years ago

How about this: for years before 2011, read the files that contain varig as if they were the files for samlet and aap, creating dummy samlet and aap tables with VAL1 set to NA or 0. Then, in the POST recode, recode VAL1 to ":" (data not available).

raniets commented 2 years ago

(The Raw files in question have only one Antall column. What the Antall is counting is implicit in the filename.) I'm not sure if I understand your proposal, but I read it like this:

For each year < 2011:

1. Read the file containing "varig" (to get a row for each combination of the rest of the variables?).
2. Drop the ANTUFORE (antall) column, and replace it with a dummy Antall column containing only NA.
3. Tell løypa that this is the file (table) containing "aap".
4. After stacking the file onto the ORGfil being built, recode NA into the "data not available" code.
5. Repeat to make a dummy "samlet" table for this year.
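
The steps above could be sketched like this in Python for a single year. Everything here is a simplified assumption for illustration: the helper names are hypothetical, the rows carry an explicit YTELSE field (in the raw files it is implicit in the filename), and this is not how orgdata itself implements the read-in or the recode.

```python
# Hypothetical sketch: not orgdata's actual implementation.
NOT_AVAILABLE = ":"  # "data not available" code mentioned in the thread


def make_dummy(varig_rows, ytelse):
    """Copy the varig rows, blank the count, and relabel them as ytelse."""
    return [dict(r, YTELSE=ytelse, ANTALL=None) for r in varig_rows]


def post_recode(rows):
    """Recode missing counts (None, i.e. NA) to the not-available code."""
    return [dict(r, ANTALL=NOT_AVAILABLE) if r["ANTALL"] is None else r
            for r in rows]


varig_2005 = [
    {"AAR": 2005, "KJONN": 1, "YTELSE": "varig", "ANTALL": 12},
    {"AAR": 2005, "KJONN": 2, "YTELSE": "varig", "ANTALL": 9},
]

orgfil = list(varig_2005)                 # real varig table
for ytelse in ("aap", "samlet"):          # one dummy table per category
    orgfil.extend(make_dummy(varig_2005, ytelse))
orgfil = post_recode(orgfil)
```

Note that the dummy tables only contain the combinations that exist in the varig file, so combinations absent from varig remain absent for aap and samlet as well.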

To me, this looks feasible - but I don't have the complete picture in my head of whether this is enough to trigger a correct treatment of all the (still) missing rows.

-SteinarB.

ybkamaleri commented 2 years ago

That's correct. Please look at orgdata Access for UFORE: see the innlesing (read-in) ID of samlet_dummy and the recode table for it.

ybkamaleri commented 2 years ago

Marie has tested the solution I suggested above and it works! 💃🏼 I presume this issue is now solved until proven otherwise 😄

helseprofil / orgdata

Handling dataset with different time periods #249