statficlassifications
accesses the open classifications API of
Statistics Finland at https://data.stat.fi/api/classifications/v2.
statficlassifications
contains tools for searching and importing
classifications and classification conversion keys in an R-friendly
format. The package also provides special tools for (re)classifying
regions.
For further information about open classifications API of Statistics Finland, see https://www.stat.fi/en/luokitukset/info/.
To install and start using the package:
#install.packages("devtools")
#devtools::install_github("pttry/statficlassifications")
library(statficlassifications)
There are two types of classification objects of interest: keys and classifications. Keys, or conversion keys or correspondence tables, are used to convert between different classifications and classifications can be used to convert names to codes and vice versa.
To list all available classification keys (correspondence tables), use
search_keys()
without arguments:
head(search_keys())
#> [1] "instit_sektori 1996 -> sektoriluokitus 2000"
#> [2] "ammatti 2010 -> sosioekon_asema 2011"
#> [3] "ammatti 2001 -> ammatti 2010"
#> [4] "ammatti 1987 -> ammatti 2001"
#> [5] "ammatti 1980 -> ammatti 1987"
#> [6] "jate 2004 -> jateluokitus 2004"
Plug in search terms to search for available keys:
search_keys("maakunta")
#> [1] "kunta 1990 -> maakunta 1992" "kunta 1993 -> maakunta 1992"
#> [3] "kunta 1993 -> maakunta 1994" "kunta 1997 -> maakunta 1997"
#> [5] "kunta 1995 -> maakunta 1995" "kunta 1999 -> maakunta 1998"
#> [7] "kunta 1997 -> maakunta 1998" "kunta 2000 -> maakunta 2000"
#> [9] "kunta 2002 -> maakunta 2002" "kunta 2001 -> maakunta 2001"
#> [11] "kunta 2003 -> maakunta 2003" "kunta 2006 -> maakunta 2006"
#> [13] "kunta 2004 -> maakunta 2004" "kunta 2005 -> maakunta 2005"
#> [15] "kunta 2007 -> maakunta 2007" "kunta 2008 -> maakunta 2008"
#> [17] "kunta 2009 -> maakunta 2009" "kunta 2010 -> maakunta 2010"
#> [19] "kunta 2011 -> maakunta 2011" "kunta 2012 -> maakunta 2012"
#> [21] "kunta 2013 -> maakunta 2013" "kunta 2014 -> maakunta 2014"
#> [23] "kunta 2015 -> maakunta 2015" "kunta 2016 -> maakunta 2016"
#> [25] "maakunta 1997 -> suuralue 1994" "maakunta 1992 -> suuralue 1994"
#> [27] "maakunta 1995 -> suuralue 1994" "maakunta 1997 -> laani 1997"
#> [29] "laani 1997 -> maakunta 2005" "maakunta 1998 -> suuralue 1994"
#> [31] "maakunta 1997 -> suuralue 1997" "maakunta 2005 -> suuralue 2003"
#> [33] "maakunta 2002 -> suuralue 2003" "maakunta 2002 -> suuralue 2001"
#> [35] "maakunta 2001 -> suuralue 2001" "maakunta 1998 -> nuts 2003"
#> [37] "maakunta 1998 -> nuts 1999" "maakunta 1998 -> tiepiiri 1998"
#> [39] "maakunta 2010 -> avi 2010" "seutukunta 1998 -> maakunta 1998"
#> [41] "maakunta 2012 -> seutukunta 2012" "maakunta 2007 -> suuralue 2003"
#> [43] "maakunta 2012 -> suuralue 2012" "maakunta 2011 -> suuralue 2011"
#> [45] "seutukunta 2005 -> maakunta 1998" "seutukunta 2003 -> maakunta 1998"
#> [47] "seutukunta 2001 -> maakunta 1998" "seutukunta 2007 -> maakunta 2007"
#> [49] "seutukunta 2009 -> maakunta 2009" "seutukunta 2010 -> maakunta 2010"
#> [51] "seutukunta 2011 -> maakunta 2011" "maakunta 2016 -> suuralue 2016"
#> [53] "kunta 2017 -> maakunta 2017" "kunta 2019 -> maakunta 2019"
#> [55] "kunta 2018 -> maakunta 2018" "kunta 2020 -> maakunta 2020"
#> [57] "kunta 2021 -> maakunta 2021" "maakunta 2021 -> suuralue 2021"
#> [59] "kunta 2022 -> maakunta 2022" "maakunta 2022 -> suuralue 2022"
#> [61] "kunta 2023 -> maakunta 2023" "kunta 2024 -> maakunta 2024"
#> [63] "kunta 1996 -> maakunta 1995"
search_keys("kunta", "suuralue", 2020)
#> [1] "kunta 2020 -> suuralue 2020"
The open classifications API uniquely identifies each classification key
by a local ID. Having found the suitable key, you can use
search_keys()
with an argument as_localId = TRUE
to print the
localId of the key:
localId <- search_keys(source = "kunta", target = "suuralue", year = 2020, as_localId = TRUE)
localId
#> [1] "kunta_1_20200101%23suuralue_1_20200101"
Then you can get the key by:
key <- get_key(localId)
#> Vuoden 2020 kuntien ja suuralueiden välinen luokitusavain
head(key)
#> source_code source_name target_code target_name
#> 1 407 Lapinjärvi 1 Helsinki-Uusimaa
#> 2 504 Myrskylä 1 Helsinki-Uusimaa
#> 3 505 Mäntsälä 1 Helsinki-Uusimaa
#> 4 638 Porvoo 1 Helsinki-Uusimaa
#> 5 755 Siuntio 1 Helsinki-Uusimaa
#> 6 753 Sipoo 1 Helsinki-Uusimaa
To list all available classifications, use search_classifications()
without arguments:
head(search_classifications())
#> [1] "verolaji 2019" "coicop 2018" "nettovaral_luok 2023"
#> [4] "koulutusaste 2018" "suuruusluokka 2018" "erva 2020"
Plug in search terms to search for available classifications:
search_classifications("ammatti")
#> [1] "ammatti 2010" "ammatti 2001" "ammatti 1987" "ammatti 1980" "ammatti 1997"
#> [6] "ammatti 2018" "ammatti 2021"
search_classifications("ammatti", 2021)
#> [1] "ammatti 2021"
To print the localId of the desired classification, use
as_localId = TRUE
.
localId <- search_classifications("ammatti", year = 2021, as_localId = TRUE)
Use localId to get the classification
head(get_classification(localId))
#> Tulorekisterin TK10-ammattiluokitus
#> code name
#> 1 01100 Upseerit
#> 2 02100 Aliupseerit
#> 3 03100 Sotilasammattihenkilöstö
#> 4 11110 Lainsäätäjät
#> 5 11121 Valtion keskushallinnon johtajat
#> 6 11122 Alue- ja paikallishallinnon johtajat ja ylimmät virkamiehet
For regional classifications, statficlassifications
has tailored
tools.
To load regional classifications, use get_regionclassification
:
region_classification <- get_regionclassification()
head(region_classification)
#> alue_code alue_name
#> 1 KU020 Akaa
#> 2 KU004 Alahärmä
#> 3 KU005 Alajärvi
#> 4 KU006 Alastaro
#> 5 KU009 Alavieska
#> 6 KU010 Alavus
seutukunta_classification <- get_regionclassification("seutukunta")
head(seutukunta_classification)
#> seutukunta_code seutukunta_name
#> 1 SK063 Etelä-Pirkanmaa
#> 2 SK143 Eteläiset seinänaapurit
#> 3 SK053 Forssa
#> 4 SK072 Heinola
#> 5 SK011 Helsinki
#> 6 SK051 Hämeenlinna
kunta_2010_classification <- get_regionclassification("kunta", year = 2010)
#> Overriding default option for offline when specific year is required.
head(kunta_2010_classification)
#> kunta_code kunta_name
#> 1 KU020 Akaa
#> 2 KU005 Alajärvi
#> 3 KU009 Alavieska
#> 4 KU010 Alavus
#> 5 KU015 Artjärvi
#> 6 KU016 Asikkala
To load regional correspondence tables / classification keys, use
get_regionkey
:
regionkey <- get_regionkey(only_names = TRUE)
head(regionkey)
#> kunta_name seutukunta_name maakunta_name
#> 1 Akaa Etelä-Pirkanmaa Pirkanmaa
#> 2 Urjala Etelä-Pirkanmaa Pirkanmaa
#> 3 Valkeakoski Etelä-Pirkanmaa Pirkanmaa
#> 4 Forssa Forssa Kanta-Häme
#> 5 Tammela Forssa Kanta-Häme
#> 6 Humppila Forssa Kanta-Häme
regionkey <- get_regionkey(only_codes = TRUE)
head(regionkey)
#> kunta_code seutukunta_code maakunta_code
#> 1 KU020 SK063 MK06
#> 2 KU887 SK063 MK06
#> 3 KU908 SK063 MK06
#> 4 KU061 SK053 MK05
#> 5 KU834 SK053 MK05
#> 6 KU103 SK053 MK05
You can also get more specialised regional classification keys by setting arguments:
kunta_maakunta_key <- get_regionkey("kunta", "maakunta")
head(kunta_maakunta_key)
#> kunta_name maakunta_name kunta_code maakunta_code
#> 1 Akaa Pirkanmaa KU020 MK06
#> 2 Urjala Pirkanmaa KU887 MK06
#> 3 Valkeakoski Pirkanmaa KU908 MK06
#> 4 Forssa Kanta-Häme KU061 MK05
#> 5 Tammela Kanta-Häme KU834 MK05
#> 6 Humppila Kanta-Häme KU103 MK05
statficlassifications
also gives you a selection of ways and functions
that help you to standardize and manipulate the regional information in
your data. For examples, generate random municipal data:
data <- get_regionkey() |> dplyr::select(kunta_name) |> dplyr::mutate(values = rnorm(dplyr::n()))
head(data)
#> kunta_name values
#> 1 Akaa 1.6162003
#> 2 Urjala -0.9998084
#> 3 Valkeakoski 0.4579793
#> 4 Forssa 0.2866772
#> 5 Tammela 0.3237500
#> 6 Humppila 0.5426492
You can use regional classification tables to add regions to your data:
dplyr::left_join(data, get_regionkey(only_names = TRUE), by = "kunta_name") |> head()
#> kunta_name values seutukunta_name maakunta_name
#> 1 Akaa 1.6162003 Etelä-Pirkanmaa Pirkanmaa
#> 2 Urjala -0.9998084 Etelä-Pirkanmaa Pirkanmaa
#> 3 Valkeakoski 0.4579793 Etelä-Pirkanmaa Pirkanmaa
#> 4 Forssa 0.2866772 Forssa Kanta-Häme
#> 5 Tammela 0.3237500 Forssa Kanta-Häme
#> 6 Humppila 0.5426492 Forssa Kanta-Häme
For a shortcut, use add_region
:
data |> add_region("maakunta") |> head()
#> The region classification in data seems to fit year(s) 2021. A key corresponding to this year is used.
#> Input x not recoded.
#> kunta_name values maakunta
#> 1 Akaa 1.6162003 Akaa
#> 2 Urjala -0.9998084 Urjala
#> 3 Valkeakoski 0.4579793 Valkeakoski
#> 4 Forssa 0.2866772 Forssa
#> 5 Tammela 0.3237500 Tammela
#> 6 Humppila 0.5426492 Humppila
It is also straightforward to compute, say, maakunta-level means give the municipal data.
data |> add_region("maakunta") |>
dplyr::group_by(maakunta) |>
dplyr::summarize(maakunta_mean = mean(values)) |> head()
#> The region classification in data seems to fit year(s) 2021. A key corresponding to this year is used.
#> Input x not recoded.
#> # A tibble: 6 × 2
#> maakunta maakunta_mean
#> <fct> <dbl>
#> 1 Äänekoski 0.189
#> 2 Ähtäri 1.99
#> 3 Akaa 1.62
#> 4 Alajärvi 1.04
#> 5 Alavieska -0.0937
#> 6 Alavus 1.20
statficlassifications
aims to impose the use of prefixed region codes.
Prefixes are useful in making sure codes do not map to multiple names.
There are, for instance, kuntia and seutukuntia with identical numbers.
Prefixes help distinguish between these. The correpondence of region
names and prefixes is
#> prefix name
#> 1 SSS KOKO MAA
#> 2 KU kunta
#> 3 SK seutukunta
#> 4 MK maakunta
#> 5 SA suuralue
#> 6 ELY ely
#> 7 MA manner_suomi_ahvenanmaa
#> 8 TK kuntaryhmitys
Function set_region_codes
helps you set the prefixes of you region
codes:
v <- c("191", "047", "063")
set_region_codes(v)
#> [1] "SK191" "KU047" "SK063"
If your input vector has codes that may match to multiple region codes,
you get an error without giving more information. If you know the region
level to which your codes correspond to you can submit the region level
to region_level
-argument.
v <- c("5", "47", "20")
set_region_codes(v, region_level = "kunta")
#> [1] "KU005" "KU047" "KU020"
Often you find data where kunta codes are not prefixed but the codes of
other regions are. With your information on to which region level codes
the codes without prefixes should be mapped, you can restrict the domain
of this mapping by providing the region_level
-argument:
v <- c("020", "047", "005", "MK01", "MK02")
set_region_codes(v, region_level = "kunta")
#> [1] "KU020" "KU047" "KU005" "MK01" "MK02"
Again sometimes you have codes of multiple region levels mixed in the
data without prefixes but the charachter length of the codes might make
a difference. In these cases you can set use_char_length_info = TRUE
.
The default is way to use character length information is that three
characters incidate kuntia and two characters indicate maakuntia.
v <- c("020", "047", "005", "01", "02")
set_region_codes(v, use_char_length_info = TRUE)
#> [1] "KU020" "KU047" "KU005" "MK01" "MK02"
Prefixed region codes map unambiguosly to region names so it is then
easy to map the codes to names. Note, however, that some region names do
not map uniquely to region codes. In these cases you have to supply the
region_level
-argument again.
v <- c("020", "047", "005", "MK01", "MK02")
v <- set_region_codes(v, region_level = "kunta")
codes_to_names(v)
#> [1] "Akaa" "Enontekiö" "Alajärvi" "Uusimaa"
#> [5] "Varsinais-Suomi"
v <- codes_to_names(v)
names_to_codes(v)
#> [1] "KU020" "KU047" "KU005" "MK01" "MK02"
statficlassifications
ships with a powerful tool to recode variables
with keys and classifications. Recoding of categorical variables are
often based on keys that map one classification of a categorical
variable to an another classification of this same variable.
key_recode()
recodes an input given a such a key. The key can be in
the form of a named vector or a data.frame. The inputs can be vectors,
factors of data.frames.
var1 <- c("a", "b", "a", "c", "b", "b")
key <- data.frame(var1 = letters[1:4],
var2 = c("first letter",
"second letter",
"third letter",
"fourth letter"),
var3 = 1:4)
By default, the variable in key corresponding to the variable to be recoded is found by name:
key_recode(var1, key, to = "var2")
#> [1] "first letter" "second letter" "first letter" "third letter"
#> [5] "second letter" "second letter"
The corresponding variable can however, be found using the values of variables as well:
v <- var1
key_recode(v, key, to = "var2", by = "values")
#> [1] "first letter" "second letter" "first letter" "third letter"
#> [5] "second letter" "second letter"
data.frames can be recoded as well:
df <- data.frame(var1 = var1,
y = 1:6)
key_recode(df, key, to = "var2")
#> Input y not recoded.
#> var2 y
#> 1 first letter 1
#> 2 second letter 2
#> 3 first letter 3
#> 4 third letter 4
#> 5 second letter 5
#> 6 second letter 6
Without ‘to’-argument, recoding happens to all variable other than ‘from’-variable in the key.
key_recode(var1, key)
#> var2 var3
#> 1 first letter 1
#> 2 second letter 2
#> 3 first letter 1
#> 4 third letter 3
#> 5 second letter 2
#> 6 second letter 2
Keys can be named vectors as well:
key <- c("a" = "first letter",
"b" = "second letter",
"c" = "third letter",
"d" = "fourth letter")
key_recode(v, key)
#> [1] "first letter" "second letter" "first letter" "third letter"
#> [5] "second letter" "second letter"
Note that with named vectors the function knows which way you are recoding:
v <- key_recode(v, key)
v
#> [1] "first letter" "second letter" "first letter" "third letter"
#> [5] "second letter" "second letter"
v <- key_recode(v, key)
v
#> [1] "a" "b" "a" "c" "b" "b"