vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

New codes for some commonly used MRIO databases? #323

Closed christophe-gouel closed 1 year ago

christophe-gouel commented 1 year ago

Hi,

Are you interested in, and would this package be a good place for, the mappings used in many multi-region input-output (MRIO) tables? MRIO tables such as GTAP, WIOD, EXIOBASE, and others are commonly used in economics. They usually represent the world as a limited number of countries, with the remaining countries grouped into various Rest of the World regions. It would be super useful to have a package that contains both the country codes these databases use and the mappings between individual countries and the Rest of the World regions.

vincentarelbundock commented 1 year ago

Hey, thanks for the suggestion. Yes, I think this would be interesting.

One tricky aspect is that those mappings do not seem to be one-to-one, so we would need to think about which conversion directions are allowed. And maybe this shouldn't go in the main dictionary, but we could distribute a custom dictionary that users could easily plug into the custom_dict argument.
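To illustrate the custom-dictionary idea: countrycode's `custom_dict` argument accepts a plain data frame whose column names serve as origin/destination codes. The `gtap_dict` below is a made-up three-row example, not an actual GTAP mapping:

```r
library(countrycode)

# Hypothetical custom dictionary: a plain data frame, not shipped with the package
gtap_dict <- data.frame(
  country.name = c("France", "Germany", "Japan"),
  gtap_chr     = c("fra", "deu", "jpn")
)

# With custom_dict, origin and destination refer to columns of gtap_dict,
# and matching on the origin column is exact
countrycode("Germany",
  origin      = "country.name",
  destination = "gtap_chr",
  custom_dict = gtap_dict
)
# "deu"
```

With this approach the MRIO mappings could live in separate CSV files that users load and pass in, without touching the main dictionary.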

To be honest, I don't have a lot of bandwidth to work on this now, but if you can supply a single R script which does the following, I'll see what I can do to make it easy to use for countrycode users:

  1. Scrape the codes from an official source online.
  2. Merge the codes as new columns in this file: https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/data_regex.csv
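Step 2 amounts to a left join on the country-name column. A minimal base-R sketch with toy data frames standing in for the real files (the actual data_regex.csv has many more columns, and the scraped table would come from step 1):

```r
# Toy stand-in for dictionary/data_regex.csv
dict <- data.frame(
  country.name = c("France", "Germany", "Japan"),
  iso3c        = c("FRA", "DEU", "JPN")
)

# Toy stand-in for the scraped codes (hypothetical column name gtap_chr)
scraped <- data.frame(
  country.name = c("France", "Japan"),
  gtap_chr     = c("fra", "jpn")
)

# Left join: keep every dictionary row, add the new code column,
# leaving NA for countries the database does not cover
merged <- merge(dict, scraped, by = "country.name", all.x = TRUE)
merged
#   country.name iso3c gtap_chr
# 1       France   FRA      fra
# 2      Germany   DEU     <NA>
# 3        Japan   JPN      jpn
```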
christophe-gouel commented 1 year ago

I would be happy to help. However, I am wondering what the point of the scraping part is. For these databases, mappings are lying around (we have several where I work), but not always in easily scrapable forms, so it would be easier just to provide CSV files with the mappings. Since these databases are fixed once produced, there is no need to update a mapping once it exists.

vincentarelbundock commented 1 year ago

In general, we prefer scraping code that pulls directly from an official source, because it can help us update the package when the sources update. This has been helpful numerous times in the past.

If that's not possible, or if it's too much work, it's probably not a big deal.

christophe-gouel commented 1 year ago

Understood.

I will do something soon.

cjyetman commented 1 year ago

Here's a start on some scrapers...

get_gtap.R

library(rvest)
library(dplyr)
library(tidyr)
library(countrycode)

url <- "https://www.gtap.agecon.purdue.edu/databases/regions.aspx?version=10.211"

rvest::read_html(x = url) %>%
  rvest::html_element(css = "table#Regions") %>%
  rvest::html_table() %>%
  tidyr::separate_wider_delim(
    cols = "Description",
    delim = " - ",
    names_sep = "_",
    too_few = "align_start"
  ) %>%
  dplyr::rename(
    gtap_num = "Number",
    gtap_chr = "Code"
  ) %>%
  dplyr::mutate(gtap_name = .data$Description_1) %>%
  tidyr::pivot_longer(
    cols = tidyr::starts_with("Description_"),
    names_to = NULL,
    values_to = "country",
    values_drop_na = TRUE
  ) %>%
  dplyr::filter(!grepl("Rest of ", country)) %>%
  dplyr::mutate(
    country.name = countrycode::countrycode(.data$country, "country.name", "country.name", warn = FALSE),
    .keep = "unused",
    .before = 1
  ) %>%
  dplyr::filter(!is.na(.data$country.name)) %>%
  dplyr::distinct()

get_desire.R

library(readxl)
library(dplyr)
library(countrycode)

# CountryMappingDESIRE.xlsx from https://ntnu.app.box.com/v/EXIOBASEconcordances
url <- "https://ec2euc1.boxcloud.com/d/1/a1!cs4y1r6iTVju7tMmmvl6IXCY0GqXuScHnYoDiHoHe-ADi5bmcSXi_AjVBRfsiYjEpmvH-B9N0735kCg7HJ-xY9nYjZ8FZz9pFBjXs04Ir5nToc7NZAbVFHTf8l8hVIpvGSmyRZ9U_bzZ7m51PF7xdk10bCvhvMp4kI44NuoBooXSqktyRWeEYgGUr6zfcRFOwAAsCUiTV484-awjzS5ej8Gbs55T69hTWr1xaYcmylCuf3KWSQTXos7IWwZgwD1kDZFon-lK8YkIbiifbxKK_jWceL5wIjULtrPrfEnwPD8eeIDXlZy8IHTM-idedK16sshjUTyneayCDYfDXnM-okspW8MC7hTZYte_VCRoAAQFVaLKZPBCSAEghnihz86W9EBwYXcU23IwxM688A6F__JAE5c70w-tvKyylL_5dPAjpKEVzRUgSPrckInPxBnENnTCsO_9S9Hu6Yq9fxx1tcYqQmEu8Vbls8fHFVWWp6BnbD3XPQNv_NJGnGty7SfSlRAY_CIAqmOJwjNdeIo8lCYrr-FZRho_S6ZfmvXGzI-FM7J3GQ0FkRwUvZjoYDjMwkFMiyAGW7yrVRSp7WiUkfiuKx3WAguSrcSQ5WBp9fVAPkIzpe57nKOVI5meEjU9JGiwRoamU0e_OEh-1pAlwh6t5f6Y2I6q78e3-HpOnL2l-2qLZR4OaA30rF54F0vtuhHEaYWvdGmEuc8UEDydVE6R4-TAPHN9_ocwP9d0V7Q-OVlzfuEOCbqDCB2lh2uaSvEJbp5aesTP2JSz_75Q-qbu0MrGqmMqhqBTIVvO247uOV2Jt0puPO5C139w58Zw3hkhmjMjsExIAGPncivP2-SFIXHnndCmjb7dIcOFNLG-Z-eWFTwuhwxzzN-StDka8QErUFverbcRHdZtjFdfW1eggmNpNQSGpWxF766ibaGFyaF6aGnMnEto2gwcquk29Q565-M5oVcfepEpxiUFynKcTVMRLuRdBLGvSJKjFuQW4Hy-m_-tIcpYorSo-Mee4YN5z0pNv7TIWQaAxmR3Oup-GFMcmtf41UNdBWCAiq4He_XwYqitkOZTN_AW-UBAoy03XVFAlE0chnzyFcGlbwttwtAg_PrV4oQh4_abkiGz9GJhjVQLxpJMH0lVc4ucNaWdME38_yZ5dTf3rwY-DubMN0j5PcL6zx7yrkWWfWZ-gFaBtqij9h3nSY_nwu_i6G4vyU530Ir_zQYxWCgjy0A-GlqdgYwkoO0_j8LcgnDuAUaOKb_rYzSnOlJh6MPuxk34T_h-wlG1LPNwvgPl6_7CUfFPZFE-zFbckMYLj8CEEGCiZcaG6ODnoerDIaxZeL3FfNXMckFdubq43rRhN1SeFrwNWoskw4lk5cTSUWa0rb5FNNAUc1KFyTOxTHNQJd0LgOLpkWMRCMj1eH6n6wkLYhspTgN0/download"

tmp <- tempfile()
download.file(url, tmp, quiet = TRUE)
data <- readxl::read_excel(tmp, sheet = "CountryList")
unlink(tmp)

data %>%
  dplyr::mutate(
    country.name = countrycode::countrycode(.data$Name, "country.name", "country.name", warn = FALSE)
  ) %>%
  dplyr::filter(!is.na(.data$country.name)) %>%
  dplyr::select(
    "country.name",
    desire_code = "DESIRE code",
    desire_num = "DESIRE Number",
    desire_region = "DESIRE region",
    desire_region_name = "DESIRE region name"
  )
christophe-gouel commented 1 year ago

Thanks a lot for the head start @cjyetman. I will complete that.

christophe-gouel commented 1 year ago

After some adaptations, I have created a PR with the two datasets for which @cjyetman proposed some code.

2 comments: