vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

Pseudo-NAs in `codelist` #295

Closed salim-b closed 2 years ago

salim-b commented 2 years ago

I noticed that in codelist there's a mixture of true NAs and "NA" strings. Apart from columns that contain two-letter country codes like iso2c where "NA" stands for Namibia, I think there should be no "NA" strings present.

Or am I wrong?


To display only affected rows and columns from codelist, use:

countrycode::codelist %>%
    dplyr::filter(dplyr::if_any(.cols = -iso2c,
                                .fns = ~ .x == "NA")) %>%
    dplyr::select(where(~ "NA" %in% .x))

List of affected columns:

```
ar5 ecb eu28 eurocontrol_pru eurocontrol_statfor eurostat genc2c iso2c
region23 un.region.code un.regionintermediate.code un.regionsub.code wb_api2c
cldr.name.bem cldr.name.bo cldr.name.cu cldr.name.dua cldr.name.dyo
cldr.name.gv cldr.name.haw cldr.name.ii cldr.name.kkj cldr.name.kl
cldr.name.kw cldr.name.lkt cldr.name.lrc cldr.name.mgh cldr.name.mgo
cldr.name.mi cldr.name.nnh cldr.name.nus cldr.name.om cldr.name.os
cldr.name.pa_arab cldr.name.rw cldr.name.sah cldr.name.uz_arab cldr.name.xh
cldr.short.bo cldr.short.gv cldr.short.haw cldr.short.ii cldr.short.lkt
cldr.short.lrc cldr.short.mgh cldr.short.mi cldr.short.om cldr.short.os
cldr.short.sah cldr.variant.dyo cldr.variant.mgh cldr.variant.mi
cldr.variant.nus cldr.variant.rw cldr.variant.xh
```

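
For anyone who wants to see the distinction without the package installed, here is a minimal base-R sketch of how "NA" strings can be told apart from true NAs, column by column. The `toy` data frame is a hypothetical stand-in for `codelist`, and its `eu28` values are illustrative only. It is not how the list above was produced.

```r
# Hypothetical toy stand-in for countrycode::codelist (illustrative values).
toy <- data.frame(
  country.name.en = c("Namibia", "Antarctica", "Canada"),
  iso2c           = c("NA", "AQ", "CA"),   # "NA" is legitimate here: Namibia
  un.region.code  = c("2", "NA", "19"),    # "NA" string where a true NA belongs
  eu28            = c(NA, "NA", NA),       # mixture of true NA and "NA" string
  stringsAsFactors = FALSE
)

# `x == "NA"` yields NA for missing values, so use %in%, which treats a
# true NA as "no match" and returns FALSE for it.
pseudo_na_counts <- vapply(toy, function(x) sum(x %in% "NA"), integer(1))
true_na_counts   <- vapply(toy, function(x) sum(is.na(x)), integer(1))

# Columns that contain at least one "NA" string.
affected <- names(pseudo_na_counts)[pseudo_na_counts > 0]
print(affected)
print(true_na_counts)
```

Note that `iso2c` is flagged as well, which is expected: a check like this cannot tell Namibia's code apart from a pseudo-NA, so two-letter code columns have to be reviewed by hand.
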
vincentarelbundock commented 2 years ago

No, I think there is something wrong with your code:

library(dplyr)
library(countrycode)

codelist %>%
  filter(ecb == "NA") |>
  select(country.name.en, ecb)
## # A tibble: 1 × 2
##   country.name.en ecb  
##   <chr>           <chr>
## 1 Namibia         NA
cjyetman commented 2 years ago

Maybe it's a bit of both?

library(dplyr)
library(countrycode)

codes_w_na <- c(
  "ar5",
  "eu28",
  "eurocontrol_pru",
  "eurocontrol_statfor",
  "region23",
  "un.region.code",
  "un.regionintermediate.code",
  "un.regionsub.code"
)

lapply(codes_w_na, function(code) {
  codelist %>% 
    select(country.name.en, {{code}}) %>% 
    filter(.data[[code]] == "NA")
})
#> [[1]]
#> # A tibble: 9 × 2
#>   country.name.en            ar5  
#>   <chr>                      <chr>
#> 1 Caribbean Netherlands      NA   
#> 2 German Democratic Republic NA   
#> 3 Hong Kong SAR China        NA   
#> 4 Macao SAR China            NA   
#> 5 Oman                       NA   
#> 6 Republic of Vietnam        NA   
#> 7 Saint Martin (French part) NA   
#> 8 South Korea                NA   
#> 9 St. Barthélemy             NA   
#> 
#> [[2]]
#> # A tibble: 217 × 2
#>    country.name.en   eu28 
#>    <chr>             <chr>
#>  1 Afghanistan       NA   
#>  2 Åland Islands     NA   
#>  3 Albania           NA   
#>  4 Algeria           NA   
#>  5 American Samoa    NA   
#>  6 Andorra           NA   
#>  7 Angola            NA   
#>  8 Anguilla          NA   
#>  9 Antigua & Barbuda NA   
#> 10 Argentina         NA   
#> # … with 207 more rows
#> 
#> [[3]]
#> # A tibble: 12 × 2
#>    country.name.en                        eurocontrol_pru
#>    <chr>                                  <chr>          
#>  1 Andorra                                NA             
#>  2 Bouvet Island                          NA             
#>  3 British Indian Ocean Territory         NA             
#>  4 French Southern Territories            NA             
#>  5 Heard & McDonald Islands               NA             
#>  6 Pitcairn Islands                       NA             
#>  7 San Marino                             NA             
#>  8 Senegal                                NA             
#>  9 South Georgia & South Sandwich Islands NA             
#> 10 Tokelau                                NA             
#> 11 Vatican City                           NA             
#> 12 Zanzibar                               NA             
#> 
#> [[4]]
#> # A tibble: 12 × 2
#>    country.name.en                        eurocontrol_statfor
#>    <chr>                                  <chr>              
#>  1 Andorra                                NA                 
#>  2 Bouvet Island                          NA                 
#>  3 British Indian Ocean Territory         NA                 
#>  4 French Southern Territories            NA                 
#>  5 Heard & McDonald Islands               NA                 
#>  6 Pitcairn Islands                       NA                 
#>  7 San Marino                             NA                 
#>  8 Senegal                                NA                 
#>  9 South Georgia & South Sandwich Islands NA                 
#> 10 Tokelau                                NA                 
#> 11 Vatican City                           NA                 
#> 12 Zanzibar                               NA                 
#> 
#> [[5]]
#> # A tibble: 1 × 2
#>   country.name.en  region23
#>   <chr>            <chr>   
#> 1 Christmas Island NA      
#> 
#> [[6]]
#> # A tibble: 1 × 2
#>   country.name.en un.region.code
#>   <chr>           <chr>         
#> 1 Antarctica      NA            
#> 
#> [[7]]
#> # A tibble: 141 × 2
#>    country.name.en un.regionintermediate.code
#>    <chr>           <chr>                     
#>  1 Afghanistan     NA                        
#>  2 Åland Islands   NA                        
#>  3 Albania         NA                        
#>  4 Algeria         NA                        
#>  5 American Samoa  NA                        
#>  6 Andorra         NA                        
#>  7 Antarctica      NA                        
#>  8 Armenia         NA                        
#>  9 Australia       NA                        
#> 10 Austria         NA                        
#> # … with 131 more rows
#> 
#> [[8]]
#> # A tibble: 1 × 2
#>   country.name.en un.regionsub.code
#>   <chr>           <chr>            
#> 1 Antarctica      NA
NilsEnevoldsen commented 2 years ago

Good catch. The UN ones (at least) are my fault from 4b923912861d2ed8e13763c2653a726fa942d177.

salim-b commented 2 years ago

@vincentarelbundock

With ecb there's indeed no problem as the only "NA" row stands for Namibia. But as @cjyetman showed in more detail above, there are lots of columns where no "NA" should occur.

> Good catch. The UN ones (at least) are my fault from 4b92391.

I see. Either we replace all the "NA" values in the CSV sources with "" (except where they stand for Namibia, of course), or we read in the CSVs with the na.strings arg set to c("", "NA") (or the equivalent na arg of readr::read_csv()).
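
A rough sketch of one way to combine the two ideas, so that Namibia's code survives: read with only "" as the missing-value marker, then turn "NA" strings into true NAs everywhere except in the two-letter code columns. The inline CSV and the `keep_na_string` vector are assumptions made for illustration. This is not what `dictionary/build.R` does, and the real set of columns to protect would need checking.

```r
# Hypothetical two-row CSV standing in for one of the dictionary sources.
csv_text <- "country.name.en,iso2c,un.region.code
Namibia,NA,2
Antarctica,AQ,NA
"

# With na.strings = "", the literal string "NA" is kept as text.
raw <- read.csv(text = csv_text, na.strings = "", colClasses = "character")

# Assumption: columns in which "NA" is a real code (Namibia) and must be kept.
keep_na_string <- c("iso2c")

# In every other column, convert "NA" strings to true missing values.
fix_cols <- setdiff(names(raw), keep_na_string)
raw[fix_cols] <- lapply(raw[fix_cols], function(x) {
  x[x %in% "NA"] <- NA_character_
  x
})

print(raw)
```

Reading with `na.strings = c("", "NA")` instead would be shorter, but it would also turn Namibia's `iso2c` into a true NA, so the code columns would then have to be restored separately.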

The issue was probably introduced with https://github.com/vincentarelbundock/countrycode/commit/a88d04056b5d28a3dddefa85c28c5c2864294b60 where you explicitly set na = "". Relevant source position:

https://github.com/vincentarelbundock/countrycode/blob/a88d04056b5d28a3dddefa85c28c5c2864294b60/dictionary/build.R#L41

NilsEnevoldsen commented 2 years ago

I'm preparing a patch.

vincentarelbundock commented 2 years ago

@salim-b sorry for the curt and incorrect answer earlier -- someone was pushing me out the door... Thanks for the report!