ropensci / rnaturalearth

An R package to hold and facilitate interaction with natural earth map data :earth_africa:
http://ropensci.github.io/rnaturalearth/

Why all factor variables? #13

Closed dpprdan closed 7 years ago

dpprdan commented 7 years ago

Why are all attributes factor variables in ne_countries?

> library("rnaturalearth")
> map_ne <- ne_countries()
> str(map_ne@data)
'data.frame':   177 obs. of  63 variables:
 $ scalerank : Factor w/ 2 levels "1","3": 1 1 1 1 1 1 1 2 1 1 ...
 $ featurecla: Factor w/ 1 level "Admin-0 country": 1 1 1 1 1 1 1 1 1 1 ...
 $ labelrank : Factor w/ 6 levels "2","3","4","5",..: 2 2 5 3 1 5 3 5 1 3 ...
 $ sovereignt: Factor w/ 171 levels "Afghanistan",..: 1 4 2 159 6 7 5 52 8 9 ...
 $ sov_a3    : Factor w/ 171 levels "AFG","AGO","ALB",..: 1 2 3 4 5 6 7 54 8 9 ...
 $ adm0_dif  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 1 ...
 $ level     : Factor w/ 1 level "2": 1 1 1 1 1 1 1 1 1 1 ...
 $ type      : Factor w/ 5 levels "Country","Dependency",..: 5 5 5 5 5 5 4 2 1 5 ...
 $ admin     : Factor w/ 177 levels "Afghanistan",..: 1 4 2 165 6 7 5 54 8 9 ...
 $ adm0_a3   : Factor w/ 177 levels "AFG","AGO","ALB",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ geou_dif  : Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
 $ geounit   : Factor w/ 177 levels "Afghanistan",..: 1 4 2 166 6 7 5 54 8 9 ...
 $ gu_a3     : Factor w/ 177 levels "AFG","AGO","ALB",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ su_dif    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ subunit   : Factor w/ 177 levels "Afghanistan",..: 1 4 2 166 6 7 5 54 8 9 ...
 $ su_a3     : Factor w/ 177 levels "AFG","AGO","ALB",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ brk_diff  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ name      : Factor w/ 177 levels "Afghanistan",..: 1 4 2 166 6 7 5 56 8 9 ...
 $ name_long : Factor w/ 177 levels "Afghanistan",..: 1 4 2 166 6 7 5 56 8 9 ...
 $ brk_a3    : Factor w/ 177 levels "AFG","AGO","ALB",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ brk_name  : Factor w/ 177 levels "Afghanistan",..: 1 4 2 166 6 7 5 55 8 9 ...
 $ brk_group : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ abbrev    : Factor w/ 177 levels "Afg.","Alb.",..: 1 4 2 164 6 7 5 54 9 8 ...
 $ postal    : Factor w/ 172 levels "A","AE","AF",..: 3 5 4 2 7 8 6 149 9 1 ...
 $ formal_en : Factor w/ 174 levels "Arab Republic of Egypt",..: 38 68 73 169 2 74 NA 164 10 75 ...
 $ formal_fr : Factor w/ 4 levels "Nouvelle-Caldonie",..: NA NA NA NA NA NA NA NA NA NA ...
 $ note_adm0 : Factor w/ 6 levels "Commonwealth of U.S.A.",..: NA NA NA NA NA NA NA 3 NA NA ...
 $ note_brk  : Factor w/ 8 levels "Admin. by U.K.; Claimed by Argentina",..: NA NA NA NA NA NA 2 NA NA NA ...
 $ name_sort : Factor w/ 177 levels "Afghanistan",..: 1 4 2 166 6 7 5 57 8 9 ...
 $ name_alt  : Factor w/ 2 levels "East Timor","Islas Malvinas": NA NA NA NA NA NA NA NA NA NA ...
 $ mapcolor7 : Factor w/ 7 levels "1","2","3","4",..: 5 3 1 2 3 3 4 7 1 3 ...
 $ mapcolor8 : Factor w/ 8 levels "1","2","3","4",..: 6 2 4 1 1 1 5 5 2 1 ...
 $ mapcolor9 : Factor w/ 9 levels "1","2","3","4",..: 8 6 1 3 3 2 1 9 2 3 ...
 $ mapcolor13: Factor w/ 14 levels "-99","1","10",..: 12 2 11 8 6 3 1 4 12 9 ...
 $ pop_est   : Factor w/ 177 levels "-99","10057975",..: 72 20 95 118 104 76 96 26 52 160 ...
 $ gdp_md_est: Factor w/ 177 levels "-99","10040",..: 67 11 65 45 140 46 155 39 159 98 ...
 $ pop_year  : Factor w/ 2 levels "-99","0": 1 1 1 1 1 1 1 1 1 1 ...
 $ lastcensus: Factor w/ 27 levels "-99","1970","1979",..: 3 2 16 25 25 16 1 1 21 26 ...
 $ gdp_year  : Factor w/ 3 levels "-99","0","2009": 1 1 1 1 1 1 1 1 1 1 ...
 $ economy   : Factor w/ 7 levels "1. Developed region: G7",..: 7 7 6 6 5 6 6 6 2 2 ...
 $ income_grp: Factor w/ 5 levels "1. High income: OECD",..: 5 3 4 2 3 4 2 2 1 1 ...
 $ wikipedia : Factor w/ 2 levels "-99","0": 1 1 1 1 1 1 1 1 1 1 ...
 $ fips_10   : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ iso_a2    : Factor w/ 175 levels "-99","AE","AF",..: 3 6 4 2 8 5 7 153 10 9 ...
 $ iso_a3    : Factor w/ 175 levels "-99","AFG","AGO",..: 2 3 4 5 6 7 8 9 10 11 ...
 $ iso_n3    : Factor w/ 175 levels "-99","004","008",..: 2 6 3 159 8 13 4 57 9 10 ...
 $ un_a3     : Factor w/ 172 levels "-099","004","008",..: 2 5 3 156 7 12 1 1 8 9 ...
 $ wb_a2     : Factor w/ 171 levels "-99","AE","AF",..: 3 6 4 2 7 5 1 1 9 8 ...
 $ wb_a3     : Factor w/ 171 levels "-99","AFG","AGO",..: 2 3 4 5 6 7 1 1 8 9 ...
 $ woe_id    : Factor w/ 1 level "-99": 1 1 1 1 1 1 1 1 1 1 ...
 $ adm0_a3_is: Factor w/ 173 levels "AFG","AGO","ALB",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ adm0_a3_us: Factor w/ 175 levels "AFG","AGO","ALB",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ adm0_a3_un: Factor w/ 1 level "-99": 1 1 1 1 1 1 1 1 1 1 ...
 $ adm0_a3_wb: Factor w/ 1 level "-99": 1 1 1 1 1 1 1 1 1 1 ...
 $ continent : Factor w/ 8 levels "Africa","Antarctica",..: 3 1 4 3 8 3 2 7 6 4 ...
 $ region_un : Factor w/ 7 levels "Africa","Americas",..: 4 1 5 4 2 4 3 7 6 5 ...
 $ subregion : Factor w/ 22 levels "Antarctica","Australia and New Zealand",..: 18 10 19 21 16 21 1 14 2 22 ...
 $ region_wb : Factor w/ 8 levels "Antarctica","East Asia & Pacific",..: 7 8 3 5 4 3 1 8 2 3 ...
 $ name_len  : Factor w/ 16 levels "10","11","12",..: 2 13 14 9 16 14 1 10 16 14 ...
 $ long_len  : Factor w/ 21 levels "10","11","12",..: 2 18 19 11 21 19 1 15 21 19 ...
 $ abbrev_len: Factor w/ 8 levels "10","3","4","5",..: 3 3 3 5 3 3 3 1 3 4 ...
 $ tiny      : Factor w/ 5 levels "-99","2","3",..: 1 1 1 1 1 1 1 2 1 1 ...
 $ homepart  : Factor w/ 2 levels "-99","1": 2 2 2 2 2 2 2 1 2 2 ...

I guess this might be related to "keep the data as close as possible to Natural Earth", but character and numeric make so much more sense, IMHO.

Also, missing values (-99, and -099 in un_a3) should be coded as NA, IMO.
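To make the request concrete, here is a minimal base R sketch of the kind of conversion meant above. It is not how the package fixed the issue: the toy data frame, the `missing_codes` vector, and the hand-picked `numeric_cols` are all assumptions for illustration, and with the real object you would start from `map_ne@data` instead.

```r
# Toy stand-in for map_ne@data; the values are invented for illustration.
df <- data.frame(
  name    = factor(c("Afghanistan", "Angola", "Albania")),
  pop_est = factor(c("28400000", "-99", "3639453")),
  un_a3   = factor(c("004", "-099", "008"))
)

# Assumptions: which codes mean "missing" and which columns are truly numeric.
missing_codes <- c("-99", "-099")
numeric_cols  <- c("pop_est")

# 1. Factor -> character. Going through character first matters:
#    as.numeric() on a factor returns the internal level codes, not the values.
df[] <- lapply(df, as.character)

# 2. Sentinel codes -> NA.
df[] <- lapply(df, function(x) {
  x[x %in% missing_codes] <- NA
  x
})

# 3. Character -> numeric, only for the columns chosen above, so that
#    identifier codes such as un_a3 keep their leading zeros.
df[numeric_cols] <- lapply(df[numeric_cols], as.numeric)

str(df)
```

Listing the numeric columns by hand is deliberate here: automatic type guessing would turn codes like "004" into the number 4.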

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rnaturalearth_0.1.0 pacman_0.4.1       

loaded via a namespace (and not attached):
[1] tools_3.3.2     sp_1.2-4        grid_3.3.2      lattice_0.20-34

andysouth commented 7 years ago

Thanks @dpprdan, well spotted. Both should now be fixed. Can you check? If so, I'll close the issue.

dpprdan commented 7 years ago

I just did a quick check and it looks fine to me now. One thing I only noticed now is that quite a few variables had only one or two levels (when they were all factors) or have a lot of NAs (now), e.g. gdp_year, pop_year, fips_10, wikipedia, etc. gdp_year == 0 also does not make a lot of sense, does it? So is this data missing in the source, or is it an import issue? If it's something that cannot be fixed, it might be a good idea to drop these variables. Again, this might not be "as close as possible to NE", but what good are these variables if there is no information in them? But maybe this is a new issue?

andysouth commented 7 years ago

Thanks. In the cases I've looked at, the data are already missing in the shapefiles from Natural Earth, so you might want to raise that with them. The problem with dropping variables is that the code becomes more complex: it has to choose which variables to drop, and it would need to change again if they are fixed upstream. All of that makes the code harder to maintain.
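For anyone who still wants to work without the near-empty variables, one option is to filter them out on the user side rather than in the package. The following is only a sketch under assumptions: the toy data frame stands in for `map_ne@data`, and "uninformative" is taken to mean that a column is entirely NA or holds a single distinct non-missing value.

```r
# Toy stand-in for map_ne@data after the fix; values are invented.
df <- data.frame(
  name     = c("Afghanistan", "Angola", "Albania"),
  gdp_year = c(NA, NA, NA),
  woe_id   = c(NA, NA, NA),
  level    = c(2, 2, 2),
  stringsAsFactors = FALSE
)

# Number of distinct non-missing values per column.
n_distinct <- vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1))

# Share of missing values per column, useful for eyeballing candidates.
na_share <- vapply(df, function(x) mean(is.na(x)), numeric(1))

# Columns that are all NA or constant carry no information for analysis.
uninformative <- names(df)[n_distinct <= 1]

# Keep only the informative columns; drop = FALSE preserves the data frame.
df_slim <- df[, n_distinct > 1, drop = FALSE]

print(uninformative)
print(names(df_slim))
```

Because the check is computed from the data each time, it does not need a hard-coded list of variables and keeps working if the values are later filled in upstream.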