rOpenGov / eurostat

R tools for Eurostat data
http://ropengov.github.io/eurostat

Duplicated variable columns in geospatial data #240

Open pitkant opened 2 years ago

pitkant commented 2 years ago

Currently different geospatial datasets have the following columns:

year / variable 2003 2006 2010 2013 2016 2021
id x x x x x x
LEVL_CODE x x x x x x
NUTS_ID x x x x x x
CNTR_CODE x x x x x x
NAME_LATN x x x x x
NUTS_NAME x x x x x x
MOUNT_TYPE x x
URBN_TYPE x x
COAST_TYPE x x
FID x x x x x x
geometry x x x x x x
geo x x x x x x

Of these, at least in the 2016 and 2021 datasets, the following columns contain identical information: id, NUTS_ID, FID and geo. The id column is the unique identifier from the geojson file and is not included in the csv file. The geo column is generated at the end of get_eurostat_geospatial "for easier joins with dplyr", and also in the data generation script data_spatial.R.

While some of this overlap is due to the Eurostat data itself containing duplicated columns, is the geo column still necessary?
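To make the overlap concrete, here is a minimal base R sketch. It uses a small made-up data frame in place of the real geospatial data (the values and the stats table are hypothetical, not output from get_eurostat_geospatial). It checks that the identifier columns match NUTS_ID and shows that a join still works when the key columns have different names.

```r
# Toy stand-in for a NUTS table; the values are made up for illustration
# and are NOT taken from the actual geospatial datasets.
nuts <- data.frame(
  id      = c("FI1B", "SE11", "DK01"),
  NUTS_ID = c("FI1B", "SE11", "DK01"),
  FID     = c("FI1B", "SE11", "DK01"),
  geo     = c("FI1B", "SE11", "DK01"),
  stringsAsFactors = FALSE
)

# Check, column by column, whether each identifier is identical to NUTS_ID.
id_cols <- c("id", "FID", "geo")
same_as_nuts_id <- vapply(
  id_cols,
  function(col) identical(nuts[[col]], nuts[["NUTS_ID"]]),
  logical(1)
)
print(same_as_nuts_id)

# Hypothetical statistics table keyed by a geo column.
stats <- data.frame(
  geo    = c("FI1B", "SE11", "DK01"),
  values = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

# Drop the redundant identifier columns and join on NUTS_ID instead of geo.
nuts_slim <- nuts[, setdiff(names(nuts), id_cols), drop = FALSE]
joined <- merge(nuts_slim, stats, by.x = "NUTS_ID", by.y = "geo")
print(joined)
```

With dplyr the same join could be written as left_join(nuts_slim, stats, by = c("NUTS_ID" = "geo")), so the geo column is a convenience rather than a requirement for joining.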

antagomir commented 2 years ago

If this can be easily retrieved otherwise when needed (an example, maybe?), then I guess it could also be removed.

pitkant commented 1 year ago

Partly addressed in the v4-dev branch and PR #264. The geo column is now marked as "Questioning" in the get_eurostat_geospatial function documentation, which gives us more time to discuss whether to remove it or keep it in the future.

I will close this issue when v4 is released.