Duplicated variable columns in geospatial data

pitkant commented 2 years ago

Currently, the geospatial datasets for the different years have the following columns:

year / variable	2003	2006	2010	2013	2016	2021
id	x	x	x	x	x	x
LEVL_CODE	x	x	x	x	x	x
NUTS_ID	x	x	x	x	x	x
CNTR_CODE	x	x	x	x	x	x
NAME_LATN		x	x	x	x	x
NUTS_NAME	x	x	x	x	x	x
MOUNT_TYPE					x	x
URBN_TYPE					x	x
COAST_TYPE					x	x
FID	x	x	x	x	x	x
geometry	x	x	x	x	x	x
geo	x	x	x	x	x	x

Of these, at least in years 2016 and 2021, the following variables contain identical information: id, NUTS_ID, FID and geo. The id column is the unique identifier from the geojson file and is not included in the csv file. The geo column is generated at the end of get_eurostat_geospatial "for easier joins with dplyr", as well as in the data generation script data_spatial.R.
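
A minimal sketch of how the overlap could be checked, using base R only. The data frame below is a toy stand-in for the attribute table of one dataset, and its values are illustrative assumptions rather than output from get_eurostat_geospatial:

```r
# Toy stand-in for the attribute columns of a 2016/2021 dataset
# (illustrative values, not real package output).
nuts <- data.frame(
  id      = c("FI1", "SE1", "DE1"),
  NUTS_ID = c("FI1", "SE1", "DE1"),
  FID     = c("FI1", "SE1", "DE1"),
  geo     = c("FI1", "SE1", "DE1"),
  stringsAsFactors = FALSE
)

id_cols <- c("id", "NUTS_ID", "FID", "geo")

# For each candidate column, test whether its values match NUTS_ID
# element by element. The result is a named logical vector.
same_as_nuts_id <- vapply(
  nuts[id_cols],
  function(col) identical(as.character(col), as.character(nuts$NUTS_ID)),
  logical(1)
)

print(same_as_nuts_id)
```

Running the same check on the real objects for each year would show which of the four columns are truly redundant in which release.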

While some of this overlap is due to the eurostat data itself containing duplicated columns, is the geo column still necessary?

antagomir commented 2 years ago

If this can be easily retrieved otherwise when needed (example, maybe?) then I guess it could also be removed.
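
For reference, a rough sketch of two ways to manage without a ready-made geo column, assuming NUTS_ID carries the same codes. Both tables are toy data with made-up values, and the column names geo and values on the statistics side are assumptions about a typical Eurostat data table:

```r
# Toy statistics table keyed by `geo` (made-up values).
stats <- data.frame(
  geo    = c("FI1", "SE1"),
  values = c(10, 20),
  stringsAsFactors = FALSE
)

# Toy spatial attribute table WITHOUT a `geo` column.
nuts <- data.frame(
  NUTS_ID   = c("FI1", "SE1", "DE1"),
  CNTR_CODE = c("FI", "SE", "DE"),
  stringsAsFactors = FALSE
)

# Option 1: join directly on differently named key columns.
joined_direct <- merge(stats, nuts, by.x = "geo", by.y = "NUTS_ID")

# Option 2: recreate `geo` from NUTS_ID when needed, then join by name.
nuts$geo <- nuts$NUTS_ID
joined_recreated <- merge(stats, nuts, by = "geo")

print(joined_direct)
print(joined_recreated)
```

With dplyr, the first option corresponds to left_join(stats, nuts, by = c("geo" = "NUTS_ID")), so the convenience column saves only a few characters per join.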

pitkant commented 1 year ago

Partly addressed in the v4-dev branch and PR #264. The geo column is now marked as "Questioning" in the get_eurostat_geospatial function documentation, which gives us some more time to discuss whether to remove it or keep it in the future.

I will close this issue when v4 is released.

rOpenGov / eurostat

Duplicated variable columns in geospatial data #240