reconhub / linelist

An R package to import, clean, and store case data
https://www.repidemicsconsortium.org/linelist

clean_data creates duplicate column names #100

Closed. ffinger closed this issue 4 years ago

ffinger commented 4 years ago
library(linelist)
library(dplyr)
library(magrittr)
data(iris)
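# Add a column whose name differs from Sepal.Length only in case;
# after cleaning, both names become sepal_length.
iris %<>%
  mutate(sepal.length = Sepal.Length) %>%
  clean_data()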
glimpse(iris)
#> Observations: 150
#> Variables: 6
#> $ sepal_length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5…
#> $ sepal_width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3…
#> $ petal_length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1…
#> $ petal_width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0…
#> $ species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, set…
#> $ sepal_length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5…

Duplicated names then cause problems for many functions applied to the data.frame.
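
As a small illustration of the kind of problem meant here, the following base-R sketch uses a made-up data frame `df` (not taken from the package) in which two columns share a name. Name-based access then reaches only the first of them:

```r
# Hypothetical data frame where both columns are named "a".
# check.names = FALSE stops data.frame() from repairing the names itself.
df <- data.frame(a = 1:3, a = 4:6, check.names = FALSE)

names(df)                  # both entries are "a"
anyDuplicated(names(df))   # index of the first duplicated name

# Extraction by name silently returns the first match only,
# so the second column cannot be reached by its name.
first_only <- df[["a"]]
identical(first_only, 1:3)
```

The second column is still there, but it can only be addressed by position, which is easy to miss in a longer pipeline.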

I think there should at least be a warning if, after cleaning, some columns end up with the same name.

Alternatively, clean_data could detect the duplicated column names and append _1, _2 or similar to them, in addition to issuing the warning.
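
A minimal sketch of that suggestion in base R, assuming a hypothetical helper `dedupe_names()` (it is not part of linelist, and not necessarily how the package ended up handling this). It relies on base R's `make.unique()`, which leaves the first occurrence of a name untouched and appends a numeric suffix to later ones:

```r
# Hypothetical helper: warn about duplicated names, then make them unique.
dedupe_names <- function(x) {
  dups <- unique(x[duplicated(x)])
  if (length(dups) > 0) {
    warning(
      "Duplicated column names after cleaning: ",
      paste(dups, collapse = ", "),
      call. = FALSE
    )
    # Later occurrences get "_1", "_2", ... appended.
    x <- make.unique(x, sep = "_")
  }
  x
}

# Names as they might look after cleaning the example above.
cleaned <- c("sepal_length", "sepal_width", "species", "sepal_length")
new_names <- suppressWarnings(dedupe_names(cleaned))
new_names
```

With this approach only the repeated occurrence is renamed, so existing code that refers to the first `sepal_length` keeps working while the warning still flags the clash.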

zkamvar commented 4 years ago

Hi @ffinger, thank you for providing a simple reproducible example and a potential solution. I'll make a PR for this soon.