sfirke / janitor

simple tools for data cleaning in R
http://sfirke.github.io/janitor/

clean_names() creates duplicate names #251

Closed jzadra closed 4 years ago

jzadra commented 5 years ago

It seems that clean_names() can create duplicate names as a by-product of cleaning them. Running clean_names() a second time fixes it. It might be good to check that the cleaning has not produced duplicates and, if it has, re-run the process behind the scenes.

reprex:

require(tidyverse)
#> Loading required package: tidyverse
require(janitor)
#> Loading required package: janitor

w <- structure(list(unitid = "100690", f2a01 = "374175", f2a02 = "3123426", 
                    f2a03 = "1438541", f2a04 = "1522598", f2a05 = "162287", f2a06 = "1684885", 
                    f2a11 = "279448", f2a12 = "1028166", f2a13 = "1837548", f2a14 = "0", 
                    f2b01 = "2898929", f2b02 = "2814839", f2b03 = "0", f2b04 = "84090", 
                    f2b05 = "1600795", f2b06 = "0", f2b07 = "1684885", f2c01 = "223324", 
                    f2c02 = "11400", f2c03 = "0", f2c04 = "0", f2c05 = "0", f2c06 = "921452", 
                    f2c07 = "1156176", f2c08 = "921452", f2c09 = "0", f2d01 = "1987711", 
                    f2d02 = "0", f2d03 = "0", f2d04 = "0", f2d05 = "28770", f2d06 = "0", 
                    f2d07 = "0", f2d08 = "836984", f2d09 = "0", f2d10 = "28492", 
                    f2d11 = "0", f2d12 = "0", f2d13 = "0", f2d14 = "0", f2d15 = "16972", 
                    f2d16 = "2898929", f2e011 = "1242869", f2e012 = "696353", 
                    f2e021 = "0", f2e022 = "0", f2e031 = "0", f2e032 = "0", f2e041 = "84032", 
                    f2e042 = "58893", f2e051 = "279262", f2e052 = "175677", f2e061 = "1208676", 
                    f2e062 = "392969", f2e071 = "0", f2e072 = "0", f2e081 = "0", 
                    f2e082 = "0", f2e091 = "0", f2e092 = "0", f2e101 = "0", f2e102 = "0", 
                    f2e111 = "0", f2e112 = "0", f2e121 = "2814839", f2e122 = "1323892", 
                    f2e141 = "269211", f2e151 = "131249", f2e161 = "79703", f2e171 = "1010784", 
                    year = "1", f2e013 = NA_character_, f2e014 = NA_character_, 
                    f2e015 = NA_character_, f2e016 = NA_character_, f2e017 = NA_character_, 
                    f2e023 = NA_character_, f2e024 = NA_character_, f2e025 = NA_character_, 
                    f2e026 = NA_character_, f2e027 = NA_character_, f2e033 = NA_character_, 
                    f2e034 = NA_character_, f2e035 = NA_character_, f2e036 = NA_character_, 
                    f2e037 = NA_character_, f2e043 = NA_character_, f2e044 = NA_character_, 
                    f2e045 = NA_character_, f2e046 = NA_character_, f2e047 = NA_character_, 
                    f2e053 = NA_character_, f2e054 = NA_character_, f2e055 = NA_character_, 
                    f2e056 = NA_character_, f2e057 = NA_character_, f2e063 = NA_character_, 
                    f2e064 = NA_character_, f2e065 = NA_character_, f2e066 = NA_character_, 
                    f2e067 = NA_character_, f2e073 = NA_character_, f2e074 = NA_character_, 
                    f2e075 = NA_character_, f2e076 = NA_character_, f2e077 = NA_character_, 
                    f2e083 = NA_character_, f2e084 = NA_character_, f2e085 = NA_character_, 
                    f2e086 = NA_character_, f2e087 = NA_character_, f2e093 = NA_character_, 
                    f2e094 = NA_character_, f2e095 = NA_character_, f2e096 = NA_character_, 
                    f2e097 = NA_character_, f2e103 = NA_character_, f2e104 = NA_character_, 
                    f2e105 = NA_character_, f2e106 = NA_character_, f2e107 = NA_character_, 
                    f2e113 = NA_character_, f2e114 = NA_character_, f2e115 = NA_character_, 
                    f2e116 = NA_character_, f2e117 = NA_character_, f2e123 = NA_character_, 
                    f2e124 = NA_character_, f2e125 = NA_character_, f2e126 = NA_character_, 
                    f2e127 = NA_character_, f2e131 = NA_character_, f2e132 = NA_character_, 
                    f2e133 = NA_character_, f2e134 = NA_character_, f2e135 = NA_character_, 
                    f2e136 = NA_character_, f2e137 = NA_character_, f2h01 = NA_character_, 
                    f2h02 = NA_character_, f2fha = NA_character_, f2a05a = NA_character_, 
                    UNITID = NA_character_, F2A01 = NA_character_, F2A02 = NA_character_, 
                    F2A03 = NA_character_, F2A04 = NA_character_, F2A05 = NA_character_, 
                    F2A05A = NA_character_, F2A06 = NA_character_, F2A11 = NA_character_, 
                    F2A12 = NA_character_, F2A13 = NA_character_, F2A14 = NA_character_, 
                    F2B01 = NA_character_, F2B02 = NA_character_, F2B03 = NA_character_, 
                    F2B04 = NA_character_, F2B05 = NA_character_, F2B06 = NA_character_, 
                    F2B07 = NA_character_, F2C01 = NA_character_, F2C02 = NA_character_, 
                    F2C03 = NA_character_, F2C04 = NA_character_, F2C05 = NA_character_, 
                    F2C06 = NA_character_, F2C07 = NA_character_, F2C08 = NA_character_, 
                    F2C09 = NA_character_, F2D01 = NA_character_, F2D02 = NA_character_, 
                    F2D03 = NA_character_, F2D04 = NA_character_, F2D05 = NA_character_, 
                    F2D06 = NA_character_, F2D07 = NA_character_, F2D08 = NA_character_, 
                    F2D09 = NA_character_, F2D10 = NA_character_, F2D11 = NA_character_, 
                    F2D12 = NA_character_, F2D13 = NA_character_, F2D14 = NA_character_, 
                    F2D15 = NA_character_, F2D16 = NA_character_, F2E011 = NA_character_, 
                    F2E012 = NA_character_, F2E013 = NA_character_, F2E014 = NA_character_, 
                    F2E015 = NA_character_, F2E016 = NA_character_, F2E017 = NA_character_, 
                    F2E021 = NA_character_, F2E022 = NA_character_, F2E023 = NA_character_, 
                    F2E024 = NA_character_, F2E025 = NA_character_, F2E026 = NA_character_, 
                    F2E027 = NA_character_, F2E031 = NA_character_, F2E032 = NA_character_, 
                    F2E033 = NA_character_, F2E034 = NA_character_, F2E035 = NA_character_, 
                    F2E036 = NA_character_, F2E037 = NA_character_, F2E041 = NA_character_, 
                    F2E042 = NA_character_, F2E043 = NA_character_, F2E044 = NA_character_, 
                    F2E045 = NA_character_, F2E046 = NA_character_, F2E047 = NA_character_, 
                    F2E051 = NA_character_, F2E052 = NA_character_, F2E053 = NA_character_, 
                    F2E054 = NA_character_, F2E055 = NA_character_, F2E056 = NA_character_, 
                    F2E057 = NA_character_, F2E061 = NA_character_, F2E062 = NA_character_, 
                    F2E063 = NA_character_, F2E064 = NA_character_, F2E065 = NA_character_, 
                    F2E066 = NA_character_, F2E067 = NA_character_, F2E071 = NA_character_, 
                    F2E072 = NA_character_, F2E073 = NA_character_, F2E074 = NA_character_, 
                    F2E075 = NA_character_, F2E076 = NA_character_, F2E077 = NA_character_, 
                    F2E081 = NA_character_, F2E082 = NA_character_, F2E083 = NA_character_, 
                    F2E084 = NA_character_, F2E085 = NA_character_, F2E086 = NA_character_, 
                    F2E087 = NA_character_, F2E091 = NA_character_, F2E092 = NA_character_, 
                    F2E093 = NA_character_, F2E094 = NA_character_, F2E095 = NA_character_, 
                    F2E096 = NA_character_, F2E097 = NA_character_, F2E101 = NA_character_, 
                    F2E102 = NA_character_, F2E103 = NA_character_, F2E104 = NA_character_, 
                    F2E105 = NA_character_, F2E106 = NA_character_, F2E107 = NA_character_, 
                    F2E111 = NA_character_, F2E112 = NA_character_, F2E113 = NA_character_, 
                    F2E114 = NA_character_, F2E115 = NA_character_, F2E116 = NA_character_, 
                    F2E117 = NA_character_, F2E121 = NA_character_, F2E122 = NA_character_, 
                    F2E123 = NA_character_, F2E124 = NA_character_, F2E125 = NA_character_, 
                    F2E126 = NA_character_, F2E127 = NA_character_, F2E131 = NA_character_, 
                    F2E132 = NA_character_, F2E133 = NA_character_, F2E134 = NA_character_, 
                    F2E135 = NA_character_, F2E136 = NA_character_, F2E137 = NA_character_, 
                    F2H01 = NA_character_, F2H02 = NA_character_, F2FHA = NA_character_, 
                    F2A03A = NA_character_, F2A05B = NA_character_, F2A15 = NA_character_, 
                    F2A16 = NA_character_, F2A17 = NA_character_, F2A18 = NA_character_, 
                    F2A19 = NA_character_, F2A20 = NA_character_, F2D012 = NA_character_, 
                    F2D013 = NA_character_, F2D014 = NA_character_, F2D022 = NA_character_, 
                    F2D023 = NA_character_, F2D024 = NA_character_, F2D032 = NA_character_, 
                    F2D033 = NA_character_, F2D034 = NA_character_, F2D042 = NA_character_, 
                    F2D043 = NA_character_, F2D044 = NA_character_, F2D052 = NA_character_, 
                    F2D053 = NA_character_, F2D054 = NA_character_, F2D062 = NA_character_, 
                    F2D063 = NA_character_, F2D064 = NA_character_, F2D072 = NA_character_, 
                    F2D073 = NA_character_, F2D074 = NA_character_, F2D082 = NA_character_, 
                    F2D083 = NA_character_, F2D084 = NA_character_, F2D08A = NA_character_, 
                    F2D082A = NA_character_, F2D083A = NA_character_, F2D084A = NA_character_, 
                    F2D08B = NA_character_, F2D082B = NA_character_, F2D083B = NA_character_, 
                    F2D084B = NA_character_, F2D092 = NA_character_, F2D093 = NA_character_, 
                    F2D094 = NA_character_, F2D102 = NA_character_, F2D103 = NA_character_, 
                    F2D104 = NA_character_, F2D112 = NA_character_, F2D122 = NA_character_, 
                    F2D132 = NA_character_, F2D142 = NA_character_, F2D143 = NA_character_, 
                    F2D144 = NA_character_, F2D152 = NA_character_, F2D153 = NA_character_, 
                    F2D154 = NA_character_, F2D162 = NA_character_, F2D163 = NA_character_, 
                    F2D164 = NA_character_, F2D172 = NA_character_, F2D173 = NA_character_, 
                    F2D182 = NA_character_, F2D183 = NA_character_, F2D184 = NA_character_, 
                    F2D174 = NA_character_, F2D18 = NA_character_, I = NA_character_, 
                    F2D17 = NA_character_, F2C10 = NA_character_, `Tuition and fees  (net of allowances reported in student aid)` = NA_character_, 
                    `Federal appropriations` = NA_character_, `State appropriations` = NA_character_, 
                    `Local appropriations` = NA_character_, `Federal grants and contracts` = NA_character_, 
                    `State grants and contracts` = NA_character_, `Local grants and contracts` = NA_character_, 
                    `Private gifts, grants, contracts, and contributions from affiliated entities` = NA_character_, 
                    `Investment return (income, gains, and losses)` = NA_character_, 
                    `Sales and services of educational activities` = NA_character_, 
                    `Sales and services of auxiliary enterprises net allowances reported as student aid` = NA_character_, 
                    `Revenues from hospitals, independent operations and other sources` = NA_character_, 
                    `Total revenues and investment return` = NA_character_, `Student Aid: Pell grants` = NA_character_, 
                    `Student aid: Other federal grants` = NA_character_, `Student aid: State grants` = NA_character_, 
                    `Student aid: Local grants` = NA_character_, `Student aid: Institutional grants (funded)` = NA_character_, 
                    `Student aid: Institutional grants (unfunded)` = NA_character_, 
                    `Total Student Aid` = NA_character_, `Student aid: Portion of total student aid applied to tuition and fees` = NA_character_, 
                    `Student aid: Portion of total student aid applied to auxiliary enterprises` = NA_character_, 
                    Instruction = NA_character_, Research = NA_character_, `Public service` = NA_character_, 
                    `Academic support` = NA_character_, `Student services` = NA_character_, 
                    `Institutional support` = NA_character_, `Auxiliary enterprises` = NA_character_, 
                    `Scholarships and fellowships` = NA_character_, `Hospital services and independent operations (1997 only)` = NA_character_, 
                    `Operations and maintenance of plant` = NA_character_, `Total expenses` = NA_character_, 
                    `Total revenues and investment return` = NA_character_, `Total expenses` = NA_character_, 
                    `Other changes in net assets` = NA_character_, `Change in net assets` = NA_character_, 
                    `Net assets, beginning of the year` = NA_character_, `Adjustments to beginning net assets` = NA_character_, 
                    `Net assets, end of the year` = NA_character_, `Long-term investments` = NA_character_, 
                    `Total assets` = NA_character_, `Total liabilities` = NA_character_, 
                    `Total unrestricted net assets` = NA_character_, `Total restricted net assets` = NA_character_, 
                    `Total net assets` = NA_character_, `Land and land improvements-beginning of year` = NA_character_, 
                    `Buildings - beginning of year` = NA_character_, `Equipment, including art and library collections, beginning of year` = NA_character_, 
                    `Beginning property under capital leases (not included in equipment)` = NA_character_, 
                    f2a01_1 = NA_character_, f2a02_1 = NA_character_, f2a04_1 = NA_character_, 
                    f2a05_1 = NA_character_, f2a06_1 = NA_character_, f2a07_1 = NA_character_, 
                    f2a08_1 = NA_character_, f2a09_1 = NA_character_, f2a10_1 = NA_character_, 
                    f2a11_1 = NA_character_, f2a12_1 = NA_character_, f2a13_1 = NA_character_, 
                    f2a14_1 = NA_character_, f2a15_1 = NA_character_, f2a16_1 = NA_character_, 
                    f2a17_1 = NA_character_, f2aa01 = NA_character_, f2aa02 = NA_character_, 
                    f2aa03 = NA_character_, f2aa04 = NA_character_, f2aa05 = NA_character_, 
                    f2aa06 = NA_character_, f2aa07 = NA_character_, f2aa08 = NA_character_, 
                    f2aa09 = NA_character_, f2b01_1 = NA_character_, f2b01_2 = NA_character_, 
                    f2b02_1 = NA_character_, f2b02_2 = NA_character_, f2b03_1 = NA_character_, 
                    f2b03_2 = NA_character_, f2b04_1 = NA_character_, f2b04_2 = NA_character_, 
                    f2b05_1 = NA_character_, f2b05_2 = NA_character_, f2b06_1 = NA_character_, 
                    f2b06_2 = NA_character_, f2b07_1 = NA_character_, f2b07_2 = NA_character_, 
                    f2b08_1 = NA_character_, f2b08_2 = NA_character_, f2b09_1 = NA_character_, 
                    f2b09_2 = NA_character_, f2b10_1 = NA_character_, f2b10_2 = NA_character_, 
                    f2b11_2 = NA_character_, f2b12_1 = NA_character_, f2b12_2 = NA_character_, 
                    f2b12_3 = NA_character_, f2b12_5 = NA_character_, f2b12_6 = NA_character_, 
                    f2b12_7 = NA_character_, f2c0308 = NA_character_, f2c10 = NA_character_, 
                    f2c11 = NA_character_, f2c12 = NA_character_, f2d17 = NA_character_, 
                    f2d20 = NA_character_, f2d23 = NA_character_, f2d24 = NA_character_, 
                    f2dc014 = NA_character_, f2dc024 = NA_character_, f2dc034 = NA_character_, 
                    f2dc044 = NA_character_), row.names = c(NA, -1L), class = c("tbl_df", 
                                                                                "tbl", "data.frame"))

v <- w %>% clean_names()

names(v)[duplicated(names(v))]
#> [1] "f2b01_2" "f2b02_2" "f2b03_2" "f2b04_2" "f2b05_2" "f2b06_2" "f2b07_2"

u <- v %>% clean_names()

names(u)[duplicated(names(u))]
#> character(0)

Created on 2018-11-16 by the reprex package (v0.2.1)
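The "check, then re-run" idea above can be sketched in base R. This is only an illustration: `suffix_duplicates()` and `dedupe_until_unique()` are hypothetical names, and the suffixing rule (append `_<n>` to the n-th occurrence of a name) is an assumption about how duplicates get numbered, not janitor's actual code path.

```r
# Hypothetical stand-in for the duplicate-numbering step (an assumption,
# not janitor's API): the n-th occurrence of a name gets the suffix "_n".
suffix_duplicates <- function(nms) {
  dupe_count <- vapply(seq_along(nms), function(i) {
    sum(nms[i] == nms[1:i])
  }, integer(1))
  nms[dupe_count > 1] <- paste(nms[dupe_count > 1],
                               dupe_count[dupe_count > 1], sep = "_")
  nms
}

# Re-apply the numbering step until no duplicates remain, with a guard
# against looping forever on pathological input.
dedupe_until_unique <- function(nms, max_passes = 10L) {
  for (pass in seq_len(max_passes)) {
    if (!anyDuplicated(nms)) return(nms)
    nms <- suffix_duplicates(nms)
  }
  if (anyDuplicated(nms)) {
    stop("names still duplicated after ", max_passes, " passes")
  }
  nms
}

dedupe_until_unique(c("a1", "a1", "a1_2"))
#> [1] "a1"     "a1_2"   "a1_2_2"
```

The result is unique, but a name such as `a1_2_2` shows the cost of this approach: every extra pass stacks another suffix onto an already-suffixed name.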

sfirke commented 5 years ago

Fascinating! No one has ever pointed this out. Thank you for the reprex, I will look into it in the future (things are crazy right now).

sfirke commented 5 years ago

Here's a smaller reprex:

y <- data.frame(
  a1 = 3,
  A1 = 1,
  a1_2 = 2
)

janitor::clean_names(y)
#>   a1 a1_2 a1_2
#> 1  3    1    2
Created on 2018-11-16 by the reprex package (v0.2.1.9000)

I don't like the duplicate names in the result.

... But I'm not sure what the ideal result would be. a1 and a1_2 feels right for the first two. And then a1_2 feels right for the third one, too. Should it become a1_3? That might be more confusing. I'm tempted to say the underlying names should be clarified before clean_names() in rare cases like this. What do you think @jzadra and others?

almartin82 commented 5 years ago

This is a great discussion! I think that the behavior of the renamer should change if it detects a collision. So in Sam's reprex the second column would become a1_a or a1_01 because it knows that mapping to a1_2 would create a collision.
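A collision-aware renamer along these lines might look like the sketch below. `dedupe_collision_aware()` is a hypothetical helper, not part of janitor, and it uses a plain numeric counter rather than the a1_a or a1_01 forms mentioned above; the point is only that a generated name is skipped whenever it is already taken.

```r
# Hypothetical helper (not janitor's API): number duplicates, but skip any
# generated name that already exists among the inputs or the names emitted
# so far.
dedupe_collision_aware <- function(nms, sep = "_") {
  out <- character(0)
  for (nm in nms) {
    candidate <- nm
    n <- 1L
    if (nm %in% out) {
      # nm was already used: try nm_2, nm_3, ... until one is free
      repeat {
        n <- n + 1L
        candidate <- paste(nm, n, sep = sep)
        if (!candidate %in% c(nms, out)) break
      }
    }
    out <- c(out, candidate)
  }
  out
}

# Names as they look after case conversion in the small reprex
dedupe_collision_aware(c("a1", "a1", "a1_2"))
#> [1] "a1"   "a1_3" "a1_2"
```

Because a generated name is never allowed to equal any input name, the column that was originally a1_2 keeps its name and the second a1 moves to the next free suffix.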

billdenney commented 5 years ago

I like @almartin82's idea of different behavior when a collision occurs (my vote would be for the a1_01 option).

jzadra commented 5 years ago

Thanks for the smaller reprex, apologies for my laziness! My data has some very odd names that just happen to cause this with janitor so I'm not surprised it hasn't come up before.

Here are my thoughts on potential behavior options

Tazinho commented 5 years ago

Great catch! I also didn't see that coming.

The underlying snakecase package provides a unique_sep argument to handle these cases. If set, this triggers a call of string <- make.unique(string, sep = unique_sep) before return.

make.unique() guarantees that its result is always unique, e.g.:

y <- data.frame(
  a1 = 3,
  A1 = 1,
  a1_2 = 2
)

library(janitor)
clean_names(y)
#>   a1 a1_2 a1_2
#> 1  3    1    2

library(magrittr)
library(snakecase)
names(y) %>% to_snake_case(unique_sep = "_", numerals = "asis")
#> [1] "a1"   "a1_1" "a1_2"

Created on 2018-11-19 by the reprex package (v0.2.0).

However, the difference in the ordinary case is that the first duplicate's suffix is _1 instead of the _2 that janitor uses. As janitor's numbering might be intentional, and for backward compatibility, I never touched this behaviour.

However, one straightforward way to solve this issue would be to turn off janitor's current handling and supply unique_sep = "_" in the underlying to_any_case(), or equivalently change

dupe_count <- vapply(seq_along(new_names), function(i) {
  sum(new_names[i] == new_names[1:i])
}, integer(1))
new_names[dupe_count > 1] <- paste(
  new_names[dupe_count > 1], dupe_count[dupe_count > 1], sep = "_"
)
new_names

into

make.unique(new_names)

I am well aware that there might not be a perfect (and easy) solution, as the output can always be confusing when the input is

y2 <- data.frame(
  a1 = 3,
  A1 = 1,
  A1_ = 1,
  a1_2 = 2
)

library(janitor)
clean_names(y2)
#>   a1 a1_2 a1_3 a1_2
#> 1  3    1    1    2

library(magrittr)
library(snakecase)
names(y2) %>% to_snake_case(unique_sep = "_", numerals = "asis")
#> [1] "a1"   "a1_1" "a1_3" "a1_2"

Created on 2018-11-19 by the reprex package (v0.2.0).
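
To make the comparison easier to reproduce without either package, the counting snippet quoted earlier in this comment can be wrapped in a small function and run next to make.unique(). This is only a sketch: `suffix_by_count` is a hypothetical name, and `cased` is typed in by hand as an assumed stand-in for the names of y2 after case conversion.

```r
# Hypothetical wrapper around the counting snippet quoted above (base R only).
suffix_by_count <- function(new_names) {
  # For each position, count how often that name has appeared so far.
  dupe_count <- vapply(seq_along(new_names), function(i) {
    sum(new_names[i] == new_names[1:i])
  }, integer(1))
  # Second and later occurrences get "_<count>" appended, in a single pass.
  new_names[dupe_count > 1] <- paste(
    new_names[dupe_count > 1], dupe_count[dupe_count > 1], sep = "_"
  )
  new_names
}

# Assumed stand-in for names(y2) after case conversion.
cased <- c("a1", "a1", "a1", "a1_2")

suffix_by_count(cased)
#> [1] "a1"   "a1_2" "a1_3" "a1_2"

make.unique(cased, sep = "_")
#> [1] "a1"   "a1_1" "a1_3" "a1_2"
```

The single counting pass never checks its new suffixes against names that already exist, which is where the second "a1_2" comes from.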

Tazinho commented 5 years ago

@jzadra It would be easy to set the default handling within janitor like this:

names(y2) %>% to_snake_case(unique_sep = "_duplicate_", numerals = "asis")
#> [1] "a1"             "a1_duplicate_1" "a1_duplicate_2" "a1_2"

Tazinho commented 5 years ago

@sfirke: any decision here? The implementation of the last suggestions would be straightforward. What do u think?

GegznaV commented 5 years ago

make_clean_names() also fails to prevent duplicates in some situations, and unique_sep = "_duplicate_" seems to be quite a robust solution. Maybe even unique_sep = "_" would be sufficient. Are there any updates on this issue?


x <-  c("name with space", "TwoWords", "TwoWords", "TwoWords_2", "TwoWords_1",
        "TwoWords_2", "TwoWords_3", "total $ (2009)")

x1 <- janitor::make_clean_names(x)

x1
#> [1] "name_with_space" "two_words"       "two_words_2"     "two_words_2"    
#> [5] "two_words_1"     "two_words_2_2"   "two_words_3"     "total_2009"

any(duplicated(x1))
#> [1] TRUE

x2 <- snakecase::to_snake_case(x, unique_sep = "_", numerals = "asis")

x2
#> [1] "name_with_space" "two_words"       "two_words_4"     "two_words_2"    
#> [5] "two_words_1"     "two_words_2_1"   "two_words_3"     "total_2009"

any(duplicated(x2))
#> [1] FALSE

Created on 2019-05-07 by the reprex package (v0.2.1)

sfirke commented 4 years ago

I agree with @GegznaV that this should be fixed, clean_names should simply not return duplicated names. The suggestions above for using the unique_sep argument of to_any_case are great and now very easy to implement since @billdenney exposed all of the snakecase::to_any_case arguments via ... !

Anyone who gets this, please share quick thoughts if you have them re: whether you think it should be

It would be added as a default argument to clean_names, changeable by the user.

y <- data.frame(
    a1 = 3,
    A1 = 1,
    a1_2 = 2
)

> y %>% clean_names
  a1 a1_2 a1_2
1  3    1    2
> y %>% clean_names(unique_sep = "_")
  a1 a1_1 a1_2
1  3    1    2
> y %>% clean_names(unique_sep = "_duplicate_")
  a1 a1_duplicate_1 a1_2
1  3              1    2

I am dreaming of submitting to CRAN by end of day tomorrow Sunday, East Coast USA time -- 24 hours from now. I know that's super fast turnaround but it's the combination of having little time to work on this AND pressure from CRAN b/c of other package updates and a base R update.

billdenney commented 4 years ago

I think that an easier solution than unique_sep may be to just repeat the duplicate checking until no columns are duplicated. That may yield names like a_1_1_1_1_1, but that seems like a reasonable solution for this type of case.

I'm not opposed to unique_sep, but if we implement it without confirming uniqueness, then at some point a person will have a column name clash with whatever is selected for unique_sep. If we don't do recursive duplicate checking, then we would need a duplicate check with a warning or error to say that a new value of unique_sep must be selected for deduplication.
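
A rough base R sketch of the repeat-until-unique idea, assuming the per-pass logic is the count-and-suffix step quoted earlier in this thread. The function names `suffix_by_count` and `dedupe_until_unique` are hypothetical, and this is not the code that ended up in janitor.

```r
# Hypothetical single-pass step: suffix repeats with their running count.
suffix_by_count <- function(new_names) {
  dupe_count <- vapply(seq_along(new_names), function(i) {
    sum(new_names[i] == new_names[1:i])
  }, integer(1))
  new_names[dupe_count > 1] <- paste(
    new_names[dupe_count > 1], dupe_count[dupe_count > 1], sep = "_"
  )
  new_names
}

# Hypothetical wrapper: repeat the pass for as long as any duplicate remains.
dedupe_until_unique <- function(new_names) {
  while (anyDuplicated(new_names) > 0) {
    new_names <- suffix_by_count(new_names)
  }
  new_names
}

# Assumed stand-in for the case-converted names of the small example.
dedupe_until_unique(c("a1", "a1", "a1_2"))
#> [1] "a1"     "a1_2"   "a1_2_2"
```

The first pass turns the second "a1" into "a1_2", which clashes with the existing "a1_2"; the second pass then renames the later of the two to "a1_2_2", which is the kind of stacked suffix mentioned above.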

sfirke commented 4 years ago

Ooh I like that recursive idea and it will have less breakage on existing janitor code. Do you have bandwidth to send a PR with that, or if not, describe here how you'd implement?

billdenney commented 4 years ago

I don't have the bandwidth for a PR today, my wife is working at the hospital and I have the kids (maybe tomorrow I could). I think that it would be as simple as wrapping https://github.com/sfirke/janitor/blob/576deb8beb5c546776a4ed954d4094380e701e04/R/make_clean_names.R#L135 to line 148 in

while (any(duplicated(cased_names))) {
  # insert existing code here
}

billdenney commented 4 years ago

Now the kids are playing happily together... If you get a PR in the next ~20 minutes, I had time. If you see nothing in an hour: child care.

billdenney commented 4 years ago

... and 17 minutes later there is a PR. :)

(And the kids now will probably need me for the rest of the day...)

sfirke commented 4 years ago

You're the best! A while loop is perfect.
