tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org

vroom fails to detect delimiter for some CSV file #498

Closed: hidekoji closed this issue 1 year ago

hidekoji commented 1 year ago

vroom::vroom() fails to detect the delimiter for the following CSV file.

vroom::vroom(file = "https://www.dropbox.com/s/pben6bg3n7igoy8/member_list_test2.csv?dl=1")
#> Warning in gsub("\"[^\"]*\"", "", lines): unable to translate
#> 'SG000000,TEST<83>X<81>[<83>p<81>[_A<93>X_<83>J<83><8c><81>[,,test@test.com,,,,,,,,,,,,,,,,,999999,999999001,999999201,,,,,746,,,'
#> to a wide string
#> Error in gsub("\"[^\"]*\"", "", lines): input string 2 is invalid

Created on 2023-05-29 with reprex v2.0.2

It works if I explicitly set the delim argument to ",".

vroom::vroom(file = "https://www.dropbox.com/s/pben6bg3n7igoy8/member_list_test2.csv?dl=1", delim = ",")
#> Rows: 8 Columns: 31
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (5): login_id, shop_member_name, mail, uniqueDeviceId, description
#> dbl  (4): signage_chain_code, signage_shop_code, signage_location_code, stb_...
#> lgl (22): password, mail_flg_chr, department, officer_name, post_code, addre...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 8 × 31
#>   login_id shop_member_name  password mail  mail_flg_chr department officer_name
#>   <chr>    <chr>             <lgl>    <chr> <lgl>        <lgl>      <lgl>       
#> 1 SG000000 "TEST\x83X\x81[\… NA       test… NA           NA         NA          
#> 2 SG000001 "TEST\x83X\x81[\… NA       test… NA           NA         NA          
#> 3 SG000002 "TEST\x83X\x81[\… NA       test… NA           NA         NA          
#> 4 SG000003 "TEST\x83X\x81[\… NA       <NA>  NA           NA         NA          
#> 5 SG000004 "TEST\x83X\x81[\… NA       <NA>  NA           NA         NA          
#> 6 SG000005 "TEST\x83X\x81[\… NA       test… NA           NA         NA          
#> 7 SG000006 "TEST\x83X\x81[\… NA       <NA>  NA           NA         NA          
#> 8 SG000007 "TEST\x83X\x81[\… NA       <NA>  NA           NA         NA          
#> # ℹ 24 more variables: post_code <lgl>, address1 <lgl>, address2 <lgl>,
#> #   tel1 <lgl>, tel2 <lgl>, access_ip_address <lgl>, maker_code <lgl>,
#> #   delivery_chain_and_location_codes <lgl>, delivery_shop_code <lgl>,
#> #   shop_chain_code <lgl>, shop_location_codes <lgl>, opening_time <lgl>,
#> #   closing_time <lgl>, signage_chain_code <dbl>, signage_shop_code <dbl>,
#> #   signage_location_code <dbl>, uniqueDeviceId <chr>,
#> #   signage_equipment_id <lgl>, chain_code <lgl>, chain_location_codes <lgl>, …

Created on 2023-05-29 with reprex v2.0.2
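The escaped bytes in the warning and in shop_member_name (\x83X\x81[\x83p\x81[) look like Shift-JIS text rather than UTF-8, which would explain why gsub() rejects the line during delimiter guessing. This is an inference from the byte values, not something confirmed in the thread. The sketch below checks that guess in base R and then, assuming the guess holds, reads a hypothetical local copy of the file (csv_path) with an explicit encoding through vroom::locale().

```r
# Bytes copied from the warning: "\x83X\x81[\x83p\x81["
# (X = 0x58, [ = 0x5B, p = 0x70).
name_bytes <- as.raw(c(0x83, 0x58, 0x81, 0x5B, 0x83, 0x70, 0x81, 0x5B))
name_raw <- rawToChar(name_bytes)

# These bytes are not valid UTF-8, so regex functions such as gsub()
# cannot process the line in a UTF-8 session.
is_valid_utf8 <- validUTF8(name_raw)

# Assumption: the file is Shift-JIS encoded. Under that assumption the
# bytes convert cleanly to UTF-8 katakana.
name_utf8 <- iconv(name_raw, from = "Shift-JIS", to = "UTF-8")

# Hypothetical local copy of the file from the issue; adjust the path.
csv_path <- "member_list_test2.csv"
if (requireNamespace("vroom", quietly = TRUE) && file.exists(csv_path)) {
  members <- vroom::vroom(
    csv_path,
    delim = ",",
    locale = vroom::locale(encoding = "Shift-JIS")
  )
}
```

If the encoding guess is right, shop_member_name should come back as readable Japanese text instead of escaped bytes. Passing delim = "," also skips delimiter guessing, which is the step that raised the error.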

hadley commented 1 year ago

This works for me:

vroom::vroom(file = "https://www.dropbox.com/s/pben6bg3n7igoy8/member_list_test2.csv?dl=1")
#> Rows: 8 Columns: 31
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (5): login_id, shop_member_name, mail, uniqueDeviceId, description
#> dbl  (4): signage_chain_code, signage_shop_code, signage_location_code, stb_...
#> lgl (22): password, mail_flg_chr, department, officer_name, post_code, addre...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 8 × 31
#>   login_id shop_member_name  password mail  mail_flg_chr department officer_name
#>   <chr>    <chr>             <lgl>    <chr> <lgl>        <lgl>      <lgl>       
#> 1 SG000000 "TEST\x83X\x81[\… NA       test… NA           NA         NA          
#> 2 SG000001 "TEST\x83X\x81[\… NA       test… NA           NA         NA          
#> 3 SG000002 "TEST\x83X\x81[\… NA       test… NA           NA         NA          
#> 4 SG000003 "TEST\x83X\x81[\… NA       <NA>  NA           NA         NA          
#> 5 SG000004 "TEST\x83X\x81[\… NA       <NA>  NA           NA         NA          
#> 6 SG000005 "TEST\x83X\x81[\… NA       test… NA           NA         NA          
#> 7 SG000006 "TEST\x83X\x81[\… NA       <NA>  NA           NA         NA          
#> 8 SG000007 "TEST\x83X\x81[\… NA       <NA>  NA           NA         NA          
#> # ℹ 24 more variables: post_code <lgl>, address1 <lgl>, address2 <lgl>,
#> #   tel1 <lgl>, tel2 <lgl>, access_ip_address <lgl>, maker_code <lgl>,
#> #   delivery_chain_and_location_codes <lgl>, delivery_shop_code <lgl>,
#> #   shop_chain_code <lgl>, shop_location_codes <lgl>, opening_time <lgl>,
#> #   closing_time <lgl>, signage_chain_code <dbl>, signage_shop_code <dbl>,
#> #   signage_location_code <dbl>, uniqueDeviceId <chr>,
#> #   signage_equipment_id <lgl>, chain_code <lgl>, chain_location_codes <lgl>, …

Created on 2023-08-01 with reprex v2.0.2