joelgombin / banR

R client for the BAN API
http://joelgombin.github.io/banR/
GNU General Public License v3.0

Searching non-ASCII characters in geocode_tbl() results in Unicode characters in the result #35

Open JMPivette opened 3 years ago

JMPivette commented 3 years ago

I guess this issue is not directly linked to banR but to the underlying API.

If one of the searched addresses contains non-ASCII characters, the results come back with escaped raw bytes instead of properly decoded UTF-8 text (\xe2 instead of â, for example).

In the following example, searching for évron instead of evron changes the encoding of the result for my second search (Chatelaillon).

location_tbl <- tibble::tibble(city = c("évron", "Chatelaillon"))
banR::geocode_tbl(location_tbl, city) 
#> Writing tempfile to.../var/folders/dc/9dbfr9sx23jcx1tmfdlxqr3m0000gq/T//RtmpyiSyId/filef0946d58acb.csv
#> If file is larger than 8 MB, it must be splitted
#> Size is : 25 bytes
#> SuccessOKSuccess: (200) OK
#> # A tibble: 2 x 17
#>   city   latitude longitude result_label      result_score result_type result_id
#>   <chr>     <dbl>     <dbl> <chr>                    <dbl> <chr>       <chr>    
#> 1 évron      45.5      4.57 "Rue a C Victime…         0.2  street      42103_o6…
#> 2 Chate…     46.1     -1.09 "Ch\xe2telaillon…         0.62 municipali… 17094    
#> # … with 10 more variables: result_housenumber <chr>, result_name <chr>,
#> #   result_street <chr>, result_postcode <chr>, result_city <chr>,
#> #   result_context <chr>, result_citycode <chr>, result_oldcitycode <chr>,
#> #   result_oldcity <chr>, result_district <chr>

location_tbl <- tibble::tibble(city = c("evron", "Chatelaillon"))
banR::geocode_tbl(location_tbl, city)
#> Writing tempfile to.../var/folders/dc/9dbfr9sx23jcx1tmfdlxqr3m0000gq/T//RtmpyiSyId/filef096d8b39c1.csv
#> If file is larger than 8 MB, it must be splitted
#> Size is : 24 bytes
#> SuccessOKSuccess: (200) OK
#> # A tibble: 2 x 17
#>   city     latitude longitude result_label    result_score result_type result_id
#>   <chr>       <dbl>     <dbl> <chr>                  <dbl> <chr>       <chr>    
#> 1 evron        48.1    -0.425 Évron                   0.94 municipali… 53097    
#> 2 Chatela…     46.1    -1.09  Châtelaillon-P…         0.62 municipali… 17094    
#> # … with 10 more variables: result_housenumber <chr>, result_name <chr>,
#> #   result_street <chr>, result_postcode <chr>, result_city <chr>,
#> #   result_context <chr>, result_citycode <chr>, result_oldcitycode <chr>,
#> #   result_oldcity <chr>, result_district <chr>
JMPivette commented 3 years ago

I found the underlying issue here: https://github.com/etalab/adresse.data.gouv.fr/issues/622

So it happens only when the tibble has fewer than 5 rows and contains non-ASCII characters.
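
To spot affected results after the fact, one option is to check which strings are not valid UTF-8 and re-encode only those. This is a minimal sketch in base R: `repair_latin1()` is a hypothetical helper, and it assumes the stray bytes are Latin-1 (0xE2 is â in Latin-1), which the linked issue does not confirm.

```r
# A label as it appears in the output above: byte 0xE2 is "â" in Latin-1,
# but it is not a valid sequence on its own in UTF-8.
bad <- "Ch\xe2telaillon"

# Hypothetical helper: re-encode only the elements that are not valid UTF-8,
# ASSUMING their bytes are Latin-1 (an assumption, not confirmed by the API).
repair_latin1 <- function(x) {
  broken <- !is.na(x) & !validUTF8(x)
  x[broken] <- iconv(x[broken], from = "latin1", to = "UTF-8")
  x
}

labels <- c(bad, "Rue a C Victime")
print(validUTF8(labels))          # which labels are already valid UTF-8

fixed <- repair_latin1(labels)
print(validUTF8(fixed))           # all labels should now be valid
```

In practice this would be applied to the character columns of the tibble returned by geocode_tbl(), for example result_label. It only repairs how the text is displayed. It does not change which address the API matched.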

For information, my workaround so far is to transliterate my search terms to ASCII using stringi::stri_trans_general(id = "Latin-ASCII").
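
For reference, the workaround can be sketched as follows. The variable names are illustrative, and the geocoding call is left commented out because it needs banR and network access.

```r
# "évron" is written with a Unicode escape so the script does not depend
# on the encoding of the source file.
cities <- c("\u00e9vron", "Chatelaillon")

# Strip accents and other Latin diacritics before sending the query.
cities_ascii <- stringi::stri_trans_general(cities, id = "Latin-ASCII")
print(cities_ascii)

# Geocode the transliterated names (requires banR and network access):
# location_tbl <- tibble::tibble(city = cities_ascii)
# banR::geocode_tbl(location_tbl, city)
```

The query then contains only ASCII characters, so the short-input encoding problem is not triggered. In the report above, the unaccented query evron still matched Évron with a score of 0.94.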