ropensci / elastic

R client for the Elasticsearch HTTP API
https://docs.ropensci.org/elastic

Invalid UTF-8 start byte error in docs_bulk.data.frame #223

Closed Lchiffon closed 6 years ago

Lchiffon commented 6 years ago

Elasticsearch version: 6.3.0

Session Info

```r
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] elastic_0.8.2.9326

loaded via a namespace (and not attached):
[1] httr_1.2.1     compiler_3.4.1 R6_2.2.2       tools_3.4.1
[5] yaml_2.1.14    curl_3.0       jsonlite_1.5
```

On Windows there are encoding problems: the default Windows encoding for Chinese characters is GB2312, but Elasticsearch uses UTF-8. When the data contains a character object encoded in GB2312, docs_bulk returns an "Invalid UTF-8 start byte" error like the following:

Code:

library(elastic)
connect()

a = data.frame(a= '测试', b = 123)
elastic::index_create(index = "dianping", verbose = TRUE)
elastic::docs_bulk(a, index = "dianping")

Output:

[[1]]
[[1]]$took
[1] 1

[[1]]$errors
[1] TRUE

[[1]]$items
[[1]]$items[[1]]
[[1]]$items[[1]]$index
[[1]]$items[[1]]$index$`_index`
[1] "dianping"

[[1]]$items[[1]]$index$`_type`
[1] "dianping"

[[1]]$items[[1]]$index$`_id`
[1] "bWU6F2QBfgMgeBf7pgo-"

[[1]]$items[[1]]$index$status
[1] 400

[[1]]$items[[1]]$index$error
[[1]]$items[[1]]$index$error$type
[1] "mapper_parsing_exception"

[[1]]$items[[1]]$index$error$reason
[1] "failed to parse"

[[1]]$items[[1]]$index$error$caused_by
[[1]]$items[[1]]$index$error$caused_by$type
[1] "json_parse_exception"

[[1]]$items[[1]]$index$error$caused_by$reason
[1] "Invalid UTF-8 start byte 0xb2\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@220e59db; line: 1, column: 8]"
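
The `0xb2` in that message matches the first GB2312 byte of '测'. A minimal sketch in base R that shows this, assuming the platform's iconv recognises the encoding name "GB2312" (supported names vary by system); the variable names are only illustrative:

```r
# "测试" written with Unicode escapes so the script does not depend on file encoding
x <- "\u6d4b\u8bd5"

# Bytes of the string when encoded as GB2312
# (assumption: this iconv build knows the name "GB2312")
gb_bytes <- iconv(x, from = "UTF-8", to = "GB2312", toRaw = TRUE)[[1]]
print(gb_bytes)
# [1] b2 e2 ca d4

# In UTF-8 a start byte looks like 0xxxxxxx or 11xxxxxx;
# a byte of the form 10xxxxxx can only be a continuation byte
first <- as.integer(gb_bytes[1])
is_continuation <- bitwAnd(first, 0xC0L) == 0x80L
print(is_continuation)
# [1] TRUE
```

Because `0xb2` has the bit pattern `10110010`, a UTF-8 parser cannot accept it as the start of a character, which is consistent with the `json_parse_exception` above.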
sckott commented 6 years ago

thanks! looking at the PR

sckott commented 6 years ago

fixed in #224