paws-r / paws

Paws, a package for Amazon Web Services in R
https://www.paws-r-sdk.com

Kinesis client putrecord issue? #396

Open edgBR opened 3 years ago

edgBR commented 3 years ago

Dear colleagues,

I am trying to run an R script on an EC2 instance that I am using as a producer.

The code is as follows and is based on the code from the A Cloud Guru ML specialist course (https://github.com/ACloudGuru-Resources/Course_AWS_Certified_Machine_Learning/blob/master/Chapter3/put-record-python-program.py):

library(httr)
library(paws)
library(jsonlite)
library(lubridate)
library(uuid)
library(dplyr)

client = paws::kinesis(config = list(region = "us-east-1"))
partition_key <- uuid::UUIDgenerate(n=1)

# Added 08/2020 since randomuser.me is starting to throttle API calls
# The following code loads 500 random users into memory
number_of_results <- 500

request <-
  httr::GET(url = paste0('https://randomuser.me/api/?exc=login&results=', number_of_results))
data <- request %>% content()
data <- data$results

while (TRUE) {
  # The following chooses a random user from the 500 random users pulled from the API in a single API call.
  random_user_index <-
    runif(n = 1, min = 0, max = number_of_results - 1) %>% as.integer()
  random_user <- data[random_user_index]
  random_user <- toJSON(random_user)
  client$put_record(StreamName = "my_stream",
                    Data = random_user,
                    PartitionKey = partition_key)
  Sys.sleep(runif(n = 1, min = 0, max = 1))

}

However, I am getting the following error:

Error in file(what, "rb") : cannot open the connection
Calls: <Anonymous> ... convert_blob -> raw_to_base64 -> <Anonymous> -> file
In addition: Warning message:
In file(what, "rb") :
  cannot open file '[{"gender":["female"],"name":{"title":["Madame"],"first":["Alisha"],"last":["Denis"]},"location":{"street":{"number":[7221],"name":["Rue du Village"]},"city":["Hüttlingen"],"state":["Basel-Landschaft"],"country":["Switzerland"],"postcode":[6851],"coordinates":{"latitude":["-25.4605"],"longitude":["88.3460"]},"timezone":{"offset":["+6:00"],"description":["Almaty, Dhaka, Colombo"]}},"email":["alisha.denis@example.com"],"dob":{"date":["1947-06-02T01:05:06.443Z"],"age":[74]},"registered":{"date":["2019-02-10T05:22:41.058Z"],"age":[2]},"phone":["077 098 25 01"],"cell":["075 948 41 92"],"id":{"name":["AVS"],"value":["756.4495.7678.82"]},"picture":{"large":["https://randomuser.me/api/portraits/women/75.jpg"],"medium":["https://randomuser.me/api/portraits/med/women/75.jpg"],"thumbnail":["https://randomuser.me/api/portraits/thumb/women/75.jpg"]},"nat":["CH"]}]': File name too long
Execution halted

The second line of the error makes me wonder whether I need an intermediate file for this operation. I have also tried jsonlite::base64_enc, but that does not work either.

Could someone point out what the issue is here?

Attaching sessionInfo:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-koji-linux-gnu (64-bit)
Running under: Amazon Linux 2

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_1.0.5      uuid_0.1-4       lubridate_1.7.10 jsonlite_1.7.2
[5] paws_0.1.10      httr_1.4.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6       fansi_0.4.2      crayon_1.4.1     utf8_1.1.4
 [5] R6_2.5.0         lifecycle_1.0.0  magrittr_2.0.1   pillar_1.5.1
 [9] rlang_0.4.10     vctrs_0.3.6      generics_0.1.0   ellipsis_0.3.1
[13] glue_1.4.2       purrr_0.3.4      compiler_4.0.2   pkgconfig_2.0.3
[17] tidyselect_1.1.0 tibble_3.1.0
>

davidkretch commented 3 years ago

Sorry about that. Currently, the Data parameter expects a binary object (a raw vector in R), e.g. charToRaw(random_user), but we'll try to fix it so that it works like the Python SDK. The update won't be on CRAN for a couple of weeks, though, because we just released an update and we're limited to one update every 30 days.
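As a minimal sketch of that workaround: the JSON string is converted to a raw vector with charToRaw() before it is passed as Data. The sample JSON and the helper name put_json_record are hypothetical and only for illustration. The call trace in the error (convert_blob -> raw_to_base64 -> file) suggests the string was being treated as a file path, and that Paws does the base64 encoding itself, so calling jsonlite::base64_enc beforehand should not be needed.

```r
# Stand-in for the output of toJSON(random_user); hypothetical sample data.
json_text <- '{"gender":"female","nat":"CH"}'

# Convert the string to a raw vector, which is what Data currently expects.
payload <- charToRaw(json_text)

# Hypothetical helper wrapping the put_record call from the original script.
put_json_record <- function(client, stream_name, json_text, partition_key) {
  client$put_record(
    StreamName = stream_name,
    # as.character() drops the "json" class that toJSON() attaches.
    Data = charToRaw(as.character(json_text)),
    PartitionKey = partition_key
  )
}

# Usage (needs AWS credentials and an existing stream, so it is commented out):
# client <- paws::kinesis(config = list(region = "us-east-1"))
# put_json_record(client, "my_stream", json_text, uuid::UUIDgenerate())
```

In the original loop this would amount to replacing Data = random_user with Data = charToRaw(random_user).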

edgBR commented 3 years ago

Hi @davidkretch,

I think aligning the behavior with boto3 will help a lot and will also increase the adoption of paws.

Do you have a list of the paws methods that expect a binary object?

BR /Edgar

davidkretch commented 3 years ago

There is unfortunately no list of methods that expect a binary object. Paws is generated from AWS's own API definitions, so a parameter expects a binary object whenever the AWS definition does. For example, the documentation for the Data parameter of Kinesis's put_record operation states:

"The data blob to put into the record, which is base64-encoded when the blob is serialized. When the data blob (the payload before base64-encoding) is added to the partition key size, the total size must not exceed the maximum record size."
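The size rule in that quote can be checked before sending a record. The sketch below is only illustrative: record_fits is a hypothetical helper, the default max_bytes of 1 MiB is an assumption that should be checked against the current Kinesis quotas, and measuring the partition key in bytes is likewise an assumption.

```r
# Hypothetical helper: TRUE if the data blob plus the partition key fit
# within max_bytes. The default limit is an assumption, not taken from Paws.
record_fits <- function(data, partition_key, max_bytes = 1024^2) {
  # Measure the payload before base64 encoding, as the quoted text describes.
  data_bytes <- if (is.raw(data)) length(data) else nchar(data, type = "bytes")
  key_bytes <- nchar(partition_key, type = "bytes")
  (data_bytes + key_bytes) <= max_bytes
}

# Example with a small hypothetical payload and key.
fits <- record_fits(charToRaw('{"nat":"CH"}'), "user-1")
```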

Python's SDK is obviously being more helpful in this case, in that you don't have to provide the blob yourself. We'll need to look into what Python is doing, since different services might have different needs. S3, for example, has similar requirements, but in that case Paws has a custom way of handling them that is particular to S3.
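Until such a change lands, a user-side coercion step is one way to get similar convenience. The sketch below is not what Paws or the Python SDK actually implement. as_blob is a hypothetical helper that passes raw vectors through unchanged and converts a single string to its UTF-8 bytes.

```r
# Hypothetical helper: accept either a raw vector or a single string and
# always return a raw vector suitable for a blob parameter such as Data.
as_blob <- function(x) {
  if (is.raw(x)) {
    return(x)
  }
  if (is.character(x) && length(x) == 1) {
    # Encode as UTF-8 so non-ASCII text (e.g. "Hüttlingen") is sent consistently.
    return(charToRaw(enc2utf8(as.character(x))))
  }
  stop("Data must be a raw vector or a single character string")
}

# Usage with the call from the original script (commented out; needs AWS access):
# client$put_record(StreamName = "my_stream",
#                   Data = as_blob(random_user),
#                   PartitionKey = partition_key)
```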