randyzwitch / RSiteCatalyst

R package to access Adobe Analytics Reporting API v1.4
https://randyzwitch.com/rsitecatalyst

QueueDataWarehouse throws error with enqueueOnly = FALSE #198

Closed jdeboer closed 7 years ago

jdeboer commented 7 years ago

Hi,

I can perform the following Data Warehouse query successfully with enqueueOnly = TRUE, where the report is delivered via FTP, but when enqueueOnly = FALSE I receive the error shown below:

report <- QueueDataWarehouse(
  reportsuite.id = <hidden>,
  date.from = "2017-01-08",
  date.to = "2017-01-09",
  metrics = c("visits", "pageViews"),
  elements = c("visitorid", "ipaddress", "visitnumber",
                        "returnfrequency", "entrypage", "exitpage"),
  date.granularity = "hour",
  ftp = list(
   host = <hidden>,
   port = "21",
   directory = "/",
   username = <hidden>,
   password = <hidden>,
   filename = "RSiteCatalyst.csv"
  ),
  enqueueOnly = FALSE
)
[1] "Requesting URL attempt #1"
[1] "Requesting URL attempt #2"
[1] "Requesting URL attempt #3"
[1] "Requesting URL attempt #4"
[1] "Requesting URL attempt #5"
[1] "Requesting URL attempt #6"
[1] "Received  report."
Error in names(working.metrics) <- metrics : 
  'names' attribute [4] must be the same length as the vector [3]

A traceback of the error shows the following:

9: BuildInnerBreakdownsRecursively(working.element, elements, metrics, 
       current.recursion.level + 1, context, accumulator, date.range = date.range)
8: BuildInnerBreakdownsRecursively(working.element, elements, metrics, 
       current.recursion.level + 1, context, accumulator, date.range = date.range)
7: BuildInnerBreakdownsRecursively(working.element, elements, metrics, 
       current.recursion.level + 1, context, accumulator, date.range = date.range)
6: BuildInnerBreakdownsRecursively(working.element, elements, metrics, 
       current.recursion.level + 1, context, accumulator, date.range = date.range)
5: BuildInnerBreakdownsRecursively(data, elements, metrics, 1, c())
4: ParseDW(report.data)
3: GetReport(report.id, interval.seconds = interval.seconds, max.attempts = max.attempts, 
       print.attempts = TRUE)
2: SubmitJsonQueueReport(toJSON(report.description), interval.seconds = interval.seconds, 
       max.attempts = max.attempts, validate = validate, enqueueOnly = enqueueOnly)
1: QueueDataWarehouse(reportsuite.id = <hidden>, date.from = "2017-01-08", 
       date.to = "2017-01-09", metrics = c("visits", "pageViews"), 
       elements = c("visitorid", "ipaddress", "visitnumber",
           "returnfrequency", "entrypage", "exitpage"), date.granularity = "hour", 
       ftp = list(host = <hidden>, port = "21", directory = "/", 
           username = <hidden>, password = <hidden>, filename = "RSiteCatalyst.csv"), 
       enqueueOnly = FALSE)
randyzwitch commented 7 years ago

Thanks, this is pretty much #196, in that enqueueOnly = FALSE doesn't really work. Need to figure out how to parse really complex files.

jdeboer commented 7 years ago

For the internal RSiteCatalyst:::ParseDW function, I'm thinking that either tidyr::unnest or jsonlite::flatten might be helpful.

Because the nested data frames returned by the API have the same column names ('breakdown' and 'counts') at each level, jsonlite::flatten doesn't work as expected: the flattened columns overwrite one another (perhaps a bug in jsonlite?).

With tidyr::unnest, the following might work for unnesting the breakdown elements (dimensions), but I'm not quite sure how to unnest the counts, as they would need to be converted into additional columns rather than rows:

  # Recursively unnest until there are no more lists in the data frame
  formatted.df <- data
  while(any(sapply(formatted.df, class) == "list")) {
    formatted.df <- tidyr::unnest(formatted.df, breakdown)
  }
jdeboer commented 7 years ago

Adding to my comment above, I've looked further into how to unnest the counts. For this, I propose using plyr::ldply with all optional arguments set to their defaults, i.e. plyr::ldply(counts).
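
For illustration, here is a minimal sketch of that idea. It assumes df is an already-unnested data frame whose counts column is a list of equal-length character vectors, and that metrics holds the requested metric names (df and metrics are placeholders, not package code):

library(plyr)

# Bind each counts vector row-wise: one row per element, one column per metric
counts.df <- plyr::ldply(df$counts)
names(counts.df) <- metrics  # e.g. c("visits", "pageViews")

# Swap the list-column for the spread-out metric columns
df <- cbind(df[setdiff(names(df), "counts")], counts.df)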

randyzwitch commented 7 years ago

Thanks for plugging away at this @jdeboer! Your approach sounds good; I'm definitely of the mind that any working functionality is better than broken functionality.

jdeboer commented 7 years ago

Thanks Randy. An alternative approach I'm considering is to recursively rename the 'name' columns of the nested 'counts' and 'breakdown' data frames so that they are each distinct. The hope is that jsonlite::flatten will then work in the desired way. Ideally I would aim to use the names of the actual distinct metrics, elements and chosen granularity level, where appropriate.
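
A sketch of that renaming idea (a hypothetical helper, not package code) that walks the nested breakdown data frames and gives each level's name column a distinct prefix:

rename_levels <- function(df, level = 1) {
  # Make the 'name' column unique per nesting depth
  names(df)[names(df) == "name"] <- paste0("level", level, ".name")
  # Recurse into the nested breakdown data frames, if any
  if ("breakdown" %in% names(df)) {
    df$breakdown <- lapply(df$breakdown, rename_levels, level = level + 1)
  }
  df
}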

jdeboer commented 7 years ago

I've come across an issue with the Adobe Data Warehouse API feed.

Here is an example of what I might see in the report's 'data' field when I do not set a granularity:

[{
  "breakdown": [{
    "name": "1104789101982705436_424952459607984298",
    "counts": ["0", "1"]
   }, {
    "name": "1142707252089472126_5298667785999128894",
    "counts": ["0", "1"]
   }, {
    "name": "1212797736425383838_4662671777624316307",
    "counts": ["0", "1"]
   }
  ],
  "name": "www.example-domain.com:home_page"
 }
]

Whereas here is an example of what I might see in the same field when I do set a granularity:

[{
  "name": "January 17, 2017, Hour 0",
  "breakdown": [{
    "name": "www.example-domain.com:home_page",
    "counts": [
     "1104789101982705436_424952459607984298", "0", "1",
     "1142707252089472126_5298667785999128894", "0", "1",
     "1212797736425383838_4662671777624316307", "0", "1"
    ]
   }
  ]
 }
]

Notice the counts field now has the visitor ID and the page view counts together as a single array, rather than creating separate 'breakdown' levels for visitor and page.
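
For what it's worth, the interleaved array can still be untangled when the number of interleaved fields is known. A sketch using the values above (the column names are assumptions based on this particular query):

counts <- c("1104789101982705436_424952459607984298", "0", "1",
            "1142707252089472126_5298667785999128894", "0", "1",
            "1212797736425383838_4662671777624316307", "0", "1")

# One element (visitor ID) plus two metrics means three columns, filled row by row
m <- matrix(counts, ncol = 3, byrow = TRUE)
df <- data.frame(visitorid = m[, 1],
                 visits = as.numeric(m[, 2]),
                 pageviews = as.numeric(m[, 3]),
                 stringsAsFactors = FALSE)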

jdeboer commented 7 years ago

I'm going to attempt to get the report from the API in CSV format instead of JSON, and then coerce it into a data frame.

randyzwitch commented 7 years ago

If CSV is an option, then I've been doing this the hard way for years!


jdeboer commented 7 years ago

Yes, I read about this option in the API documentation: https://marketing.adobe.com/developer/pt/documentation/data-warehouse/r-report-2 (refer to the section "Results via API").

I've just had a go and it seems to work well.

In the GetReport function, I propose adding an argument, format, with a default of "json" but also accepting "csv". Then I propose adding request.body$format <- unbox(format) to the function body prior to performing the API request.
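
In sketch form, the change might look like this inside GetReport (the surrounding body is elided; unbox comes from jsonlite):

GetReport <- function(report.id, format = "json", ...) {
  # ... existing request construction ...
  request.body$format <- unbox(format)  # "json" (default) or "csv"
  # ... perform the API request as before ...
}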

To parse the CSV response from the API, captured in response_content, I recommend using str_replace from the stringr package and read_csv from the readr package, as follows:

data <- readr::read_csv(str_replace(response_content, "^<U\\+FEFF>", ""))

This will remove the byte order mark <U+FEFF> from the CSV response prior to parsing the response into a data frame.

randyzwitch commented 7 years ago

I'm pretty indifferent to the implementation, except for adding a readr dependency rather than just using the base method (unless it is truly necessary). Are these changes something you are comfortable submitting a pull request for, or is it something you'd need me to help with or do myself?
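
For reference, a base-R version of the same parsing step might look like this (a sketch, assuming response_content holds the raw CSV text returned by the API):

# Strip the UTF-8 byte order mark, then parse with base R
clean <- sub("^\ufeff", "", response_content)
data <- read.csv(text = clean, stringsAsFactors = FALSE)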


jdeboer commented 7 years ago

I'll give it a shot, but it would be great to get your input on the parts of the package that assume a JSON response, as requesting a CSV response would cause them to break: checking for an error message returned from the API, determining the returned report type, and checking for a segment ID in the response.

jdeboer commented 7 years ago

OK, I think I've sorted this out now and have submitted a pull request with the proposed code changes. @wmcraver, once @randyzwitch has reviewed and merged the pull request, it would be great if you could please check that the changes also resolve #196 for you. Alternatively, if you want to check sooner, I've pushed the changes to my RSiteCatalyst fork, which you can install via devtools::install_github("jdeboer/RSiteCatalyst")

randyzwitch commented 7 years ago

@wmcraver and @jdeboer, I've merged into master what I hope is the last commit before releasing this bug fix. If you could, please test the master branch using devtools and let me know how it works, so I can submit this to CRAN.

wmcraver commented 7 years ago

@randyzwitch & @jdeboer when I use devtools::install_github("randyzwitch/RSiteCatalyst") I get the following error: ERROR: dependency 'digest' is not available for package 'RSiteCatalyst'. Is there another way to pull this?

randyzwitch commented 7 years ago

If you are using RSiteCatalyst already, I would think digest would be installed? Can digest be installed outside of RSiteCatalyst?


wmcraver commented 7 years ago

I'm not sure what the deal was with my work computer. I reinstalled digest and was able to install RSiteCatalyst from your repo. The QueueDataWarehouse() function is now pulling results and storing them in a variable, but I am unable to replicate my DW CSV that is emailed every morning.

When I request:

newX = QueueDataWarehouse(rs,
                          today() - 1,
                          today() - 1,
                          metrics = "orders",
                          elements = c("evar127",
                                       "evar23",
                                       "evar25",
                                       "evar26",
                                       "lasttouchchanneldetail",
                                       "evar55"),
                          enqueueOnly = FALSE)

I receive 15,156 records. However, after inspecting them, eVar127 (order number) has no values. The Data Warehouse CSV contains 55,778 rows: 47,560 records without eVar127 and 8,217 with it.

randyzwitch commented 7 years ago

Thanks for testing. Earlier I told @jdeboer that paging through the results probably wouldn't be necessary to implement, and your first test shows that it is important. :)

So, it's great that you got some data back (i.e. the original error is solved), but downloading all the pages should be implemented for release. I'll try and do that soon.
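
For context, the shape of the fix is roughly the following loop (a sketch only; GetReportPage is a hypothetical stand-in for requesting a single page of a completed Data Warehouse report):

all.pages <- list()
page <- 1
repeat {
  chunk <- GetReportPage(report.id, page = page)  # hypothetical helper
  if (nrow(chunk) == 0) break  # no more pages to fetch
  all.pages[[page]] <- chunk
  page <- page + 1
}
report <- do.call(rbind, all.pages)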

randyzwitch commented 7 years ago

New error:

> library(RSiteCatalyst)
> report.id <- QueueDataWarehouse("zwitchdev",
+                                 "2014-01-01",
+                                 "2017-02-02",
+                                 c("visits", "pageviews", "instances"),
+                                 c("page", "browser"),
+                                 enqueueOnly=TRUE
+ )
Error in data.frame(ftp) :
  promise already under evaluation: recursive default argument reference or earlier problems?
3: data.frame(ftp)
2: unbox(data.frame(ftp)) at QueueDataWarehouse.R#113
1: QueueDataWarehouse("zwitchdev", "2014-01-01", "2017-02-02", c("visits",
       "pageviews", "instances"), c("page", "browser"), enqueueOnly = TRUE)
randyzwitch commented 7 years ago

@wmcraver, hopefully this is the last time I need you to test! I've now added paging for Data Warehouse results, with a test returning a few hundred thousand rows. If you could, let me know if this gets you what you need, so I can close this issue and also post a new CRAN release, since this is a semi-serious bug.

wmcraver commented 7 years ago

QueueDataWarehouse() is now returning the same number of results as Adobe; in my case, the test returned 54,935 records. However, in my request I asked for an eVar containing an order number that is a concatenation of the order number and date, separated by a pipe. Example: 1234567890|2017-02-23. The rows that should have a confirmation number instead have a number starting at 1 and incrementing all the way to 6,080. The first 11,733 records have a value of 1 for this eVar, and then it appears to increment by 1 for each row.

My request was: newX = QueueDataWarehouse(rs, today()-1, today()-1, metrics = "orders", element = c("evar127", "evar23", "evar25", "evar26", "lasttouchchanneldetail", "evar55"), enqueueOnly = F)

evar127 = order number

Please let me know if you need any additional information on this.

Thank you for your work on this package!

randyzwitch commented 7 years ago

Hahaha, what? :)

If you could, run SCAuth with debug.mode = TRUE so that the .csv files will be saved locally on your computer. Then let me know whether those files are correct, so that I know if the RSiteCatalyst parser is somehow broken or if the API is not returning the expected results. I can't even imagine how this could get messed up by RSiteCatalyst, but if your data contains a pipe, maybe it's confusing the R CSV reader.
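
For clarity, enabling debug mode looks something like this (the credentials are placeholders):

# With debug.mode = TRUE, raw API responses, including the Data Warehouse
# .csv files, are saved to the working directory
SCAuth("user:company", "shared_secret", debug.mode = TRUE)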

wmcraver commented 7 years ago

Tested again with debug.mode = TRUE and checked the CSV files: they are correct.

randyzwitch commented 7 years ago

I just pushed a change to master that might fix this; I think it could be an issue with the file encoding. If you could, pull from master and re-run the QDW function once more; hopefully this fixes the problem once and for all.

wmcraver commented 7 years ago

I updated and processed the request again. Same result as before.

randyzwitch commented 7 years ago

If switching the file encoding on read.csv doesn't work, I'm not sure what else there is to try.

wmcraver commented 7 years ago

I notice that when I read the files in manually, using data = read.csv("PostRequest_1487960697.02758.csv", fileEncoding = "UTF-8-BOM"), the order numbers come in just fine. The rows are also read in correctly when fileEncoding is set to "UTF-8".

randyzwitch commented 7 years ago

Thanks. Can you pull from master one more time? I think I did this commit right after telling you to try again, so maybe you were too diligent :)

https://github.com/randyzwitch/RSiteCatalyst/blob/master/R/ApiRequest.R#L107

wmcraver commented 7 years ago

I tried updating again, but it looks like I was already using the current version. I forced a new install and tried again. Same result.

randyzwitch commented 7 years ago

Unfortunately, I'm not sure what the solution is then. We both came to the same conclusion, and there's only one place in the code where read.csv is being called.

For now, I'm tempted to mark this as fixed, since this issue was really about paging results and making enqueueOnly = FALSE work. If it continues to be an issue with other extracts, let me know. I'll try to set up a variable on my side with a pipe in it, so that I can do more testing.