ax42 opened this issue 9 years ago:
OK, I've done some rough profiling, and I think the line
xls <- lapply(lapply(xns, xmlToList), enlist)
in rforcecom.query()
is the issue. For a query that takes 4.2 seconds overall (427 records, so no refetching is needed), around 1 second goes to the data transfer (the same as with another REST tool) and 2.2 seconds go to the line above -- 2.2x the time the transfer itself takes.
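For reference, I measured that line in isolation with something along these lines (a sketch: xns and enlist are internal to rforcecom.query, so this assumes you've captured them, e.g. by stepping through the function with debug()):

```r
# Time just the XML-to-list conversion step, separately from the
# HTTP transfer. xns and enlist were captured from inside
# rforcecom.query() via debug(); this is a sketch, not package code.
library(XML)

system.time(
  xls <- lapply(lapply(xns, xmlToList), enlist)
)
```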
A query which returns more records (~1'400) makes the discrepancy worse.
I'd be interested in trying to improve this. Is there a faster XML library? Does one really need to convert the whole XML to a list? Would JSON be quicker?
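One quick way to test the library question would be to save a raw response body to disk and time the two parsers side by side. A sketch -- the file name and the microbenchmark package are my additions:

```r
# Compare XML and xml2 parse times on a captured Salesforce response.
library(microbenchmark)
library(XML)
library(xml2)

raw <- readChar("query_response.xml", file.size("query_response.xml"))

microbenchmark(
  XML  = XML::xmlToList(XML::xmlParse(raw, asText = TRUE)),
  xml2 = xml2::as_list(xml2::read_xml(raw)),
  times = 20
)
```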
More updates:
OK, I've done some further benchmarking -- it seems that we can get some good speed increases if we use curl instead of RCurl (~50% speed improvement on a small test) and xml2 instead of XML.
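For example, a rough sketch of what the fetch-and-parse could look like with curl and xml2 (the endpoint path, the Accept header, and the session field names are my assumptions, not the package's current code):

```r
# Fetch a SOQL query result with curl and parse it with xml2.
# Assumes the RForcecom session vector exposes sessionID, instanceURL
# and apiVersion, and that we ask the REST API for XML explicitly.
library(curl)
library(xml2)

h <- new_handle()
handle_setheaders(h,
                  "Authorization" = paste("Bearer", session['sessionID']),
                  "Accept"        = "application/xml")

url <- paste0(session['instanceURL'], "services/data/v",
              session['apiVersion'], "/query/?q=",
              URLencode("SELECT Id, Name FROM Account"))

res <- curl_fetch_memory(url, handle = h)
doc <- read_xml(rawToChar(res$content))

# One named character vector per <records> element.
records <- xml_find_all(doc, "//*[local-name()='records']")
rows <- lapply(records, function(r) {
  kids <- xml_children(r)
  setNames(xml_text(kids), xml_name(kids))
})
```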
@hiratake55 -- I think refactoring the whole library in one shot will be a challenge; what is your appetite for doing this one function at a time? It would mean requiring both XML libraries and both curl libraries (as some functions would use one, and others the other).
Hi @ax42,
Thank you for contacting me. I'll check whether the xml2 package fits RForcecom or not.
@ax42 Have you considered using the Bulk API features of the package? Below are some timings with roughly 650K records and I've pulled 50K in a second or two. Salesforce caches the queries, so they become faster if you repeat them.
Note: rforcecom.bulkQuery is a convenience wrapper I've written around rforcecom.submitBulkQuery and friends. I'll submit a PR today and hopefully have it included in the package as a standard function; I'll post its code here as well, so you have it.
# before Salesforce caching
# Salesforce runtime = 69 seconds, so ~11 seconds of overhead from R
system.time(d <- rforcecom.bulkQuery(session,
                                     soqlQuery = "Select Id from Account",
                                     object = 'Account',
                                     interval_seconds = 1,
                                     max_attempts = 100,
                                     verbose = FALSE))
#>    user  system elapsed
#>    0.24    0.05   80.52
dim(d)
#> [1] 644239      1

# after Salesforce caching
# Salesforce runtime = 16 seconds, so ~10 seconds of overhead from R
system.time(d <- rforcecom.bulkQuery(session,
                                     soqlQuery = "Select Id from Account",
                                     object = 'Account',
                                     interval_seconds = 1,
                                     max_attempts = 100,
                                     verbose = FALSE))
#>    user  system elapsed
#>    0.24    0.05   26.17
dim(d)
#> [1] 644239      1
rforcecom.bulkQuery
rforcecom.bulkQuery <- function(session,
                                soqlQuery,
                                object,
                                interval_seconds = 5,
                                max_attempts = 100,
                                verbose = FALSE) {
  # create a bulk query job and submit the SOQL query as a single batch
  job_info <- rforcecom.createBulkJob(session, operation = 'query', object = object)
  batch_query_info <- rforcecom.submitBulkQuery(session,
                                                jobId = job_info$id,
                                                query = soqlQuery)
  # poll until the batch completes or we run out of attempts
  status_complete <- FALSE
  z <- 1
  Sys.sleep(interval_seconds)
  while (z < max_attempts && !status_complete) {
    if (verbose) {
      message(paste0("Pass #", z))
    }
    Sys.sleep(interval_seconds)
    batch_query_status <- rforcecom.checkBatchStatus(session,
                                                     jobId = batch_query_info$jobId,
                                                     batchId = batch_query_info$id)
    status_complete <- (batch_query_status$state == 'Completed')
    z <- z + 1
  }
  if (!status_complete) {
    message(paste('Batch', batch_query_info$id,
                  'did not complete after', max_attempts, 'status checks'))
    # return whatever batch details are available, for debugging
    batch_query_details <- NULL
    tryCatch({
      batch_query_details <- rforcecom.getBatchDetails(session,
                                                       jobId = batch_query_info$jobId,
                                                       batchId = batch_query_info$id)
    }, error = function(e) {
      # ignore: the batch details may not exist yet
    })
    # close the job before returning early
    close_job_info <- rforcecom.closeBulkJob(session, jobId = job_info$id)
    return(batch_query_details)
  }
  # fetch the result set, close the job, and return the records
  batch_query_details <- rforcecom.getBatchDetails(session,
                                                   jobId = batch_query_info$jobId,
                                                   batchId = batch_query_info$id)
  batch_query_recordset <- rforcecom.getBulkQueryResult(session,
                                                        jobId = batch_query_info$jobId,
                                                        batchId = batch_query_info$id,
                                                        resultId = batch_query_details$result)
  close_job_info <- rforcecom.closeBulkJob(session, jobId = job_info$id)
  return(batch_query_recordset)
}
That's cool, thanks. I'd not seen the Bulk API pieces, as I seem to be using v0.7 (off CRAN).
One thing the bulk query does not seem to be able to do is deal with foreign keys (e.g. fetching the details of an Account's owner). Salesforce returns:
Foreign Key Relationships not supported in Bulk Query
So it's probably really useful in some situations (straight dumps) but not in others (complex queries), although it may be faster to pull straight dumps off SF and combine them in R than to run complex queries.
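(For reference, the kind of relationship query that triggers the error -- Owner.Name is just an illustrative field -- runs through the regular query function:)

```r
# Runs via the regular (non-bulk) query, but is rejected by the
# Bulk API with the error above. Owner.Name is an illustrative field.
d <- rforcecom.query(session,
                     soqlQuery = "Select Id, Name, Owner.Name from Account")
```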
Yes, for large joins I would recommend pulling straight dumps of each object and joining in R (I use the dplyr package for joins). For small sets I would recommend using the SOAP-based function rforcecom.query, just to cut down on R code. I haven't run performance tests to see which does the join faster on large datasets (R vs. Salesforce), but it would be interesting to know.
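For illustration, here is a minimal sketch of the dump-then-join approach (the objects and fields are just examples):

```r
library(dplyr)

# Straight dumps of each object via the Bulk API...
accounts <- rforcecom.bulkQuery(session,
                                soqlQuery = "Select Id, Name, OwnerId from Account",
                                object = 'Account')
users <- rforcecom.bulkQuery(session,
                             soqlQuery = "Select Id, Name from User",
                             object = 'User')

# ...then the "foreign key" lookup becomes an ordinary join in R.
accounts_with_owner <- accounts %>%
  left_join(users, by = c("OwnerId" = "Id"), suffix = c("", ".owner"))
```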
If you want to experiment with some of the Bulk functions (since they're not on CRAN yet), you can install from the maintainer's GitHub or mine.
GitHub install:
library(devtools)
install_github('ReportMort/RForcecom')
I've installed the GitHub version from the maintainer, and just copy/pasted the code you kindly provided. I'll try and get some benchmarks done over the next few days (although it seems our Salesforce instance is a lot smaller than yours).
My workflow so far has been to use http://dataloader.io to help me formulate my queries, and then I run them with rforcecom.query, which is why I have been focusing on benchmarking and/or improving that function specifically. Moving to xml2 should already make a huge difference.
The advantage of letting Salesforce do the joins is that you don't have to worry about consistency, and you always get your dataset back exactly as you want it (especially if you are calling a bunch of lookup fields in a query). Each approach has its place, and having both available in the library is great!
Any update on this? My queries from SFDC are extremely slow so far.
Hi,
RForcecom feels slow -- it's taking about 4 minutes to execute a query which returns about 18'500 records with 9 variables.
I haven't started tracing through this to figure out where the issue could be (Salesforce? Network? XML vs JSON?). What's the best way to start digging into this issue?
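For anyone digging in, one starting point is to profile a single call and see how the time splits between the HTTP transfer and the XML handling. A sketch -- the query is illustrative:

```r
# Profile one slow query. Note that Rprof samples CPU time, so time
# spent waiting on the network is under-represented; compare the
# profile against the wall-clock total from system.time().
library(RForcecom)

session <- rforcecom.login(username, password)

Rprof("rforcecom_query.out")
d <- rforcecom.query(session, soqlQuery = "Select Id, Name from Account")
Rprof(NULL)

head(summaryRprof("rforcecom_query.out")$by.total)
```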