Closed cboettig closed 6 years ago
@cboettig The Query and QueryResults classes are low level. Instead of refactoring these classes, it may be useful to create a higher level function/class that calls Query/QueryResults to perform a query and accumulate all the results into a desired format.
hmm... looks like the serialization return types are part of the W3C spec and the existing redland implementation though, so I wasn't sure I should be writing a separate thing. On the other hand, I see your point about it being a rather low-level return object anyhow.
I'm working on some higher-level wrappers around the redland package, just to simplify common calls (and also make them work with JSON-LD data) at https://github.com/cboettig/rdflib (figured it made the most sense to do this as a separate package).
I've added a function that attempts to tabularize the results of the query now, to return a data.frame instead, which is probably preferred by R users.
Currently my function trims the duck types (^^) off of the strings and returns everything as characters; ideally it should be coercing them into the native R types, but there's no guarantee we can parse the declared type and successfully convert it, so from a practical standpoint I think it is easier to leave that to the user, who can handle the type conversions pretty easily once things are in a data.frame (i.e. as we do all the time working from our favorite csv format...)
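For illustration, here is a minimal base R sketch of the "coerce only when it is safe" idea. It assumes typed literals arrive in the form "value"^^<datatype-URI>, which may not match exactly what the query returns; split_typed_literal and coerce_column are hypothetical names, not part of redland or rdflib.

```r
# Hypothetical helper: split typed literals of the assumed form
#   "value"^^<datatype-URI>
# into the bare value and the declared datatype (NA when untyped).
split_typed_literal <- function(x) {
  pattern <- '^"(.*)"\\^\\^<([^>]*)>$'
  typed <- grepl(pattern, x)
  value <- ifelse(typed, sub(pattern, "\\1", x), x)
  datatype <- ifelse(typed, sub(pattern, "\\2", x), NA_character_)
  data.frame(value = value, datatype = datatype, stringsAsFactors = FALSE)
}

# Hypothetical helper: convert a column only when every entry declares the
# same supported XSD type and the conversion succeeds for all values;
# otherwise fall back to the stripped character values.
coerce_column <- function(x) {
  parts <- split_typed_literal(x)
  xsd <- "http://www.w3.org/2001/XMLSchema#"
  types <- unique(parts$datatype)
  if (length(types) != 1L || is.na(types)) return(parts$value)
  converter <- switch(sub(xsd, "", types, fixed = TRUE),
    integer = as.integer,
    decimal = , double = , float = as.numeric,
    NULL)
  if (is.null(converter)) return(parts$value)
  converted <- suppressWarnings(converter(parts$value))
  if (anyNA(converted)) parts$value else converted
}
```

Anything the sketch cannot convert cleanly stays as character, so the user can still handle it by hand once it is in the data.frame.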
See https://github.com/cboettig/rdflib/blob/master/R/rdf.R#L203-L212
@cboettig your rdflib package looks really useful. You might consider using the freeQuery() and freeQueryResults() methods, as the redland package's calls to the redland C library manage memory manually, and that memory needs to be released when no longer needed. This is one of the reasons I still want to put a higher level query function in the redland package that would take care of this kind of housekeeping.
Yeah, good call, I wasn't sure if any of that was built in. Ideally I think there'd be some link into the R garbage collection that would take care of this, i.e. once a pointer was removed from the R environment? (For instance, I think the xml2 package does this?)
It probably needs that on the parsers/serializers as well.
But maybe it would be sufficient, if not maximally computationally efficient, to just run the free calls once the function is done collecting the results from the query.
Probably less obvious what to do in the serialize/parse case.
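A rough sketch of the "free once the function is done" idea, using on.exit so the release also happens if collection fails partway. To keep it runnable without redland, the iterator and free functions are passed in as arguments; collect_results and the fake_* stand-ins are hypothetical, and the assumption that the iterator returns NULL when exhausted should be checked against getNextResult.

```r
# Hypothetical helper: drain a result iterator, then release it.
# `next_result` and `free` are arguments so the sketch runs without redland.
collect_results <- function(handle, next_result, free) {
  # Runs when the function exits, whether normally or through an error
  on.exit(free(handle), add = TRUE)
  rows <- list()
  repeat {
    row <- next_result(handle)
    if (is.null(row)) break  # assumed end-of-results signal
    rows[[length(rows) + 1L]] <- as.data.frame(row, stringsAsFactors = FALSE)
  }
  if (length(rows) == 0L) return(data.frame())
  do.call(rbind, rows)
}

# Stand-in iterator over two made-up rows, used only to exercise the helper
make_fake_handle <- function() {
  handle <- new.env()
  handle$rows <- list(list(s = "a", o = "1"), list(s = "b", o = "2"))
  handle$pos <- 0L
  handle$freed <- FALSE
  handle
}
fake_next <- function(h) {
  if (h$pos >= length(h$rows)) return(NULL)
  h$pos <- h$pos + 1L
  h$rows[[h$pos]]
}
fake_free <- function(h) h$freed <- TRUE

handle <- make_fake_handle()
results <- collect_results(handle, fake_next, fake_free)
```

With redland, the same shape would presumably take getNextResult as the iterator and freeQueryResults as the release step, but that wiring is untested here.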
@cboettig I'll have a look at how xml2 frees memory - this happens with the xml_remove() function.
@gothub Looks like xml2 frees memory by taking advantage of helper utilities built into Rcpp, see https://www.r-bloggers.com/external-pointers-with-rcpp/, but I'm not sure if that'll do any good for the redland C bindings.
However, @richfitz pointed me to the official R-ext docs on external pointers, which, if I've understood him correctly, will do much the same thing to handle removal of pointers via garbage collection. See: https://cran.r-project.org/doc/manuals/r-release/R-exts.html#External-pointers-and-weak-references
Rich has a working example of this in his packages, e.g.
here's an example from one of mine: https://github.com/richfitz/thor/blob/ff82d7/src/thor.c#L524-L530 (use) and https://github.com/richfitz/thor/blob/ff82d7/src/thor.c#L569-L575 (definition) - there's a declaration at the top of the file too. Follows the pattern in R-exts pretty closely but with the r_mdb_get_env function that checks for invalid pointer access too
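The links above register the finalizer from C. As a rough R-level analogue, base R's reg.finalizer can attach a cleanup function to an environment or external pointer. In the sketch below an environment stands in for the pointer a C binding would return, and wrap_handle is a hypothetical name; whether this can be applied directly to the pointers redland holds is not something I've verified.

```r
# Hypothetical wrapper: tie a cleanup function to the lifetime of an R object.
# An environment stands in for the external pointer a C binding would return.
wrap_handle <- function(free) {
  handle <- new.env()
  # `free` is called with the handle when it is garbage collected;
  # onexit = TRUE also runs it when the R session ends
  reg.finalizer(handle, free, onexit = TRUE)
  handle
}

freed_count <- 0L
h <- wrap_handle(function(handle) freed_count <<- freed_count + 1L)
rm(h)            # drop the only reference to the handle
invisible(gc())  # a collection makes the finalizer eligible to run
```

The appeal is that users never call a free function themselves; the cost is that the timing of the release is up to the garbage collector.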
Closing this out as it looks like "Change output format for query results?" is not possible at the low level but can be satisfactorily addressed with post-processing in a higher level wrapper.
Also moved comments re the tangent about memory management to the appropriate open issue thread.
@gothub Okay, did a little more digging here.
It looks like there are bindings but not registered S4 methods for a lot of the librdf functionality to handle query results.
http://librdf.org/docs/api/redland-query-results.html
In particular, librdf_query_results_formatter_write, librdf_query_results_to_file, and librdf_query_results_to_string look promising as far as requesting other formats. I'm guessing this would allow the user to get the results returned in any of the supported types (XML, csv, etc). I think this would mostly be valuable in the cases where the returned results are very large, and streaming them into a file may be way more efficient than having to march through them one by one in an R loop calling getNextResult. What do you think?
Related to this, it seems like it would make sense to return an RDF Model for a SPARQL CONSTRUCT query (or other query that specifically returns triples / RDF). Looks like this just needs the appropriate methods wrapper around the librdf_* functions, but could be tricky. (Or maybe getNextResult covers that use case by just returning the triples; I'm not sure I've figured out how to write CONSTRUCT queries appropriately yet; I just get back an empty list so far.)
Were the S4 methods written manually or all automated by SWIG? (assuming the latter) Any idea why they would include getNextResult() but not S4 methods around librdf_query_results_to_file?
@cboettig regarding more options for retrieving query results - as an option to the current way to get results:
queryResult <- executeQuery(query, model)
result <- getNextResult(queryResult)
the functions getResults() and writeResults() could be added to Query.R. The call sequence could look like this:
queryResult <- executeQuery(query, model)
# Return all results as a string
result <- getResults(queryResult, format_uri, base_uri)
# Write all results to a file
writeResults(queryResult, file, mimeType, format_uri, base_uri)
👍 Excellent! yup, I think having such functions for getResults() and writeResults() would be grand.
@gothub Just a note that I'm realizing this would also be super useful for any SPARQL query that returns a large number of results. getNextResult isn't as slow as addStatement, but can still make a SPARQL query end up taking many minutes to return, even though the actual C call for the query runs in milliseconds.
Currently using:
getResults <- function(queryResult, format = "csv", ...){
  mimetype <- switch(format,
                     "csv" = "text/csv; charset=utf-8",
                     NULL)
  readr::read_csv(redland:::librdf_query_results_to_string2(
                    queryResult@librdf_query_results,
                    format, mimetype, NULL, NULL),
                  ...)
}
in rdflib to get reasonably performant SPARQL queries (the actual queries are fast! readr, which is pretty highly tuned, is now the slowest part).
It would be great to have access to such a method through redland; obviously it's not good for me to use ::: inside another package....
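The function above hands the serialized string to readr. For comparison, a base R sketch of the same parse step, with no extra dependency, might look like the following. The csv_text value is made up for illustration and only mimics the shape of CSV query results (a header row of variable names, then one row per result).

```r
# Made-up CSV text in the shape of SPARQL CSV results (header row = variables)
csv_text <- "s,o\nhttp://example.org/a,1\nhttp://example.org/b,2\n"

# Parse the whole string in one call instead of looping over results;
# stringsAsFactors = FALSE keeps text columns as character
results <- utils::read.csv(text = csv_text, stringsAsFactors = FALSE)
```

Note that read.csv guesses column types on its own, so the declared RDF datatypes are not consulted here.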
@cboettig getResults(query, model, formatName) has been added. The librdf_query_results_to_string2() function doesn't appear to honor the mimeType argument, as format always takes precedence and cannot be specified as NULL. The mimeType argument is another way to specify the output format. To avoid confusion, only a formatName argument is included for getResults.
Please have a look and let me know if this is usable by rdflib.
@cboettig I have a patch release of redland almost ready (just have to fix a problem on Windows) and would like to know if this function provides what you were looking for before I send this off to the good folks at CRAN
yup, this is good!
I realized of course that the librdf_* functions are already exported, no need to use the :::, which is nice since I've found the S4 calls can introduce a significant overhead when called repeatedly. But in this case, I think the new getResults function is perfect. Closing this out.
From what I can see, the redland SPARQL queries are being done with rasqal, which is supposed to support a wide variety of return formats for the SPARQL queries, but I don't see how this can be done. I'd really be happy just getting XML back, but all I can see is the 'get next result' in a list. (For instance, I have 2 SELECT variables in my query and my query gets 10 results, I get a length-20 list...)