ropensci / redland-bindings

Redland librdf language bindings
http://librdf.org/bindings/

Change output format for query results? #58

Closed cboettig closed 6 years ago

cboettig commented 6 years ago

From what I can see, the redland SPARQL queries are run through Rasqal, which is supposed to support a wide variety of return formats for SPARQL queries, but I don't see how to request one. I'd really be happy just getting XML back, but all I can see is the 'get next result' in a list. (For instance, with 2 SELECT variables in my query and 10 results, I get a length-20 list...)

gothub commented 6 years ago

@cboettig The Query and QueryResult classes are low level. Instead of refactoring these classes, it may be useful to create a higher level function/class that calls Query/QueryResults to perform a query and accumulate all the results into a desired format.

cboettig commented 6 years ago

hmm... looks like the serialization return types are part of the W3C spec and the existing redland implementation though, so I wasn't sure I should be writing a separate thing. On the other hand, I see your point about it being a rather low-level return object anyhow.

I'm working on some higher-level wrappers around the redland package, just to simplify common calls (and also make them work with JSON-LD data) at https://github.com/cboettig/rdflib (figured it made the most sense to do this as a separate package).

I've added a function that attempts to tabularize the results of the query now, to return a data.frame instead, which is probably preferred by R users.

Currently my function trims the datatype annotations (^^) off of the strings and returns everything as characters. Ideally it would coerce them into native R types, but there's no guarantee we can parse the declared type and convert it successfully, so from a practical standpoint I think it is easier to leave that to the user, who can handle the type conversions pretty easily once things are in a data.frame (i.e. as we do all the time when working from our favorite CSV format...)

See https://github.com/cboettig/rdflib/blob/master/R/rdf.R#L203-L212
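For readers following along, here is a minimal base-R sketch of the tabularizing idea described above. It is not the rdflib implementation linked here. The helper names (`strip_datatype`, `tabularize`), the sample values, and the exact string format of the bindings are assumptions made for illustration, and the sketch assumes every variable is bound in every result row.

```r
# Hypothetical flat list, as if two SELECT variables (?s and ?n) had been
# collected over three calls to getNextResult(). All values are made up.
flat <- list(
  s = "<http://example.org/a>",
  n = "\"1\"^^<http://www.w3.org/2001/XMLSchema#integer>",
  s = "<http://example.org/b>",
  n = "\"2\"^^<http://www.w3.org/2001/XMLSchema#integer>",
  s = "<http://example.org/c>",
  n = "\"3\"^^<http://www.w3.org/2001/XMLSchema#integer>"
)

# Drop a trailing ^^<datatype IRI> and any surrounding double quotes.
strip_datatype <- function(x) {
  x <- sub("\\^\\^<[^>]*>$", "", x)
  gsub("^\"|\"$", "", x)
}

# Regroup the flat list into one character column per variable name.
tabularize <- function(flat) {
  vars <- unique(names(flat))
  cols <- lapply(vars, function(v) {
    strip_datatype(unlist(flat[names(flat) == v], use.names = FALSE))
  })
  names(cols) <- vars
  as.data.frame(cols, stringsAsFactors = FALSE)
}

df <- tabularize(flat)

# Type conversion is left to the caller, as suggested above.
df$n <- as.integer(df$n)
```

Everything comes back as character first, so a value whose declared type cannot be parsed never causes the tabularizing step itself to fail.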

gothub commented 6 years ago

@cboettig your rdflib package looks really useful. You might consider using the freeQuery() and freeQueryResults() methods, because the redland package's calls into the Redland C library allocate memory manually, and that memory needs to be released when it is no longer needed. This is one of the reasons I still want to put a higher level query function in the redland package, one that would take care of this kind of housekeeping.

cboettig commented 6 years ago

Yeah, good call, I wasn't sure if any of that was built in. Ideally I think there'd be some link into R's garbage collection that would take care of this, i.e. once a pointer was removed from the R environment? (For instance, I think the xml2 package does this?)

It probably needs that on the parsers/serializers as well.

But maybe it would be sufficient, if not maximally computationally efficient, to just run the free calls once the function is done collecting the results from the query.

Probably less obvious what to do in the serialize/parse case.
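A rough sketch of the "free once the function is done" approach, using only the redland calls named in this thread. `collect_results` is a hypothetical helper name, `query` and `model` are assumed to be existing redland Query and Model objects, and the loop assumes getNextResult() returns NULL once the results are exhausted.

```r
# Sketch only: collect every result, then release the C-level result object.
# Requires the redland package when called; defining it does not.
collect_results <- function(query, model) {
  queryResult <- redland::executeQuery(query, model)

  # on.exit() runs when the function returns, including on error, so the
  # query results are freed even if collecting them fails part-way.
  on.exit(redland::freeQueryResults(queryResult), add = TRUE)

  out <- list()
  # Assumption: getNextResult() returns NULL when no results remain.
  while (!is.null(result <- redland::getNextResult(queryResult))) {
    out <- c(out, result)
  }
  out
}
```

The query object is deliberately not freed here, since the caller created it and may want to reuse it. The caller would still call freeQuery(query) when finished.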

gothub commented 6 years ago

@cboettig I'll have a look at how xml2 frees memory; this happens with the xml_remove() function.

cboettig commented 6 years ago

@gothub Looks like xml2 frees memory by taking advantage of helper utilities built into Rcpp, see https://www.r-bloggers.com/external-pointers-with-rcpp/, but I'm not sure if that'll do any good for the redland C bindings.

However, @richfitz pointed me to the official R-ext docs on external pointers, which, if I've understood him correctly, will do much the same thing to handle removal of pointers via garbage collection. See: https://cran.r-project.org/doc/manuals/r-release/R-exts.html#External-pointers-and-weak-references

Rich has a working example of this in his packages. In his words:

here's an example from one of mine: https://github.com/richfitz/thor/blob/ff82d7/src/thor.c#L524-L530 (use) and https://github.com/richfitz/thor/blob/ff82d7/src/thor.c#L569-L575 (definition) - there's a declaration at the top of the file too. Follows the pattern in R-exts pretty closely, but with the r_mdb_get_env function that checks for invalid pointer access too.
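The same idea can be illustrated at the R level with base R's reg.finalizer(), which attaches a cleanup function to an environment (or external pointer) and runs it when the object is garbage collected. This is only an analogy for the C-level pattern linked above, not the redland or thor code. The names `new_handle`, `release`, and `freed` are hypothetical, and an environment stands in for a handle to C-allocated memory.

```r
# Record of which handles have been released, for demonstration only.
freed <- character(0)

# Cleanup function. A real finalizer would call the C-level free routine.
release <- function(e) {
  freed <<- c(freed, e$id)
}

# Create a stand-in handle and register the finalizer on it.
new_handle <- function(id) {
  handle <- new.env()
  handle$id <- id
  # onexit = TRUE also runs the finalizer when the R session ends.
  reg.finalizer(handle, release, onexit = TRUE)
  handle
}

h <- new_handle("query-1")

# Drop the only reference, then trigger a collection so the finalizer runs.
rm(h)
invisible(gc())
```

After the collection, `freed` should contain "query-1". In normal use nobody calls gc() by hand. The point is that cleanup happens whenever R eventually collects the object, with no explicit free call from the user.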

cboettig commented 6 years ago

Closing this out as it looks like "Change output format for query results?" is not possible at the low level but can be satisfactorily addressed with post-processing in a higher level wrapper.

Also moved comments re the tangent about memory management to the appropriate open issue thread.

cboettig commented 6 years ago

@gothub Okay, did a little more digging here.

It looks like there are bindings but not registered S4 methods for a lot of the librdf functionality to handle query results.

http://librdf.org/docs/api/redland-query-results.html

In particular, librdf_query_results_formatter_write, librdf_query_results_to_file, and librdf_query_results_to_string look promising as far as requesting other formats. I'm guessing this would allow the user to get the results returned in any of the supported types (XML, CSV, etc). I think this would mostly be valuable in the cases where the returned results are very large, and streaming them into a file may be way more efficient than having to march through them one by one in an R loop calling getNextResult. What do you think?

cboettig commented 6 years ago

Related to this, it seems like it would make sense to return an RDF Model for a SPARQL CONSTRUCT query (or other query that specifically returns triples / RDF). Looks like this just needs the appropriate methods wrapper around the librdf_* functions, but could be tricky. (or maybe getNextResult covers that use case by just returning the triples; I'm not sure I've figured out how to write CONSTRUCT queries appropriately yet; I just get back an empty list so far).

Were the S4 methods written manually or all automated by SWIG? (assuming the latter) Any idea why they would include getNextResult() but not S4 methods around librdf_query_results_to_file?

gothub commented 6 years ago

@cboettig regarding more options for retrieving query results, as an alternative to the current way to get results:

queryResult <- executeQuery(query, model)
result <- getNextResult(queryResult)

the functions getResults() and writeResults() could be added to Query.R. The call sequence could look like this:

queryResult <- executeQuery(query, model)
# Return all results as a string
result <- getResults(queryResult, format_uri, base_uri)
# Write all results to a file
writeResults(queryResult, file, mimeType, format_uri, base_uri)

cboettig commented 6 years ago

👍 Excellent! yup, I think having such functions for getResults() and writeResults() would be grand.

cboettig commented 6 years ago

@gothub Just a note that I'm realizing this would also be super useful for any SPARQL query that returns a large number of results. getNextResult isn't as slow as addStatement, but can still make a SPARQL query end up taking many minutes to return, even though the actual C call for the query runs in milliseconds.

cboettig commented 6 years ago

Currently using:

getResults <- function(queryResult, format = "csv", ...){
  mimetype <- switch(format,
                     "csv" = "text/csv; charset=utf-8",
                     NULL)
  readr::read_csv(redland:::librdf_query_results_to_string2(
                            queryResult@librdf_query_results, 
                            format, mimetype, NULL, NULL), 
                  ...)
}

in rdflib to get reasonably performant SPARQL queries (the actual queries are fast! readr, which is pretty highly tuned, is now the slowest part).

It would be great to have access to such a method through redland. Obviously it's not good for me to use ::: inside another package....

gothub commented 6 years ago

@cboettig getResults(query, model, formatName) has been added. The librdf_query_results_to_string2() function doesn't appear to honor the mimeType argument, as format always takes precedence and cannot be specified as NULL. The mimeType argument is another way to specify the output format. To avoid confusion, only a formatName argument is included for getResults.

Please have a look and let me know if this is usable by rdflib.
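A sketch of how the new call might be used end to end. Only the getResults(query, model, formatName) shape comes from this thread. The constructor arguments and the bundled dc.rdf example file follow the redland package's documented examples as best recalled, so treat them as assumptions, and the "csv" format name is taken from the earlier workaround rather than from a confirmed list of supported formats.

```r
library(redland)

# Set up an in-memory model and load the example RDF file assumed to ship
# with the redland package.
world   <- new("World")
storage <- new("Storage", world, "hashes", name = "",
               options = "hash-type='memory'")
model   <- new("Model", world = world, storage, options = "")
parser  <- new("Parser", world)
parseFileIntoModel(parser, world,
                   system.file("extdata", "dc.rdf", package = "redland"),
                   model)

queryString <- paste(
  "PREFIX dc: <http://purl.org/dc/elements/1.1/>",
  "SELECT ?a ?c WHERE { ?a dc:creator ?c . }"
)
query <- new("Query", world, queryString, base_uri = NULL,
             query_language = "sparql", query_uri = NULL)

# Ask for all results at once as a single CSV string, then parse in base R.
csv_text <- getResults(query, model, "csv")
results  <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Release the C-level objects explicitly, as discussed earlier in the thread.
freeQuery(query)
freeParser(parser)
freeModel(model)
freeStorage(storage)
freeWorld(world)
```

Parsing with base read.csv() keeps the sketch free of extra dependencies. readr::read_csv() could be swapped in where speed matters.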

gothub commented 6 years ago

@cboettig I have a patch release of redland almost ready (just have to fix a problem on Windows) and would like to know if this function provides what you were looking for before I send this off to the good folks at CRAN.

cboettig commented 6 years ago

Yup, this is good!

I realized of course that the librdf_* functions are already exported, so there's no need to use the :::, which is nice since I've found the S4 calls can introduce a significant overhead when called repeatedly. But in this case, I think the new getResults function is perfect. Closing this out.