joelkuiper / yesparql

a Yesql inspired SPARQL library
https://github.com/joelkuiper/yesparql
Eclipse Public License 1.0

Suitability for large result sets #1

Closed RickMoynihan closed 9 years ago

RickMoynihan commented 9 years ago

Hi,

Firstly I'm really glad to see you have taken it upon yourself to fill this gap in the Clojure/RDF ecosystem!

I must confess that I've not yet tried it, but I've been skimming the source code; and it seems that the main entry points for queries are all wrapped in with-open's, e.g.

https://github.com/joelkuiper/yesparql/blob/master/src/yesparql/sparql.clj#L65

I can't recall all the details of the Jena APIs, but am I right that this would limit yesparql's usage to datasets which can be loaded into memory?

Does the library provide another method (without having to get too low-level) to control the with-open and consuming of results yourself?

joelkuiper commented 9 years ago

Hey Rick,

Thank you! Hopefully it will be useful to some people other than myself! I must confess that I've only used these functions with relatively small data sets (OpenAnnotations) which tend to fit in memory.

Those specific with-open's close the QueryExecution. It might make sense to defer the closing at some point, but I'm not sure how often. You'll probably want an instance of ResultSetStream somehow. This probably only makes sense for using it against Datasets directly, since I'm not actually sure how well streaming HTTP is implemented for SPARQL endpoints in Jena. I'll take a look if I can add it as an option (I will also add the timeout options to the query).

Note that if you convert the ResultSet to any serialization with one of the result-> methods then the whole thing will need to be in memory anyway (probably twice). But the ResultSetRewindable should still be usable for iterative operations, albeit in memory.

RickMoynihan commented 9 years ago

Hi Joel,

Yes the with-open's will definitely force the closing of the results.
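The effect described here can be reproduced without Jena. The following is a minimal sketch, assuming a `BufferedReader` as a stand-in for a closeable query execution; `open-results`, `leaky-rows`, `eager-rows` and `realise-or-error` are hypothetical names, not part of yesparql or Grafter.

```clojure
(import '(java.io BufferedReader StringReader))

(defn open-results
  "Hypothetical stand-in for a query execution: a closeable source of rows."
  []
  (BufferedReader. (StringReader. "row-1\nrow-2\nrow-3")))

;; The lazy seq escapes with-open: the reader is closed before the
;; remaining rows are read, so realising them later fails.
(defn leaky-rows []
  (with-open [r (open-results)]
    (line-seq r)))

;; Realising inside with-open is safe, but holds every row in memory.
(defn eager-rows []
  (with-open [r (open-results)]
    (doall (line-seq r))))

(defn realise-or-error
  "Fully realises the seq returned by f, or returns :stream-closed on failure."
  [f]
  (try (doall (f))
       (catch Exception _ :stream-closed)))

(realise-or-error leaky-rows)  ; the closed reader throws while realising
(realise-or-error eager-rows)  ; all three rows, read before the close
```

This is the trade-off under discussion: closing inside the library forces the results to be fully realised, while returning something lazy pushes the responsibility for closing onto the caller.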

Grafter currently lets you do the following:

(-> (sparql-repo "http://localhost:3001/sparql/query")
    (query "SELECT * WHERE { ?s ?p ?o }")) ;; => returns a lazy result stream

This style is useful for development at the REPL - as you don't need to worry about with-open - but it does of course leak the resource! :-\

If you care about consuming the results properly (i.e. anywhere but when testing/developing at the REPL) then you would use:

(with-open [conn (->connection (sparql-repo "http://localhost:3001/sparql/query"))]
  (do-something-with-the-results! (query conn "SELECT * WHERE { ?s ?p ?o }")))

Personally I'm not convinced that Grafter's design here is ideal - because by allowing you to ignore the connection at the REPL we encourage people to forget to close the connection in production code! :-( So I'm half minded to prevent the more convenient throw-away usage...

In Grafter our query function lets you consume results from a SPARQL query as clojure data (we currently return a lazy sequence of maps). One thing we don't yet do is type coercion on the results back into Clojure types - i.e. if you get an integer in your results it will be wrapped in the appropriate sesame Literal object. We'd like to change this in the future to ease interop with the rest of our API (as we do RDF type coercion everywhere else!).

Regarding your result->xxx functions - in theory they needn't hold onto all the data in memory to work. To write results to an output format (e.g. CSV) in Grafter without holding onto memory you just do so inside a with-open as above. It might be better to delegate these serialisation functions to another library, such as clojure-csv, which can stream already.
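As an illustration of writing results out one row at a time, here is a minimal sketch using only `clojure.core` and `java.io`. `write-rows!` is a hypothetical helper (not part of Grafter, yesparql or clojure-csv), it does no quoting or escaping, and a lazy seq of maps stands in for real query results.

```clojure
(require '[clojure.string :as str])
(import '(java.io StringWriter Writer))

(defn write-rows!
  "Writes a header line and then each row (a map) as one comma-separated
  line. Hypothetical helper: no quoting or escaping, so not real CSV."
  [^Writer out columns rows]
  (.write out (str (str/join "," (map name columns)) "\n"))
  ;; doseq walks the seq one row at a time and keeps no reference to
  ;; rows it has already written.
  (doseq [row rows]
    (.write out (str (str/join "," (map #(get row %) columns)) "\n"))))

;; Usage, with a lazy seq standing in for the query results:
(def out (StringWriter.))
(write-rows! out [:s :p :o]
             (map (fn [i] {:s (str "s" i) :p "p" :o i}) (range 3)))
```

In real use the writer would be opened on a file or response stream inside the same with-open that owns the query results, and a CSV library would handle the escaping.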

We frequently use this approach to load and stream large amounts of data.

Regarding Jena and HTTP - I don't know whether their HTTP client APIs let you stream properly - but the server certainly does! Sesame's SPARQL client implementation definitely has some problems but it works quite well and can lazily stream results; I'd be surprised if Jena couldn't do the same.

Thanks again for yesparql! It's definitely something we'd find useful, as right now we mostly write our SPARQL queries with str!

joelkuiper commented 9 years ago

Hmm interesting! Thanks for the heads-up. For now I've quickly cobbled together something that will allow you to bypass the yesparql.sparql functions and use your own.

https://github.com/joelkuiper/yesparql/blob/develop/test/yesparql/core_test.clj#L55-L57

Essentially, you pass a query-fn with the signature [data-set statement call-options] (statement is the query and call-options the map with things like bindings). That way you can compose your own executor / query, which could also be useful for doing other ARQ-like things. Of course you can probably re-use some of the functions from the yesparql.sparql namespace!
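The general shape of such a hook can be sketched as follows. This is only an illustration of the pattern, assuming an options map with a `:query-fn` key; `run-query`, `default-query-fn` and `my-query-fn` are hypothetical names and the returned maps are placeholders, so none of this should be read as yesparql's actual API.

```clojure
;; Hypothetical stand-ins; none of these names are yesparql's API.
(defn default-query-fn
  "Placeholder for the library's built-in executor."
  [data-set statement call-options]
  {:executor :default
   :statement statement
   :bindings (:bindings call-options)})

(defn run-query
  "Calls the :query-fn given in options, or default-query-fn when absent."
  [data-set statement call-options options]
  (let [query-fn (get options :query-fn default-query-fn)]
    (query-fn data-set statement call-options)))

(defn my-query-fn
  "A user-supplied executor with the same three-argument signature.
  This is where a caller could manage with-open and streaming themselves."
  [data-set statement call-options]
  {:executor :custom
   :statement statement
   :bindings (:bindings call-options)})

;; Usage: the same call, with and without the override.
(run-query :some-dataset "SELECT * WHERE { ?s ?p ?o }" {:bindings {}} {})
(run-query :some-dataset "SELECT * WHERE { ?s ?p ?o }" {:bindings {}}
           {:query-fn my-query-fn})
```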

I'll see whether I can come up with a more convenient way, and also document this behavior; but for now this could work.

joelkuiper commented 9 years ago

As for the streaming/results processing there might be a better way altogether; I'll read up on some other source code to see if I can re-use some ideas ;-)

RickMoynihan commented 9 years ago

I don't know much about yesql's implementation but it looks like it leans heavily on clojure.java.jdbc to abstract JDBC ResultSets into lazy seqs, whilst letting users manage connections via a with-db-connection macro:

https://github.com/clojure/java.jdbc/blob/master/src/main/clojure/clojure/java/jdbc.clj#L630

This seems to me to be the right way to do it; as connection life-cycle/resource management is really an application concern not a library one.

RickMoynihan commented 9 years ago

I don't really understand what :query-fn does.

If I were to use this presumably I'd need to also parse the query type as you do in statement-handler. Would I then need to put the with-opens inside that function?

In Grafter we used Sesame's parsed queries to do this, which is pretty bullet proof and much more robust than using regexes ( https://github.com/Swirrl/grafter/blob/master/src/rdf-repository/grafter/rdf/repository.clj#L306 )

I've done something similar using Jena's ARQ too so I can probably find some snippets of code to help point you in the right direction here if you'd like... Though it might be hard to integrate with the rest of your code - without parsing the query twice.

I've always found JENA's multiple APIs around this area pretty confusing...

joelkuiper commented 9 years ago

With some additional help from @wagjo the query now returns an AutoCloseable ResultSet that doubles as an Iterator. That way you can consume the QuerySolutions lazily in a with-open block. The downside is that the library user is now in charge of closing the results properly; if they are not closed, resources will leak. I've updated the documentation accordingly.

https://github.com/joelkuiper/yesparql/commit/8e0ae80db18f5559fe160f412e53e1ba466eab0c

The changes are in the develop branch, and I'll release it soon!
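The idea of a closeable result that doubles as an iterator can be sketched in plain Clojure as below. This is not the code from the linked commit: `CloseableRows` and `fake-query` are hypothetical, and a vector of maps stands in for Jena's QuerySolutions.

```clojure
;; Hypothetical illustration; not the type defined in yesparql.
(deftype CloseableRows [^java.util.Iterator it closed]
  java.util.Iterator
  (hasNext [_] (and (not @closed) (.hasNext it)))
  (next [_] (.next it))
  java.lang.AutoCloseable
  (close [_] (reset! closed true)))   ; a real version would close the query execution

(defn fake-query
  "Stand-in for a query call: wraps an in-memory iterator."
  []
  (CloseableRows. (.iterator [{:s 1} {:s 2} {:s 3}]) (atom false)))

;; The caller owns the life-cycle: consume inside with-open, and make
;; sure the consumption is eager (reduce) so nothing lazy escapes.
(def handle (fake-query))
(def total
  (with-open [res handle]
    (reduce + (map :s (iterator-seq res)))))
```

After the with-open block exits, `handle` reports no further elements, which mirrors the point that results must be consumed before the close.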

RickMoynihan commented 9 years ago

This looks to be a lot closer to what I was getting at! :-)

What is the reason for wrapping all of those methods? For example would it not be better to not implement java.util.Iterator and instead implement clojure.lang.Seqable? This would mean in the case where you don't want to manage the resource you can do things like:

(map some-transformation (query ...))

And it will just work... Whereas your current implementation would require you to do something like:

(map some-transformation (iterator-seq (query ...)))

If you want to manage the life-cycle... you can then do:

(with-open [res (query ...)]
    (map some-transformation res))

If you want to provide access to the underlying JENA objects too - then it might be better to simply provide accessors on your CloseableResultSet to the underlying ResultSet and QueryExecution... so if you really need to access the machinery you can e.g.

(let [res (query ...)
      qe (->query-execution res)
      res-set (->result-set res)]
  ...)

There might be good reasons for you not to do this - so it might need some more thought, but it seems something like this would involve a lot less friction. Especially if calling seq on the result of a select query gave you back a seq of hash-maps - whilst on a construct you got a seq of (defrecord Triple [s p o])'s. :-)
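A minimal sketch of this suggestion, assuming a wrapper type that implements `clojure.lang.Seqable` and `java.io.Closeable` and exposes the underlying objects through accessors. `SeqableResult` and `fake-query` are hypothetical, and a keyword and a vector of tuples stand in for Jena's QueryExecution and ResultSet.

```clojure
;; Hypothetical illustration of the proposal; not yesparql's implementation.
(deftype SeqableResult [qe rs closed]
  clojure.lang.Seqable
  ;; seq converts each underlying row into a plain Clojure map
  (seq [_] (seq (map #(zipmap [:s :p :o] %) rs)))
  java.io.Closeable
  (close [_] (reset! closed true)))

;; Accessors to the underlying machinery, as suggested above.
(defn ->query-execution [^SeqableResult res] (.-qe res))
(defn ->result-set [^SeqableResult res] (.-rs res))

(defn fake-query
  "Stand-in for a query call returning a closeable, seqable result."
  []
  (SeqableResult. :fake-query-execution
                  [[:a :knows :b] [:b :knows :c]]
                  (atom false)))

;; Works directly with sequence functions, no iterator-seq needed:
(map :o (fake-query))

;; When managing the life-cycle, consume eagerly inside with-open
;; (mapv here) so that no lazy seq outlives the open resource:
(def subjects
  (with-open [res (fake-query)]
    (mapv :s res)))
```

One point worth noting: a lazy `map` returned from inside with-open would be realised only after the result has been closed, which is why the managed example uses `mapv`.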

Thanks again for entertaining my thoughts - and apologies if I'm pushing in a direction you don't agree with...

joelkuiper commented 9 years ago

So I've started working on your idea :-)

https://github.com/joelkuiper/yesparql/blob/feature/type-conversion/src/yesparql/sparql.clj

Let me know what you think, and I'd love to get some help with this too!

joelkuiper commented 9 years ago

Released as 0.2.0-beta on Clojars, see the README or docstrings on yesparql.sparql for details. Let me know what you think :smile:

RickMoynihan commented 9 years ago

Still not got around to trying this yet - but I issued a PR on the README to encourage more idiomatic usages and not potentially mislead people into unsafe practices like holding onto the head of a potentially large result set.
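For readers unfamiliar with "holding onto the head", here is a small sketch of the difference, using a hypothetical `rows` function that returns a lazy seq of maps in place of a real result set.

```clojure
(defn rows
  "Hypothetical lazy result set of n solution maps."
  [n]
  (map (fn [i] {:s i}) (range n)))

;; Single pass: nothing keeps a reference to the start of the seq, so
;; rows that have been processed can be garbage collected.
(defn sum-subjects [n]
  (reduce (fn [acc row] (+ acc (:s row))) 0 (rows n)))

;; By contrast, binding the seq to a name and traversing it twice keeps
;; the head reachable, so every realised row stays in memory:
;;   (let [rs (rows n)]
;;     [(count rs) (reduce ... rs)])

(sum-subjects 1000)
```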