joshsh / sesametools

A collection of utilities for use with OpenRDF Sesame (as of recently, Eclipse RDF4J)
Other
50 stars 11 forks source link

remove "leading statement" constraint from RdfListUtil methods #21

Closed joshsh closed 12 years ago

joshsh commented 12 years ago

RdfListUtil.getList and RdfListUtil.addList could be useful in situations where an RDF List is known (only) by the URI of its head, but currently they expect a "leading statement" pointing to the list. For example, it would be nice to support patterns like the following:

Graph graph = ...
List<Value> values = ...

URI myList = vf.createURI("http://example.org/ns#mylist");
RdfListUtil.addList(myList, values, graph);

Would it make RdfListUtil less useful for current applications if we were to take out the leading-statement logic, changing the method signatures to the following:

public static void addList(Resource head, List<Value> values, Graph graph, Resource... contexts) and public static List<Value> getList(Resource head, Graph graph, Resource... contexts)

Note that I am also suggesting a varargs "context" parameter in getList.

@ansell, please let me know what you think.

ansell commented 12 years ago

My use case initially was to add the list into the graph using a subject and predicate as the attachment point to the rest of the graph. Hence the Subject-Predicate lead in.

I don't mind adding separate list creation methods. It would be easy to put the actual implementation of the list creation in that code, and reduce the amount of code in the lead-in version. It would be very useful to keep the leadin methods to avoid having to manually attach the head in cases where you do want the list attached to a particular Subject-Predicate combo.

I agree that the contexts should be varargs in all cases. Not sure why I initially set one to varargs but not the other, but I think I had some reason for doing it at the time.

joshsh commented 12 years ago

Ah, yes, extra methods would give us the best of both worlds. Perhaps rename the current methods getListObject / addListObject or such, and let them call the new, slightly simpler methods internally.

ansell commented 12 years ago

Okay, it will be easy to make those changes. I should get to it this afternoon. I will push the changes to the develop branch on my repository.

ansell commented 12 years ago

I looked at it again and realised there are still a few more cases that we might want to allow for.

What do you think about the following methods in addition to those discussed above:

public static Collection<List<Value>> getLists(Resource subject, URI predicate, Graph graph, Resource... contexts)
public static Collection<List<Value>> getLists(Collection<Resource> heads, Graph graph, Resource... contexts)

So far, getList has been limited to getting a single list, but it doesn't need to be like that in either of the lead-in or free-headed use cases. It you don't want to accept a forked list for validation reasons etc., then getList could accommodate that case, but getLists could allow for forked cases easily.

We could implement all 6 of these cases using the "Collection<List> getLists(Collection heads..." method as the base implementation.

I still don't see any case for cyclic lists, but if there are use cases where cyclic lists make sense then we could explore an implementation that accepts them.

ansell commented 12 years ago

There is an implementation and some tests in the develop branch of my repository. I would send a pull request but you do not have a copy of that branch in your repository yet.

I haven't implemented forking, but it could be integrated into all of the getLists methods in future as they all share a common helper method.

In addition, you may notice that none of the current tests use the contexts parameter, so more tests will be needed in future but for now the basic implementation is working based on the null context tests.

joshsh commented 12 years ago

Thanks. Very nice. I'm about to push some light changes (just formatting and documentation), but I think another round of more significant changes may be in order. Specifically:

Btw. I see your code comments about distinguishing between "no context" and null context. I know that as far as Sesame's Sail implementations go, an empty context array indicates "all contexts" for a query and "null context" for a write. A null in a context array indicates "null context" for both queries and writes. Does that answer the question?

Btw. I suggest that we permit both forking (in the backbone of the list) and multiple rdf:first values.

joshsh commented 12 years ago

W.r.t. your previous comments:

ansell commented 12 years ago

I reimplemented the core code again today to allow for forked lists.

Currently the RdfListUtil class is completely stateless. There would be no difference between having an object that takes options in its constructor, and having a context object that you can pass to static methods to indicate the options to use for the current call.

I am not sure what the semantics of having multiple rdf:first values in an ordered list are. It should be possible to reduce the checks in the second half of getLists to allow for multiple rdf:first values, but it should be optional so that people who don't expect these lists can get errors if they are encountered.

If you need an unordered list, you could try RDF Bag instead of RDF List. An RdfBagUtil class could also be useful in addition to this class.

Cycle detection is very necessary to avoid accidental or intentional Denial of Service situations, as the current non-recursive implementation will go on forever if cycles are not detected. I temporarily removed cycle detection while I was reimplementing the code to work with forked lists and all of the non-cyclic tests passed. I put the cycle detection code back in at the appropriate places, but it could be turned off easily with preferences if ArrayList.contains(object) is too slow on very large lists.

There are 6 different methods available right now:

All of the getList* methods rely on getLists for their core functionality. getListAtNode uses getListsAtNode as an intermediary as well.

The only difference between them is that the getList and getListAtNode methods have the following RuntimeException if they are used with forked lists:

throw new RuntimeException("Found more than one list, possibly due to forking");

To avoid that exception, people need to use getLists or getListsAtNode.

ansell commented 12 years ago

I reimplemented it again. If you want to review it, that would be great.

The cycle detection isn't currently properly implemented.

I reformatted it using the eclipse standard formatting style. I do not personally use the brace beside style on my projects, as I don't find it intuitive, so I am not sure if it is what you expect.

I haven't changed addList or switched from static methods to non-static, but I don't mind the idea in principle as long as a non-static version is still threadsafe and quick to instantiate.

It fits my current purposes so I am going to work on integrating it into some of my code in its current form. The method names shouldn't need to change, and the signatures shouldn't need many changes so it won't be hard to migrate in future.

joshsh commented 12 years ago

I really like the functionality of the current utility, but I believe the code could be simplified. Please see the "simple" implementation of getLists which I have checked in (the multi-headed version is currently commented out). It's very compact, but does what I would expect it to (matching all valid non-cyclic lists with a given head, and silently ignoring anything which doesn't match), and it passes all of your tests apart from those with expected exceptions. Am I missing anything, or is this good enough?

Yes, I think non-static methods would be nice, as they would keep the method signatures simple, allowing user to call preference setters only when necessary/desired. But I will leave that up to you.

Our IDEs will probably continue to fight over formatting :-)

ansell commented 12 years ago

I had a simple version before I looked at making it work for forked lists.

(Did you forget to push your simplified code to Github or are you referring to code from last week?)

joshsh commented 12 years ago

Sorry, pushed to master instead of develop. Try now.

ansell commented 12 years ago

It definitely looks cleaner than my implementation.

I am going to try and break it with some more tests, as I still don't fully understand why a single buffer works in general, but I think the current tests may not be testing it properly. Your recursive implementation definitely looks cleaner, and runs with a lower total memory footprint than my iterative approach. The ever present stack overflow concerns with any recursive implementation may come into play in very large graphs if people use it that way.

Had a small distraction with json-ld as my stanbol snapshot seems to have been updated recently and is no longer compiling with the jsonld module. I haven't fixed it, so I won't send a pull request, but feel free to checkout the "feature/fixjsonld" branch from my repository and merge it into yours without a pull request if you have suggestions for fixing it. In general, I am regretting suggesting that we integrate JSONLD into Sesametools. Everything about it seems hacky, as illustrated by the coerce statement.

ansell commented 12 years ago

I added some more tests to push the boundaries for shallow and deep lists. I understand how the buffer works now, it just relies on the cursor moving up and down the tree in place, and it takes a copy of the buffer when it hits a leaf. I added an optimisation using ArrayList for when it takes a copy, as you know the depth at that point.

1000 is around the actual stack limit for the recursive implementation on my machine, and probably hardcoded into the Sun JVM somewhere. I added in the option of switching to the iterative method for lists longer than 1000, and made up a test for the switch.

I modified the recursive method to add in the relevant exceptions and checks for incomplete lists and the option to throw an error on a cyclic list.

If you want to switch from static to non-static, the three current boolean constants are probably a good start on settings for the non-static implementation. I personally prefer it as it is right now since I am not likely to want the strange behaviour that you get if you turn off either the cyclic or incomplete test checks. If there was a new class that offers the extensibility and less checks you could test each setting in a separate unit test which may be useful.

Feel free to pull off my develop branch as all the tests pass for me (except the json-ld module tests)

joshsh commented 12 years ago

Good. I don't know if I would include "recursive" in the name of the method, as we could certainly replace it, with some amount of effort, with an iterative version which is not subject to the same depth constraint. The only difference API-wise is that it starts with a single head resource, which I find convenient.

We do have a slight difference in philosophy in that I definitely prefer soft failure on incomplete lists: where you see bad lists, I see no list. E.g. a list-like structure which does not end in rdf:nil is not a list, and therefore is simply not matched as a list. I agree that the test for cyclic lists is generally a good idea, although it's possible that the performance gain of turning it off would be worthwhile in some applications.

Btw. sorry for the slow response. It's that time of the semester :-)

ansell commented 12 years ago

The "recursive/iterative" methods are both public right now, but they can be switched to private and everyone can use the getLists method so we don't commit ourselves to a particular implementation.

We could make use of a non-static version that exposes the validation and/or implementation settings as our philosophies differ, although not in a completely incompatible way. It should be a simple refactoring to switch.

ansell commented 12 years ago

See pull request #25 for the non-static conversion. That should close this issue when it is resolved.

joshsh commented 12 years ago

Great. Btw. if we ever have a need for additional preferences/parameters, I imagine we will make the parameters non-final and give them setters.

ansell commented 12 years ago

The only reason for final right now, with no setters, is to ensure threadsafeness if the object is shared between different threads.

From one perspective, instances of RdfListUtil are very cheap to create, with only boolean instance variables, so sharing shouldn't be necessary as new instances can be created virtually instantly.

On the other hand, the class is designed to not use member variables for any storage, so it is guaranteed to be threadsafe without synchronization and should be easy to create once and use many times. However, this only holds as long as the option variables do not introduce random effects with different threads being able to set them at different times. Hence why the options are final with no setter methods currently.

Right now the usecases that I can think of could just as easily be provided by the inexpensive creation of new instances. For example, there may be a long running thorough list process going on in one thread, which could take many seconds, and in that time, if another thread decides it wants to do a quick and dirty process, with no error checking, it could either create a new utility, or it could reuse a current utility. If it reuses the current utility it would need to set the variable first, and that would introduce random effects on the other instance.

Feel free to expand the current set or provide setters for the current attributes if you have a usecase for it.

For what it is worth, I can't visually distinguish between the test times for RdfListUtilTest.testGetListsForkedValidStressDeepAndShallow and RdfListUtilTest.testGetListsForkedValidStressDeepAndShallowNoErrorChecking, they both take around 1.3 seconds for me on my work computer. Given that, I am not too worried about keeping error checking on as long as I only expect to have valid non-cyclic lists.

joshsh commented 12 years ago

That's good to know (re: the performance testing).

I think we could safely assume the setters would be used only at construction time. Various Sesame classes make similar assumptions. For example, that changing the data directory of a Sail which is already in use can be expected to produce unpredictable behavior, and should be avoided. However, since there are only three parameters at present, I think either solution is fine.

We still do disagree about the purpose of error checking... I don't see omission of error checking as a "quick and dirty" process but as a process of finding all solutions to the list pattern for a given list head. The params accommodate that difference of opinion very nicely.