ruby-rdf / rdf

RDF.rb is a pure-Ruby library for working with Resource Description Framework (RDF) data.
http://rubygems.org/gems/rdf
The Unlicense

RDF List w/ URI subject #145

Closed no-reply closed 10 years ago

no-reply commented 10 years ago

I've run into some bad (or maybe just confusing?) behavior on RDF::List.

list = RDF::List.new(RDF::URI('http://example.org/blah'))
list.subject 
#=> #<RDF::URI:0x3fce9fc176c4 URI:http://example.org/blah>
list << 'blah'
list.subject 
#=> #<RDF::Node:0x3f8d00c89408(_:g69874836083720)>

As far as I can tell, this means that all rdf:Lists are identified by BNodes. Since the initializer accepts a subject, I'm guessing this isn't the intended behavior.

I have a potential patch here, but since this is my first contribution, I wanted to check that I wasn't missing something before submitting a pull request.

gkellogg commented 10 years ago

The RDF vocabulary does not put any restrictions on the subject nodes of a list (collection), but for a list to be valid, all of its nodes must be BNodes, and predicates must be rdf:first/rdf:rest. The interface does allow you to specify any RDF::Resource (URI or Node) as the subject, and you can point this at a list composed of IRI nodes, but the result won't be valid.

In this case, it seems you're using the subject argument to RDF::List#initialize to identify an empty (non-conformant) list, and would like to be sure that subject remains when you add an element. Firstly, an empty list is identified by rdf:nil, so it doesn't make a lot of sense to call an IRI an empty list. With the existing (and proposed) logic, inserting a node, other than when the list is empty but subject is an IRI, would yield a list identified by the BNode of its first element. I don't see why this case is really exceptional.

We could contemplate a change which constrained the subject to be an RDF::Node, and did some sanity checking on this.

Is there some particular use case you have in mind?

(BTW, I really do appreciate the pull request along with unit test.)

no-reply commented 10 years ago

I'm contemplating two potential use cases, neither of which I'm really sure I will pursue:

(1) A "collection" of resources (in my case, digital archival objects) that is really no more than an ordered set, but would be a first-class resource in need of a permanent identifier. (2) A "compound object". For example, a digitized book, made up of scans of individual pages.

ex:compound_resource dc:title "A set of resources" ;
    rdf:first ex:resource1 ;
    rdf:rest _:bnode1 .
...

In either case, I could change the model to something more like below. For (2), this may be the best way; I could better describe the relationship between the book and its pages with the added statement/node, but there are other considerations that have me considering modeling "compounds" as subclasses of rdf:List. For (1) it seems like cruft.

ex:collection dc:title "A set of resources" ;
    ex:members _:list .
_:list rdf:first ex:resource1 ;
    rdf:rest _:bnode1 .

I've had a hard time finding solid information about rdf:Lists in actual use, but I can't seem to turn up any reason in http://www.w3.org/1999/02/22-rdf-syntax-ns or in any of the specs to restrict list nodes to BNodes. If there's a reason for the limitation, I would be up for writing in stricter constraints and documenting them if someone can fill me in on what I'm missing. If not, I gather this is a bigger rethink of the RDF::List code than is reflected in my diff. I figured it might be and would be happy to dig deeper into it.

no-reply commented 10 years ago

@gkellogg says:

lists in Turtle look like (:a :b :c), which turns into a series of rdf:first/rdf:rest terminated with rdf:nil. The subject for each of these is a BNode. RDF certainly allows IRIs to be used instead of BNodes, but no serialization format supports this first-class, and RDF.rb doesn't consider these to be valid. (#valid? returns false).
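As a rough illustration of the expansion described in the quote above, here is a minimal plain-Ruby sketch. It deliberately does not use RDF.rb; `expand_collection`, `RDF_SYNTAX_NS`, and the `_:bN` labels are names invented for this example.

```ruby
# Toy expansion of a Turtle collection such as (:a :b :c) into
# rdf:first/rdf:rest triples. Plain Ruby only; this is not RDF.rb's API.
RDF_SYNTAX_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

# Hypothetical helper: returns [head, triples] for the given items.
def expand_collection(items)
  # An empty collection is just rdf:nil, with no triples of its own.
  return ["#{RDF_SYNTAX_NS}nil", []] if items.empty?

  # One blank node label per list cell: _:b0, _:b1, ...
  cells = items.each_index.map { |i| "_:b#{i}" }
  triples = []
  items.each_with_index do |item, i|
    rest = cells[i + 1] || "#{RDF_SYNTAX_NS}nil" # the last cell points at rdf:nil
    triples << [cells[i], "#{RDF_SYNTAX_NS}first", item]
    triples << [cells[i], "#{RDF_SYNTAX_NS}rest", rest]
  end
  [cells.first, triples]
end

head, triples = expand_collection(%w[:a :b :c])
puts "head: #{head}"
triples.each { |t| puts t.join(' ') }
```

Every subject in the output is a blank node label, which is the shape that Turtle's `( ... )` syntax produces and that the thread treats as a valid list.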

I guess this leaves me with two questions.

First, and less significant to the pull request: are my use-cases ill-advised? I'm really just grappling with lists after avoiding them for years and I'm interested in being talked into a better pattern, if it exists.

Second: If I submit a pull request with full support for RDF::Resources as nodes in lists, would that be valuable? Alternatively, would a pull request more explicitly restricting list nodes to RDF::Node be preferred?

gkellogg commented 10 years ago

There are alternatives to RDF Collections (lists) that sometimes make more sense, particularly from a querying perspective. Lists aren't really first-class RDF things, but they're pretty close and obviously widely supported.

You might want to take a look at a vocabulary solution such as the ordered list ontology.

I really don't see what the advantages of allowing non-BNode nodes in a list are. I certainly would consider something that required that the list head be an RDF::Node, however.

no-reply commented 10 years ago

Okay, I think this conversation has convinced me to look for another approach.

I'll play around with the RDF::Node requirement and perhaps push that back up later. For now, I'll go ahead and close this.

Thanks for the input.

nyarly commented 10 years ago

I haven't come close to an implementation, but I've been considering rdf:List for doing LD pagination: 10 items in this response, with the last rdf:rest being a URL for the next 10 or something. OLO seems like it would accommodate that use case much better without the baggage associated with rdf:List, though.

gkellogg commented 10 years ago

Large sequences are inefficient with Lists, due to the linked-list representation. If you need to index into the data, then an OLO makes sense, at least for relatively large collections.
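To make the cost concrete, the sketch below models list cells as a plain Ruby hash and counts how many rdf:rest links have to be followed to reach a given position. `build_cells` and `list_at` are hypothetical helpers for this illustration, not RDF.rb methods, and a real store would also pay for one lookup per hop.

```ruby
# Toy model of a linked list: each cell label maps to its value (rdf:first)
# and the label of the next cell (rdf:rest). :nil stands in for rdf:nil.
def build_cells(values)
  cells = {}
  values.each_with_index do |value, i|
    nxt = i + 1 < values.length ? "_:b#{i + 1}" : :nil
    cells["_:b#{i}"] = { first: value, rest: nxt }
  end
  cells
end

# Returns [value, hops]: the item at `index` and the number of
# rdf:rest links followed to get there.
def list_at(cells, head, index)
  node = head
  hops = 0
  index.times do
    node = cells.fetch(node)[:rest]
    raise IndexError, "index #{index} out of range" if node == :nil
    hops += 1
  end
  [cells.fetch(node)[:first], hops]
end

cells = build_cells((1..1000).to_a)
value, hops = list_at(cells, '_:b0', 999)
puts "value=#{value} hops=#{hops}" # prints "value=1000 hops=999"
```

Reaching the last of 1,000 items takes 999 hops, so access time grows linearly with position; a vocabulary that records an explicit index on each slot avoids the walk.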

Trying to use IRIs as nodes within a list is legal, but you'll run into a lot of problems trying to serialize this, and I don't think it's really worth it.

RDF really needs a first-class collection mechanism; perhaps in a 2.0 variant, if that ever comes about.

nyarly commented 10 years ago

I've run into the problems trying to serialize a List with named nodes, and gone through the whole Wagnerian tragedy of "what! why?! oh."

I'm glad to hear that I'm not the only one who finds them tricky.

gkellogg commented 10 years ago

That said, I'm disinclined to change the API to require the subject to be a BNode rather than a Resource. At some point, we may introduce skolemization, which would turn BNodes into IRIs in some representations. They may not be valid, but they need to be representable.

no-reply commented 10 years ago

I suspect the current implementation around line 290 will cause a problem for likely skolemization implementations. It's an unexpected change to the subject URI. My patch might not fix it in the best way, but I think the pattern should be to store the subject passed to the initializer and only generate a new BNode if it doesn't already exist.
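The pattern proposed here (keep whatever subject was passed in, and mint a blank node only when none was given) can be sketched with a small stand-alone class. `SketchList` is hypothetical and is not the RDF::List implementation or the patch under discussion; triples are plain arrays and `:nil` stands in for rdf:nil.

```ruby
# Toy list that never replaces a subject supplied by the caller.
class SketchList
  attr_reader :subject, :triples

  def initialize(subject = nil)
    @subject = subject # may be nil until the first append
    @triples = []
    @tail = nil
    @counter = 0
  end

  # Appends a value and returns self so calls can be chained.
  def <<(value)
    if @tail.nil?
      # First element: reuse the supplied subject, or mint one if absent.
      @subject ||= new_bnode
      cell = @subject
    else
      # Later elements: add a fresh cell and repoint the old tail at it.
      cell = new_bnode
      @triples.delete([@tail, :rest, :nil])
      @triples << [@tail, :rest, cell]
    end
    @triples << [cell, :first, value]
    @triples << [cell, :rest, :nil]
    @tail = cell
    self
  end

  private

  def new_bnode
    @counter += 1
    "_:g#{@counter}"
  end
end

list = SketchList.new('http://example.org/blah')
list << 'blah'
puts list.subject # still the URI that was passed in
```

With this shape, the subject changes only from "unset" to a generated label, never from one identifier to another, which is the property a skolemizing caller would presumably rely on.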

@nyarly: I guess I'm realizing why I've avoided lists for so long... I have colleagues who are using them extensively in existing data, though, so I may end up needing to implement something for community reasons.

gkellogg commented 10 years ago

The thing is, if the list is empty, then that's inconsistent with having a subject. The purpose of the subject when creating an RDF::List is to connect it with a pre-existing linked list within a graph. If no graph is provided, or the graph doesn't contain a subject with rdf:first/rest predicates, then that could indicate an error condition.

I might suggest the following changes:

Thoughts?

no-reply commented 10 years ago

I think your pattern is okay. The only downside I see is that it makes it hard to initialize a list before grabbing data. If we go with the changes you suggest, I would suggest a mechanism to change the subject on the graph (#subject= ?). That way you could do:

list = RDF::List.new
list << '1'
list << '2'
list.subject = RDF::URI('http://example.org/positive_integers_less_than_3')

The only other point I would make is that the current implementation allows passing a subject to a new list, and #subject returns it until the list is non-empty. If we feel strongly about not storing a subject for empty lists, that's something we should fix. If not, we could always implement my PR, plus:

def subject
  return RDF.nil if empty? # or just 'return nil if empty?' ?
  @subject
end

nyarly commented 10 years ago

Ultimately, the challenge here is that RDF::List is tightly coupled to the underlying graph: the items in the list are stored immediately in the underlying repository. So it can be tricky (and counter-intuitive) to manipulate lists and have the graph remain valid throughout the operation. Maybe a method that yielded a ListManipulator or something?


gkellogg commented 10 years ago

I think your pattern is okay. The only downside I see is that it makes it hard to initialize a list before grabbing data. If we go with the changes you suggest, I would suggest a mechanism to change the subject on the graph (#subject= ?).

The initializer accepts an array of values, which seems to be adequate to me, but it sets the subject first, so you'd still be left in the same boat. This could be addressed if the subject was (re-)set after the values were pushed. It also accepts a block, and if the subject is set after the block is called, it would also work. For example:

list = RDF::List.new(RDF::URI("foo"), graph, %w(a b c d))

Would then give you a list denoted by "foo"; although that list will not be valid, and will not serialize natively in any existing RDF format, unless the subject is a BNode.

The only other point I would make is that the current implementation allows passing a subject to a new list, and #subject returns it until the list is non-empty. If we feel strongly about not storing a subject for empty lists, that's something we should fix

Yes, if a subject is passed and at the end of initialization the list is empty, that would raise an error.

gkellogg commented 10 years ago

Actually, on second thought, I'm just going to stick with its being an argument error if the subject does not denote a list in graph. This is really what the pattern is intended for, rather than being used to set the subject of the first node in a list. Otherwise, it involves initializing the list, using subject, and then doing a special case if the list was empty before or after values or block initialization. This just seems messy to me. If you really want a list to be initialized with a subject, then create the subject with first/rest before using List.new. This may seem less ideal, but it makes the code much cleaner, and sticks to the intent of the interface.
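A minimal sketch of that rule, in plain Ruby rather than RDF.rb, might look like the following. `list_head?` and `attach_list` are hypothetical names, the graph is modelled as an array of triples, and treating `:nil` (standing in for rdf:nil) as a valid empty list is an assumption of this example rather than something settled in the thread.

```ruby
# True if `subject` has both a first and a rest statement in `triples`.
def list_head?(triples, subject)
  return true if subject == :nil # assumption: rdf:nil denotes the empty list

  has_first = triples.any? { |s, p, _o| s == subject && p == :first }
  has_rest  = triples.any? { |s, p, _o| s == subject && p == :rest }
  has_first && has_rest
end

# Returns the subject if it denotes a list, otherwise raises ArgumentError.
def attach_list(triples, subject)
  unless list_head?(triples, subject)
    raise ArgumentError, "#{subject} does not denote a list in the graph"
  end
  subject
end

# Create the first/rest statements first, then attach to the subject.
graph = [
  ['http://example.org/list', :first, 'a'],
  ['http://example.org/list', :rest, :nil]
]
puts attach_list(graph, 'http://example.org/list')

begin
  attach_list(graph, 'http://example.org/missing')
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

The caller is responsible for writing the first cell before attaching, which keeps the constructor free of the empty-list special cases described above.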

I'll implement it in a feature branch and attach shortly.

gkellogg commented 10 years ago

See issue #147.