Allow addressing of segments of files as Works

azaroth42 commented 9 years ago

Although deferred at the F2F in Portland, there are use cases that require the addressing of segments of files as Works. Some examples include:

An audio CD ripped to a single file, with offsets for the tracks or other segmentation.
A digitized image that depicts two pages in a spread, with metadata about the individual pages.
The division of a video of a musical performance based on different bands playing at different times.

This requires multiple Works to be associated with parts of the same File. To use the book case, there would be one Work for each page, and each page would need to refer to the segment of the File that depicts it. If there are URIs for the Files, then fragment URIs could be used to refer to the areas.
The larger question is about the use of hasFile ... which of the two pages has the file? They can't both have it as that would break the files-associated-with-exactly-one-work rule.
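
A rough sketch of the fragment URI idea for the two-page spread. The file URI and pixel dimensions are made up for illustration, and the `xywh` syntax is borrowed from W3C Media Fragments as one possible choice, not something agreed in this thread:

```python
def region_uri(file_uri: str, x: int, y: int, w: int, h: int) -> str:
    """Return a URI addressing a rectangular pixel region of an image file."""
    return f"{file_uri}#xywh={x},{y},{w},{h}"


# Hypothetical 4000x3000 pixel scan of a spread, one page per half.
spread = "http://example.org/files/spread1"
left_page = region_uri(spread, 0, 0, 2000, 3000)
right_page = region_uri(spread, 2000, 0, 2000, 3000)
```

Each page Work could then point at its own region URI, which leaves the hasFile question above open.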

escowles commented 9 years ago

I haven't seen an example of how this works, but I have been assuming that there would be a Work that contained the content File, and then child Work records that referred to it, with some kind of metadata to express the offset or portion of the parent Work's file. This metadata would likely be format-specific (byte offset, size/offset geometry for images, time range for audio/video, a filename for a file in a tar/zip archive, etc.).
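
To make the format-specific metadata idea concrete, here is a minimal sketch of what such segment descriptors might look like. All class and field names are hypothetical and not part of any Hydra model:

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ByteRange:
    offset: int  # bytes from the start of the file
    length: int  # number of bytes in the segment


@dataclass(frozen=True)
class ImageRegion:
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class TimeRange:
    start: float  # seconds
    end: float    # seconds


@dataclass(frozen=True)
class ArchiveMember:
    path: str  # filename inside a tar/zip archive


Segment = Union[ByteRange, ImageRegion, TimeRange, ArchiveMember]


@dataclass(frozen=True)
class ChildWork:
    title: str
    parent_work: str  # identifier of the Work that holds the File
    segment: Segment  # which portion of the parent's File this Work covers


# Hypothetical track that covers the first 3.5 minutes of a ripped CD.
track1 = ChildWork("Track 1", "work:cd", TimeRange(0.0, 210.0))
```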

jlhardes commented 9 years ago

This example would have to use the start time and end time information at the Work level and apply it to the single file link somehow to establish where in the file to start and stop the audio segment. Here's the original track listing to see the hierarchy being described.

https://gist.github.com/jlhardes/cd21e035ff97e96481b2

escowles commented 9 years ago

I think the segment needs to link to the parent work or file. Because there can be multiple works linking to a work, we won't always be able to tell what the "parent" work is. Since there are multiple files, it may make sense to link to each one. I'm not sure exactly how that works, but we may need a new property called something like "segmentOf".

I also think the ordering proxy linking should be more like this:

https://gist.github.com/escowles/e161f91fad6b0f0361cc

I haven't seen URLs like "http://oreproxy.org/r?what=http://purl.dlib.indiana.edu/iudl/media/avalon:3333_w1&where=http://purl.dlib.indiana.edu/iudl/media/avalon:3333" before -- are they a part of ORE? I don't think they are the right URIs to use for the proxies though. I think the proxies should just have blank node ids.

My understanding is the parent Work should link to the first/last proxies, and the proxies should link to the other proxies instead of the Work instances. So you can start at the Work, get a link to the proxies, and then follow the chain and get links to the previous/next proxy and to the resource it's a proxy for.
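
A minimal sketch of that traversal, assuming the graph is held as plain dictionaries keyed by node id. The node ids and the in-memory layout are illustrative only; the property names follow the ORE/IANA terms used in the gists:

```python
def ordered_members(work_id, graph):
    """Follow iana:first, then iana:next, collecting each proxy's target."""
    members = []
    seen = set()  # guards against a malformed chain that loops
    proxy_id = graph[work_id].get("iana:first")
    while proxy_id is not None and proxy_id not in seen:
        seen.add(proxy_id)
        proxy = graph[proxy_id]
        members.append(proxy["ore:proxyFor"])
        proxy_id = proxy.get("iana:next")
    return members


# Hypothetical Work with three tracks; proxies use blank node ids.
graph = {
    "work": {"iana:first": "_:p1", "iana:last": "_:p3"},
    "_:p1": {"ore:proxyFor": "track1", "ore:proxyIn": "work", "iana:next": "_:p2"},
    "_:p2": {"ore:proxyFor": "track2", "ore:proxyIn": "work",
             "iana:prev": "_:p1", "iana:next": "_:p3"},
    "_:p3": {"ore:proxyFor": "track3", "ore:proxyIn": "work", "iana:prev": "_:p2"},
}
tracks = ordered_members("work", graph)
```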

jlhardes commented 9 years ago

The ore:Proxy and iana: parts are not something I'm sure about at all so if it works to call them with blank node ids, that's great. I think I was going off of the examples too closely from http://www.openarchives.org/ore/1.0/rdfa-examples/rdfa-3-8.xhtml.

azaroth42 commented 9 years ago

FWIW, oreproxy.org is a service run by OCLC that tracks the use of proxies, records the link to the aggregation and redirects the browser to the aggregated resource. There's no need to use it, and in Fedora4 it would certainly be a pain as the URI is outside of the repository.

jlhardes commented 9 years ago

I expanded this example a little to include more of the original track listing. I also validated the RDF using the W3C service, so what is linked here validates, if that is helpful.

https://gist.github.com/jlhardes/fd8bbbfc51975a1c3722

I think I'm unclear about the "segmentOf" type of property. Will hydra:hasMember and ore:proxyIn not be enough to establish how works relate to each other or what the overall parent work is? Regardless of single or multiple files, could the selected start time and end time be passed to the nearest available file at or above, or could the URIs for each GenericWork make use of the next available start time and end time to establish the segment to play in the nearest file? I may be getting caught up in the XML style of this and not taking into account what would be needed to get the start/end time info from a low-level work back up through the Works as they would be defined in Fedora to where the File is actually located (I'm thinking Fedora 4). Is that why something like "segmentOf" might be needed?

I'm also not sure how the proxy info would be included as part of the object in Fedora - it seems to be separate but equal to the topmost GenericWork. Can it sit with that top level object and be applied down through the Works hierarchy no matter what level of the object is called up?

escowles commented 9 years ago

I think the link to the parent is very clear when there's only one parent. And if you only have a single parent then you may not need segmentOf links. But if the track was also included in another work (like a compilation of some kind), then you would need to know whether it was the compilation or the original work that had the file attached.
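
To illustrate the compilation case, here is a small sketch. "segmentOf" is only the property name floated earlier in this thread, and the dictionary layout and identifiers are made up:

```python
def file_for(work_id, works):
    """Follow segmentOf links until a Work with an attached file is found."""
    seen = set()  # guards against accidental cycles
    while work_id not in seen:
        seen.add(work_id)
        work = works[work_id]
        if "hasFile" in work:
            return work["hasFile"]
        if "segmentOf" not in work:
            return None
        work_id = work["segmentOf"]
    return None


# track1 is a member of both the album and a compilation, but only the
# album has the file, so hasMember alone cannot say where to look.
works = {
    "album": {"hasFile": "album.wav", "hasMember": ["track1"]},
    "compilation": {"hasMember": ["track1"]},
    "track1": {"segmentOf": "album", "start": 0.0, "end": 215.0},
}
source_file = file_for("track1", works)
```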

You are also using a naming scheme where you can see from the track ID that it's a part of a larger work and get the work ID from it. In Fedora, you could make the tracks child nodes of the parent Work. That's the way we've done things at UCSD too. But the model definitely treats components/tracks/etc. as their own Works, so it supports giving the tracks their own identifiers instead of using naming conventions.

For the ordering, I think each level of Work needs to have its own ordering info that applies just to its members. So if you have a hierarchy of Box > Disc > Composition > Track, then the Box would contain ordering info about the Discs, each Disc would have ordering info for its Compositions, etc.
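
A sketch of per-level ordering, assuming each Work keeps an ordered list of only its direct members. The identifiers and the dictionary layout are hypothetical:

```python
def playback_order(work_id, order):
    """Depth-first walk that uses each Work's own ordering of its members."""
    children = order.get(work_id, [])
    if not children:
        return [work_id]  # a leaf, e.g. a Track
    result = []
    for child in children:
        result.extend(playback_order(child, order))
    return result


# Box > Disc > Composition > Track, each level ordering only its own members.
order = {
    "box": ["disc1", "disc2"],
    "disc1": ["comp1"],
    "disc2": ["comp2"],
    "comp1": ["track1", "track2"],
    "comp2": ["track3"],
}
all_tracks = playback_order("box", order)
```

Because no level knows about its grandchildren, a Disc can be reordered or reused elsewhere without touching the Box's ordering info.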

jlhardes commented 9 years ago

Yeah, my naming scheme is helping my human eyes keep track of what is part of what in this example.

Your explanation helps with segmentOf, especially thinking about these Works within Fedora. I think to be consistent it makes sense that if a work doesn't have a file and doesn't have a member (a child work), there should be a reference to the parent/ancestor work that does have the file.

I can try that with this example. It almost feels like the proxy references should play into this connection but then it wouldn't be part of the hydra namespace. I might look at something in iana like "current" or "section" to see if that can help.

I'm also still not sure about how the proxy info fits with what is stored in a work in Fedora. Proxy/ordering info is optional so it doesn't always exist. Is it a datastream on the Work object? The iana:first and last references seem to mess that up since they are within the Work description, but maybe it's fine for that to be within the Work description as long as the proxy info exists as well?

flyingzumwalt commented 9 years ago

Please finish this sentence: "This ticket is ready to be closed when..." Maybe this information should be preserved somewhere?

escowles commented 9 years ago

This ticket is ready to be closed when there is a spec for referring to a section of a file. I've added this to the agenda of the next Metadata Working Group meeting (5/6): https://wiki.duraspace.org/display/hydra/Copy+of+Metadata+Call+2015-05-06

mjgiarlo commented 9 years ago

@escowles has this been addressed yet?

escowles commented 9 years ago

@mjgiarlo: no, but it's been added to the todo list of the structural metadata subgroup of the Hydra Metadata Working Group, so I think we'll be talking about that shortly.

azaroth42 commented 9 years ago

Can it reuse the OA Selector approach? http://www.w3.org/TR/annotation-model/#selectors

mjgiarlo commented 9 years ago

@escowles :ok_hand:

escowles commented 9 years ago

@azaroth42 The OA Selectors look good -- is there a time-based selector? That seems like one of the most common uses, but I don't see more than a passing mention of it in that doc.

azaroth42 commented 9 years ago

Timed segments use the W3C media fragment approach as a fragment selector, or just on the URI of the resource directly if there are no other assertions to be made. See: http://www.w3.org/TR/media-frags/#naming-time

Or, as a Selector, written out in full:

  {
    "@id": "http://example.org/sptarget1",
    "@type": "oa:SpecificResource",
    "source": "http://example.org/thing-with-segment",
    "selector": {
      "@id": "http://example.org/selector1",
      "@type": "oa:FragmentSelector",
      "value": "t=10,20",
      "conformsTo": "http://www.w3.org/TR/media-frags/"
    }
  }

Which is equivalent to http://example.org/thing-with-segment#t=10,20 but lets you make more assertions about the different components.
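
A small sketch of that equivalence, plus reading the time range back out of the URI. The function names are made up, and the parser only handles plain seconds (optionally prefixed with "npt:"), not the hh:mm:ss forms that the media fragments spec also allows:

```python
from urllib.parse import parse_qs, urlsplit


def to_fragment_uri(specific_resource):
    """Collapse an oa:SpecificResource with a FragmentSelector into one URI."""
    selector = specific_resource["selector"]
    if selector.get("@type") != "oa:FragmentSelector":
        raise ValueError("only oa:FragmentSelector is handled in this sketch")
    return f'{specific_resource["source"]}#{selector["value"]}'


def parse_time_fragment(uri):
    """Return (start, end) in seconds from a #t=start,end fragment, or None."""
    values = parse_qs(urlsplit(uri).fragment).get("t")
    if not values:
        return None
    value = values[-1]
    if value.startswith("npt:"):
        value = value[len("npt:"):]
    start_text, _, end_text = value.partition(",")
    start = float(start_text) if start_text else 0.0
    end = float(end_text) if end_text else None  # None means "to the end"
    return start, end


target = {
    "@id": "http://example.org/sptarget1",
    "@type": "oa:SpecificResource",
    "source": "http://example.org/thing-with-segment",
    "selector": {
        "@id": "http://example.org/selector1",
        "@type": "oa:FragmentSelector",
        "value": "t=10,20",
        "conformsTo": "http://www.w3.org/TR/media-frags/",
    },
}
uri = to_fragment_uri(target)
time_range = parse_time_fragment(uri)
```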

jlhardes commented 9 years ago

+1 to W3C media fragments for time-based selectors from the Avalon folks! OA Selectors sound like a good approach to investigate for file segments.

azaroth42 commented 9 years ago

Feel free to tag me in on any discussions in this space :)

escowles commented 9 years ago

I think the W3C media fragments and OA Selectors look very promising. I've updated the Hydra structural metadata group wiki page to include links to those instead of making people read this whole discussion.

barmintor commented 8 years ago

"instead of making people read this whole discussion." <- at the end of the discussion

escowles commented 8 years ago

Closing this issue to avoid frustrating more people. This work is now being done by the Hydra Metadata Interest Group Segment of a File/Structural Metadata Working Group.

samvera / hydra-works

Allow addressing of segments of files as Works #34