Object with Files and FileSets?

azaroth42 commented 8 years ago

@no-reply @mjgiarlo @cbeer @jpstroop

In Scenario 5, there is a digital object (a book) with its own files (a PDF and OCR'd text) as well as a set of components (pages) with their own files (TIFF and OCR).

There seem to be several options here:

(1) The book is an Object that hasFile the files, and you dig through tech MD to find out more about them.
(2) The book has files, and is thus a FileSet. That implies that FileSets can have member FileSets.
(3) The book has a "self" FileSet for the PDF and OCR, and component FileSets for the pages. The self FileSet should then be distinguished and not part of the ordered list of page FileSets.
(3b) Further distinguish ComponentFileSet [becomes a Canvas in IIIF] from ObjectFileSet [becomes a rendering of the Manifest in IIIF].

jpstroop commented 8 years ago

I think Hydra::Works breaks this--maybe. In pure PCDM, the Book is a pcdm:Object, with files representing it in its entirety, like a PDF, associated via pcdm:hasFile. Each page is also a pcdm:Object, with associated files representing the page only.
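To make the pure PCDM reading concrete, here is a minimal sketch in plain Python. The class and attribute names (`PcdmObject`, `PcdmFile`, `files`, `members`) are hypothetical stand-ins for pcdm:Object, pcdm:File, pcdm:hasFile and pcdm:hasMember; they are not taken from any Hydra or PCDM library, and ordering of the pages is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PcdmFile:
    """Hypothetical stand-in for a pcdm:File (a binary)."""
    name: str


@dataclass
class PcdmObject:
    """Hypothetical stand-in for a pcdm:Object."""
    label: str
    files: List[PcdmFile] = field(default_factory=list)        # pcdm:hasFile
    members: List["PcdmObject"] = field(default_factory=list)  # pcdm:hasMember


# The book carries the files that represent it in its entirety.
book = PcdmObject("Book", files=[PcdmFile("book.pdf"), PcdmFile("book-ocr.txt")])

# Each page is its own object and carries only page-level files.
for n in (1, 2):
    page = PcdmObject(
        f"Page {n}",
        files=[PcdmFile(f"page{n}.tiff"), PcdmFile(f"page{n}-ocr.txt")],
    )
    book.members.append(page)
```

In this shape there is no FileSet layer at all, which is the part that does not line up directly with Hydra::Works.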

I'm not sure how you do this in Hydra::Works. Maybe there's a FileSet that's part of the work but not part of the order? :-(

mjgiarlo commented 8 years ago

A naïve Hydra::Works-based model (w/o ordering represented...) could look like this: [image: 2016-03-14 14 29 03]

Maybe this is what @jpstroop was :frowning: about!

azaroth42 commented 8 years ago

Right, pure PCDM is option 1... Object with files and more objects.
In Works, a FileSet that's not part of the order but is a member is (3) or (3b). Or the Object also becomes a FileSet, as it has a set of files plus the member FileSets ... which is (2). For the record, I don't like (2).

The version from Mike is (4) ... Introduce another Object to represent the page separate from the FileSet that holds the Files. This is what we decided /against/ at the SD HDC ... no need for the object that becomes the Canvas separate from the FileSet that holds the images (and other content).

If (4) is still on the table, it's (still) my preference.

mjgiarlo commented 8 years ago

@azaroth42 @jpstroop Can you refresh my memory about the downside of (4)? Is it just that the Page-level Work there is not necessary to make the mapping to IIIF (in which case, OK but ¯\_(ツ)_/¯ :question:) or is there more to it than that?

azaroth42 commented 8 years ago

I believe (Jon correct me if I'm misremembering) it's that the Page object would be just another object to maintain a URI for without much value... you can add label and so forth to the FileSet directly, such that it works out at least 95% of the time.

There is some value, but mostly the 5% cases that Shared Canvas deals with:

Aligning multiple images or content in a single frame of reference
Having multiple filesets for multiple digitizations of the same page (2001 vs 2015)
Having multiple filesets for different aspects of the same page (image vs text)
Having page specific information about the image -- e.g. cropping boundaries to remove scanning bed
Pages without digital content but with notes (e.g. back of the photograph has a signature but no image)
Point of reference for external alignment of other filesets (e.g. folio from one institution, miniature from another)

mjgiarlo commented 8 years ago

The main value of the extra PCDM object is that it aligns well with the Hydra::Works model -- we can do this using current codebases without making a single modification (right?). I'm not saying that concern overrides all others, but I'd just toss that in as a valuable thing to keep in mind. (Have I mentioned that I'm a fan of models for which there's already working code? :grinning: )

jpstroop commented 8 years ago

I believe (Jon correct me if I'm misremembering) it's that the Page object would be just another object to maintain a URI for without much value... you can add label and so forth to the FileSet directly, such that it works out at least 95% of the time.

I honestly can't remember why--maybe the restrictions of H::W just make it feel like kludge? I suppose I agree that (4) is the way to go if we don't want to bend H::W. I think @tpendragon was in on that conversation too. Maybe he remembers something?

tpendragon commented 8 years ago

I've got a lot of thoughts and it's hard to organize them in my head, so I'm just gonna stash them all here and let them be commented upon.

I think what @mjgiarlo posted (option 4) is the expanded version of every possible interaction with Hydra Works. It basically reads like this: "Works are representations of the conceptual or physical - if a book has pages, a work (the book) has child works (the PHYSICAL pages). FileSets are representations of the digital file that sits on disk. If the book has a PDF, then it has a FileSet (the digital). Its conceptual pages (works) each have a digital representation (or don't!), each of which has a FileSet. Only digital things make sense to have binaries, so they have Files."
This is both good and bad. It's ridiculously expensive to make a hierarchy that deep for everything, and querying it quickly becomes a nightmare - even if you did have something like SPARQL in your toolset, which we don't. @mjgiarlo says we can do this using current codebases without making a single modification, but we have no UI or interaction for actually doing this. Even in Plum it would be hard - LOTS of clicks. HOWEVER, if everything's modeled this way then a computer can have a generally sound interaction model with all things hydra:works and be able to reason about what to do at each layer.
This probably shouldn't matter, but the IIIF mapping becomes...complex. Before FileSets were canvases, this would mean something more like Works are Canvases if they only have FileSets, and even more (although this probably should have always been the case) Works are Canvases if they only have FileSets which have an original File that's an image according to its technical metadata. Although maybe that's not right either? When you have a Collection which has a Work that has a PDF and images, should the PDF be on the canvas via something like IxIF or should it be its child works?

In summary, I THINK I prefer option 3 for "thing we can do easiest and I can imagine a UI for", but if this expanded model is something we can use everywhere (and my description is accurate) then maybe it's worth the expense, and it would make @cmh2166 happy I think.
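As a rough illustration of the expanded option 4 reading above, the sketch below uses hypothetical Python classes (`Work`, `FileSet`, and the helper `all_files` are invented for this example and are not the Hydra::Works API). It shows the book Work with its own FileSet, ordered child Works for the pages, and a page that has no digital representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileSet:
    """Hypothetical: the digital representation, holding file names."""
    label: str
    files: List[str] = field(default_factory=list)


@dataclass
class Work:
    """Hypothetical: the conceptual or physical thing."""
    label: str
    file_sets: List[FileSet] = field(default_factory=list)
    ordered_members: List["Work"] = field(default_factory=list)


def all_files(work: Work) -> List[str]:
    """Walk every layer of the hierarchy and collect the file names."""
    names = [name for fs in work.file_sets for name in fs.files]
    for child in work.ordered_members:
        names.extend(all_files(child))
    return names


book = Work("Book", file_sets=[FileSet("Whole book", ["book.pdf", "book-ocr.txt"])])
book.ordered_members.append(
    Work("Page 1", file_sets=[FileSet("Page 1 scan", ["page1.tiff", "page1-ocr.txt"])])
)
# A page Work may exist without any FileSet (the "or don't!" case).
book.ordered_members.append(Work("Page 2"))
```

Even in this toy form, reaching a page image means going Work, child Work, FileSet, File, which is the depth that the expense concern is about.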

tpendragon commented 8 years ago

Oh, also

3) In option 4, is the FileSet ordered? Prooobably not, and maybe that's where the IIIF mapping works out.

azaroth42 commented 8 years ago

Thanks @tpendragon !

Agree that 4 is the most expanded version, and with your reading. Given #12 (distinguish physical and digital in the Hybox models), plus the aim of having an extensible model that works with both image based and more traditional institutional repository objects, I wonder if the value of consistency across hybox instances and uses outweighs the cost of the additional objects.
I hear you about the expense! Even in Triannon with relatively simple annotation objects, the number of interactions was very high :( That said, the cost of Fedora4 interactions is something that we (the community) have already noted, and should work with Duraspace and F4 committers to reduce. So while implementation efficiency is still important, it seems good to avoid making choices that will be hard to change later based on the current state of technology, if there's at least hope that the technology will improve.
I'm not sure that I follow. The Page object would be the Canvas, and the FileSet is more like the annotation that links the content to the canvas? So the page can then maintain the separate height and width of the canvas, without colliding with the height and width of the images. I'll build a diagram :) For the PDF of the Collection/Object, there's also the rendering property (http://iiif.io/api/presentation/2.1/#rendering) for resources that represent the entire thing, rather than individual views.
I don't think that filesets would be ordered. What would the order represent, other than perhaps a preference for using the first resource rather than the last?

In terms of UI/UX, I don't think that the system should try to mirror the model directly into the UI. In particular, the distinction between RWO and the digitized object is important for some cases, but the vast majority of the properties can apply to one or the other -- a single form could be used to capture the information, regardless of where that information ends up in the persistence layer that implements the model. It might also be good to have admin configurable templates for these, such as setting up controlled vocabularies to use, fields to hide/show, fields to auto-populate, and so on.

mjgiarlo commented 8 years ago

While I catch up on this thread, to be clear about my own intent: I am :100: OK with bending Hydra::Works or just using Hydra::PCDM, or another Hydra::PCDM-based library, for our needs here. I thought it prudent to have at least one option that is the naïve Hydra::Works model.

azaroth42 commented 8 years ago

I think the overall model looks something like:

hybox

And the mapping for IIIF falls only on the Digital Object side of the line, with metadata copied across from the RWO side. The mapping would be:

Collection --> Collection
Object --> Manifest
(order of object) --> Sequence
Part --> Canvas
FileSet --> Annotation
File --> Content
TechMD --> info.json for images

And then FileSet for an Object or Collection would be rendering, rather than an Annotation.
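The mapping above, including the rendering exception, can be sketched as a lookup table. This is only an illustration: `IIIF_MAPPING`, `iiif_role`, and the key `"Order"` (standing in for "order of object") are hypothetical names, and the function returns the role names used in this comment rather than producing any real IIIF JSON.

```python
# Hypothetical lookup from repository-side resource kinds to IIIF roles,
# following the mapping listed in the comment above.
IIIF_MAPPING = {
    "Collection": "Collection",
    "Object": "Manifest",
    "Order": "Sequence",  # stands in for "(order of object)"
    "Part": "Canvas",
    "FileSet": "Annotation",
    "File": "Content",
    "TechMD": "info.json",
}


def iiif_role(kind: str, parent_kind: str = "") -> str:
    """Return the IIIF role for a resource kind, given the kind of its parent.

    A FileSet attached directly to an Object or Collection represents the
    whole thing, so it maps to a rendering rather than an Annotation.
    """
    if kind == "FileSet" and parent_kind in ("Object", "Collection"):
        return "rendering"
    return IIIF_MAPPING[kind]
```

For example, `iiif_role("FileSet", "Part")` gives `"Annotation"` for a page image, while `iiif_role("FileSet", "Object")` gives `"rendering"` for the PDF of the whole book.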

The provenance/history/versioning features of the repository likely wouldn't be put into the IIIF, but are important to capture for HyBox on the digital side.

tpendragon commented 8 years ago

:+1:. It's just determining when a Work is a Part, yeah? It may just be "if it's ordered". There's also an issue where width/height are required for the canvas, and if there's no FileSet then there's no width/height...probably

azaroth42 commented 8 years ago

Yes, determining when a work is a part is key. Given that in IIIF all canvases must be part of a sequence, the ordered-ness is a great way to do that. :+1: from me.

The part should be able to record the h/w in the event that there isn't a fileset. It also provides a resource to hang other properties off, like the description/note that the back of a photograph has a signature on it, but there's no image that depicts it.
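A minimal sketch of that rule, under stated assumptions: the classes and the `canvases` helper are hypothetical, a member counts as a Part (and so a Canvas) only if it is in the ordered list, and the Part's own recorded dimensions are preferred, falling back to its first FileSet. The precedence is a guess made for this example, not something the thread settles.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FileSet:
    width: int
    height: int


@dataclass
class Work:
    label: str
    width: Optional[int] = None   # recorded on the Part itself
    height: Optional[int] = None
    note: str = ""
    file_sets: List[FileSet] = field(default_factory=list)
    ordered_members: List["Work"] = field(default_factory=list)


def canvases(work: Work) -> List[Tuple[str, Optional[int], Optional[int]]]:
    """Treat each ordered member as a Canvas and work out its dimensions."""
    result = []
    for part in work.ordered_members:
        w, h = part.width, part.height
        if (w is None or h is None) and part.file_sets:
            # Assumption: fall back to the first FileSet's image size.
            w, h = part.file_sets[0].width, part.file_sets[0].height
        result.append((part.label, w, h))
    return result


photo = Work("Photograph")
photo.ordered_members.append(Work("Front", file_sets=[FileSet(3000, 4000)]))
# The back has no image, only recorded dimensions and a note.
photo.ordered_members.append(
    Work("Back", width=3000, height=4000, note="Signature, no image")
)
```

Here the back of the photograph still yields a Canvas with a width and height even though it has no FileSet.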

cmharlow commented 8 years ago

Thanks for cc'ing me, @tpendragon.

I like the overall model @azaroth42 just shared (+1 to option 4 from my viewpoint), and my questions are more about what relationship we use to connect HW:Work/PCDM:Object to RWO (or whatever other instances in other repositories/domain models we want to link to - WEMI Works, CHOs, etc.), and if that RWO/CHO/WEMI instance gets representation somehow in Fedora/PCDM.

FWIW in this discussion, we've got some edge cases currently where we want to make descriptive metadata assertions on a HW:Work/PCDM:Object that stands in for the Page (basically migrating legacy functionalities). @azaroth42 covers similar needs in his comments.

Here is a simple overview of what I'd hoped we would do (we've gone the route of conflating the dpla:SR instance/stand-in for descriptive metadata on the WEMI Work with the Book PCDM:Object). The red writing shows the very specific metadata needs that made me originally look at this idea.

Hope this helps, and that I didn't misunderstand the discussion so far. Thanks for all you guys do.

azaroth42 commented 8 years ago

@cmh2166 -- our graphs look isomorphic, which is encouraging that independent analysis came to the same result :)

My thinking is that for HyBox we would want to include one level of RWO for Object and Collection to act as the resource that maintains the descriptive metadata. This would be a relatively simple resource that can be replaced or extended as desired on a case by case basis, without breaking the digital object model. If someone wants to make a module for BibFrame or CIDOC-CRM or BIBO or ... then there are clear anchoring points across that black line.

We should discuss next week at LDCX :D

tpendragon commented 8 years ago

@cmh2166 We should talk about the RWO vs intellectual work split at LDCX, because if you include "file on disk" I think there's three levels, there, versus the two that would be here.

cmharlow commented 8 years ago

@azaroth42 - Absolutely, and I admittedly am sharing here for a different context than what you've got for Hybox (and I know nothing about IIIF other than it seems pretty rad and I wish I had a reason to be involved with it).

@tpendragon - yes, I agree. From my perspective, it's a question of what metadata domain models we want to bring over to pcdm representation versus deciding pcdm represents digital objects/collections and building bridges to those other domain models as described/stored elsewhere.

So, I generally like where this is going with option 4, though I feel bad about the performance inefficiencies this can create, and I still have questions for Hydra::Works/PCDM more broadly. We can def discuss next week.

I'll let you guys get on with planning the hybox revolution.

azaroth42 commented 8 years ago

@tpendragon Good point! pcdm:File is the file, so the pointer doesn't really exist in @cmh2166's diagram. I skipped over that. Also the OCR file currently on the AF:Book would live inside a FileSet. So maybe not entirely isomorphic yet.

cmharlow commented 8 years ago

@azaroth42 OCR is a lingering question (where to put and how). The file pointers to AWS will be added in an upcoming test instance (right now, we're just storing files in fedora for sake of testing all this in a sandbox). Any advice on how to handle those pointers, +1.

azaroth42 commented 8 years ago

We have the same issue at Stanford for very large files (video and web archives in particular) where we will need Fedora4 to somehow manage the metadata for content that isn't directly "in" Fedora. It's (IMO) an important question, as the distinction is between an RDFSource and a NonRDFSource (in LDP terms). I've been assuming that this will work itself out, but it needs to be scheduled (tag @cbeer @mjgiarlo @hannahfrost @anarchivist) as a not insignificant piece of work

escowles commented 8 years ago

@azaroth42's list of reasons to have a Page object separate from the FileSet seems reasonably convincing to me. I don't think all of those absolutely need a separate Page object, but it would certainly make them more elegant.

Is the intention to always use separate Page objects, or to use them for the 5% of cases that need them?

azaroth42 commented 8 years ago

I would prefer to always use them, rather than have to have the developer / admin / whoever make a choice. And then have to test for the results of that choice all the time in the code.

From the SD HDC, if we can use batch ops to F4 to speed up some of these interactions -- both create and retrieve -- I would like to believe that the cost of the additional objects will be relatively low.

tpendragon commented 8 years ago

I would prefer to always use them, rather than have to have the developer / admin / whoever make a choice.

:+1: Branches are the death of productivity.

mjgiarlo commented 8 years ago

@tpendragon :speech_balloon:

Branches are the death of productivity.

:+1:

escowles commented 8 years ago

Though, AFAICT, the current F4 batch operations draft spec doesn't anticipate fewer HTTP requests -- it's really a refinement of the current transaction support to accept or abandon a set of changes, not a way to do a bunch of operations in a single request.

mjgiarlo commented 8 years ago

@tpendragon :speech_balloon:

@mjgiarlo says we can do this using current codebases without making a single modification, but we have no UI or interaction for actually doing this.

Fair point. I was assuming a near-term future where these tickets blocking the Sufia 7.0.0 release were already done. Sorry, product owner blinders were on. :wink:

mjgiarlo commented 8 years ago

@azaroth42 :speech_balloon:

We have the same issue at Stanford for very large files (video and web archives in particular) where we will need Fedora4 to somehow manage the metadata for content that isn't directly "in" Fedora.

Is this a HyBox need or a different need? Which of our many needs are you referencing here, @azaroth42? :wink:

azaroth42 commented 8 years ago

Not a need that has been identified for HyBox to my knowledge, but certainly one that has come up at Stanford. However, if the median repo size is on the order of 5TB, I do wonder whether it /is/ actually a HyBox need? We could push it to Business to decide?

mjgiarlo commented 8 years ago

Meh, I don't think it is a HyBox need at the moment so we should punt. Thanks for clarifying!

azaroth42 commented 8 years ago

Confirmed the sketch in https://github.com/hybox/models/issues/17#issuecomment-196919583 with @hannahfrost as being acceptable, if it is to others :)

mjgiarlo commented 8 years ago

@azaroth42 No objections. The only discrepancy I see between that sketch and the Hydra::Works model is that the latter currently explicitly disallows Collections from aggregating FileSets (lines 9 and 23):

https://github.com/projecthydra-labs/hydra-works/blob/master/lib/hydra/works/models/concerns/collection_behavior.rb

I'm at a loss for why we did that, tbh. Tagging some other folks who might remember why we shouldn't allow this: @jpstroop @jcoyne @tpendragon @escowles @elrayle @awead

jcoyne commented 8 years ago

I'm guessing this was more of an application concern for sufia/CurationConcerns where we didn't want FileSets to be objects that appear in search results, get transferred, etc.

azaroth42 commented 8 years ago

The use case for FileSets associated with Collections is when you have a thumbnail or other representations of the collection itself, rather than simply selecting a primary representative resource from the collection's members. Imagine a collection of 50 image objects, then a single image with all 50 members and their labels should be associated with the collection, but not as a member object of the collection -- the count of members is still 50, not 51.

In PCDM this could be just a hasFile relationship, but in Works I believe we want to maintain consistency with the FileSet notion to have derivatives grouped together, and always look in the same place for use/role information.
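The collection-level FileSet use case can be sketched as follows. The `Collection` and `FileSet` classes and the `use` attribute are hypothetical and written only for this example; as noted earlier in the thread, Hydra::Works itself currently disallows Collections from aggregating FileSets.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileSet:
    label: str
    use: str  # hypothetical use/role marker, e.g. "thumbnail"
    files: List[str] = field(default_factory=list)


@dataclass
class Collection:
    label: str
    members: List[str] = field(default_factory=list)        # member objects
    file_sets: List[FileSet] = field(default_factory=list)  # representations of the collection itself


collection = Collection("Images", members=[f"Image {i}" for i in range(1, 51)])

# The contact sheet represents the collection as a whole; it is attached
# to the collection but is not one of its members.
collection.file_sets.append(
    FileSet("Contact sheet", use="thumbnail", files=["contact-sheet.jpg"])
)

member_count = len(collection.members)
```

Keeping the contact sheet in `file_sets` rather than `members` is what leaves the member count at 50 rather than 51.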

no-reply commented 8 years ago

This is all looking good to me, and its treatment in #20 is helpful.

I want to point to the issue in #21 as a way to draw the line between the space of PCDM classes and that of "Real World Objects". The interpretation I would suggest is:

A PCDM Object (say a "Book") in HyBox can be considered a repository-specific representation of an actual "Book".
Statements about the PCDM Object may be statements about the real thing, but the semantics of the RWO's inheritance of those properties is undefined.

(There's a parallel to SKOS's handling of concepts: statements about a skos:Concept like :cat don't say anything about any real world class of cats, or any particular cat. A Concept may also be a Class, or an Instance, or whatever else, but then the user is on their own ensuring consistent semantics and not accidentally saying things like "Fluffy is narrower than the class of all cats").

So a pcdm:Object can have descriptive metadata without muddying up the RWO distinction, but the door is open for the RWO to have, e.g., a different title than the repository object ("Moby Dick (scanned 2016)" isn't an unmitigated disaster).

elrayle commented 8 years ago

This is a long thread and I admit to not having fully digested everything. I will try to read it more thoroughly later today. The following is a larger example of a model that I put together before Hydra Connect to show collections, works, and filesets in context. From my quick skim of this thread, it seems in line with the direction of the conversation.

work_pages_full_ex

NOTES:

There is a single work for all pages in the book. This allows for ordering of the pages without the full book pdf getting in the way. It also allows a IIIF reader access to the set of pages.
You could also add chapters, sections, and other book structures as works with pages. Ideally, the page filesets are in the system once and are members of multiple works (e.g. all pages, chapter X, etc.)
Collection thumbnails are handled as a related object.

escowles commented 8 years ago

I think Tom's comment gets at a practical solution: we can acknowledge that there is a difference between the RWO book and our digital object book, but the vast majority of people can elide that difference and attach the few properties they have about the physical book (e.g., its size) to the digital object. People who want to do more sophisticated things will need to be careful about the semantics, but we'd expect people with specific use cases around describing RWOs to be careful and thoughtful about that anyway.

This is also my position on having a Page object separate from a FileSet: Of course there is an intellectual Page object (and a Page RWO for that matter). But in the vast majority of cases, we should just attach any description of those to the FileSet.

azaroth42 commented 8 years ago

To me, this falls under the Freedom from Choice principle. The end users don't need to make a choice about whether there are two separate resources that manage the information, they're still going to get a combined view of the data and a combined form for data entry. Developers don't need to make a choice about whether to have one object or two, or worse to migrate from one to two when they find they made the wrong choice because they didn't fully understand the use cases. And they then don't need to implement the tests to determine which state each particular object is, as there's a consistent pattern. The costs of the decision are able to be mitigated through good engineering and enhancements to the underlying storage platform.

Also ... for regular Hydra shops with their own developers (at least one), there is the option to make changes like that to satisfy use cases. However for HyBox there isn't this opportunity. Whichever way we go is what everyone has to use... there's no option to be more sophisticated unless we make the decision to allow it now.

no-reply commented 8 years ago

@escowles :speech_balloon:

I think Tom's comment gets at a practical solution: we can acknowledge that there is a difference between the RWO book and our digital object book, but the vast majority of people can elide that difference and attach the few properties they have about the physical book (e.g., its size) to the digital object.

:+1:. And the "RWO" need not be "physical" in the usual sense. It might be a multi-year run of a play, a particular musical performance, an ebook, etc... "physical" vs. "abstract" or "digital" isn't the distinction, so much as "thing" vs. "repository realization".

Many repository maintainers are going to be quite happy with a "repository realization" only; those that aren't get a nice clean break and a lot of flexibility in how they can connect their "real world" model/ontology back to the repository.

@escowles :speech_balloon:

This is also my position on having a Page object separate from a FileSet: Of course there is an intellectual Page object (and a Page RWO for that matter). But in the vast majority of cases, we should just attach any description of those to the FileSet.

I'm :+1: on this if and only if we have a way to ensure compatibility for the use cases where the Page object is required. @azaroth42's call for "Freedom from Choice" seems important, here.

escowles commented 8 years ago

I like the Freedom from Choice principle, but not if the logical outcome of that train of thought is to use the most complex option in every case.

Given the nature of the project and the timeframe, I've been assuming that one of the design principles was to go with the simpler solution that worked for the vast majority of cases.

I think we should keep the current Work -> FileSet hierarchy, and find a way to add descriptions of RWOs on top of it, rather than adding it into the model whether the user has a use for it or not.

azaroth42 commented 8 years ago

"physical" vs. "abstract" or "digital" isn't the distinction, so much as "thing" vs. "repository realization".

:+1:

I like the Freedom from Choice principle, but not if the logical outcome of that train of thought is to use the most complex option in every case.

I'm not sure that one layer of separation is "the most complex option" :) Indeed, Lynette's model even has a "The Raven Pages" Work separate from "The Raven Work", which isn't present in mine. Also, the current simplicity of munging everything together makes life much harder down the line when richer models around the "thing" are available. You can't just swap out the resources for CIDOC or BibFrame, you'd need to go through and strip all the properties off of the repository resources.

That said, I'd very much like to understand the negatives, other than the performance overhead for F4, of including the extra layer? Could we brainstorm a list of pros and cons?

tpendragon commented 8 years ago

I'm not even thinking about the complex cases where the separation of intellectual from real world matters, really. Is there an easier to implement solution that solves the use case in this issue (a book with pages and a PDF of that book)?

escowles commented 8 years ago

@tpendragon I think you could do either:

Have a FileSet for each page plus a FileSet for the PDF, and use mime type to separate them.
Have a FileSet for the PDF, and a child Object to hold the page image FileSets.

tpendragon commented 8 years ago

@escowles The semantics for option 1 seem odd to me - mime doesn't feel like a good relationship designator. Option 2 might work, but you'd have to standardize again I think - parts go in a child object.

no-reply commented 8 years ago

Re: the mimetype option (or similar metadata based approaches): How do we determine which mime type presents which behavior? This seems to hide structural information in a non-structural property.

no-reply commented 8 years ago

@escowles :speech_balloon:

Have a FileSet for the PDF, and a child Object to hold the page image FileSets.

Is this genuinely easier to implement than a "part" object for each page? If so, is the reason mainly that it reduces the number of HTTP round trips?

escowles commented 8 years ago

It reduces both the number of Objects and number of HTTP requests. Particularly on objects with many pages, this could be hundreds of extra objects and thousands of extra HTTP requests.

Having an extra object to represent RWO or RW collections or to group the page image FileSets together adds a handful of extra objects. But adding an extra object for every file adds many, many more.
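A back-of-the-envelope count makes the difference concrete. The function below is a hypothetical sketch that counts only Objects and FileSets (not Files) for a book with a PDF and a given number of pages. It does not estimate HTTP requests, since those depend on the client and on Fedora.

```python
def object_counts(pages: int) -> dict:
    """Count Objects plus FileSets for one book under three layouts."""
    # One book Object, one FileSet for the PDF, one FileSet per page.
    fileset_only = 1 + 1 + pages
    # The same, plus a separate Page object for every page.
    with_page_objects = fileset_only + pages
    # The same as fileset_only, plus one child Object grouping the page FileSets.
    with_grouping_object = fileset_only + 1
    return {
        "fileset_only": fileset_only,
        "with_page_objects": with_page_objects,
        "with_grouping_object": with_grouping_object,
    }


counts = object_counts(300)
```

For a 300-page book, the grouping Object adds one resource, while a Page object per page adds 300.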

azaroth42 commented 8 years ago

To channel @cbeer from courtyard discussion ... the model can make the distinction, and the implementation can (even today) use # URIs to avoid the HTTP request overhead. Then when there's a technology solution for the overhead (e.g. something like LDP-batch), there's a trivial transition rather than an impossible one.

The request overhead issue is well known ... are there other concerns about the separation?

escowles commented 8 years ago

@azaroth42 it also seems superfluous to have a separate Page object, because we already have a FileSet which can be used to hold descriptive metadata about the page. But, I fully admit that's based on the use cases I've worked on and not on the prospective HyBox user input.

tpendragon commented 8 years ago

I agree that the model and the implementation can be different, and I don't want to get too deep into implementation, but # URIs won't work here (Hash URIs can't have contained resources, so you can't make the FileSet a # URI. They work great for leaf nodes.)
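One way to see the hash URI limitation, sketched with Python's standard `urllib.parse.urldefrag` and made-up example.org URIs: anything appended after the `#` stays inside the fragment, so a "child" of a hash URI still identifies the same retrievable document rather than a resource contained under the page.

```python
from urllib.parse import urldefrag

# Hypothetical URIs for illustration only.
page_uri = "http://example.org/book/1#page1"
nested_uri = page_uri + "/fileset"  # an attempt at a contained FileSet

page_doc, page_fragment = urldefrag(page_uri)
nested_doc, nested_fragment = urldefrag(nested_uri)
```

Both URIs resolve to the document `http://example.org/book/1`; the nested one just has the longer fragment `page1/fileset`, so there is no separate resource for the FileSet to live at.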
