Closed azaroth42 closed 8 years ago
I think Hydra::Works breaks this--maybe. In pure PCDM, the Book is a pcdm:Object, with files representing it in its entirety, like a PDF, associated via pcdm:hasFile. Each page is also a pcdm:Object, with associated files representing the page only.
I'm not sure how you do this in Hydra::Works. Maybe there's a FileSet that's part of the work but not part of the order? :-(
A naïve Hydra::Works-based model (w/o ordering represented...) could look like this:
Maybe this is what @jpstroop was :frowning: about!
Right, pure PCDM is option 1... Object with files and more objects.
In Works, a FileSet that's not part of the order but is a member is (3) or (3b)
Or the Object also becomes a FileSet, as it has a set of files plus the member FileSets ... which is (2). For the record, I don't like (2).
The version from Mike is (4) ... Introduce another Object to represent the page separate from the FileSet that holds the Files. This is what we decided /against/ at the SD HDC ... no need for the object that becomes the Canvas separate from the FileSet that holds the images (and other content).
If (4) is still on the table, it's (still) my preference.
@azaroth42 @jpstroop Can you refresh my memory about the downside of (4)? Is it just that the Page-level Work there is not necessary to make the mapping to IIIF (in which case, OK but ¯\_(ツ)_/¯ :question:) or is there more to it than that?
I believe (Jon correct me if I'm misremembering) it's that the Page object would be just another object to maintain a URI for without much value... you can add label and so forth to the FileSet directly, such that it works out at least 95% of the time.
There is some value, but mostly the 5% cases that Shared Canvas deals with:
The main value of the extra PCDM object is that it aligns well with the Hydra::Works model -- we can do this using current codebases without making a single modification (right?). I'm not saying that concern overrides all others, but I'd just toss that in as a valuable thing to keep in mind. (Have I mentioned that I'm a fan of models for which there's already working code? :grinning: )
I believe (Jon correct me if I'm misremembering) it's that the Page object would be just another object to maintain a URI for without much value... you can add label and so forth to the FileSet directly, such that it works out at least 95% of the time.
I honestly can't remember why--maybe the restrictions of H::W just make it feel like kludge? I suppose I agree that (4) is the way to go if we don't want to bend H::W. I think @tpendragon was in on that conversation too. Maybe he remembers something?
I've got a lot of thoughts and it's hard to organize them in my head, so I'm just gonna stash them all here and let them be commented upon.
1) @mjgiarlo says we can do this using current codebases without making a single modification, but we have no UI or interaction for actually doing this. Even in Plum it would be hard - LOTS of clicks. HOWEVER, if everything's modeled this way then a computer can have a generally sound interaction model with all things hydra:works and be able to reason about what to do at each layer.
2) Works are Canvases if they only have FileSets, and even more (although this probably should have always been the case) Works are Canvases if they only have FileSets which have an original File that's an image according to its technical metadata. Although maybe that's not right either? When you have a Collection which has a Work that has a PDF and images, should the PDF be on the canvas via something like IxIF or should it be its child works?
In summary, I THINK I prefer option 3 for "thing we can do easiest and I can imagine a UI for", but if this expanded model is something we can use everywhere (and my description is accurate) then maybe it's worth the expense, and it would make @cmh2166 happy I think.
Oh, also
3) In option 4, is the FileSet ordered? Prooobably not, and maybe that's where the IIIF mapping works out.
Thanks @tpendragon !
rendering property (http://iiif.io/api/presentation/2.1/#rendering) for resources that represent the entire thing, rather than individual views.
In terms of UI/UX, I don't think that the system should try to mirror the model directly into the UI. In particular, the distinction between RWO and the digitized object is important for some cases, but the vast majority of the properties can apply to one or the other -- a single form could be used to capture the information, regardless of where that information ends up in the persistence layer that implements the model. It might also be good to have admin configurable templates for these, such as setting up controlled vocabularies to use, fields to hide/show, fields to auto-populate, and so on.
While I catch up on this thread, to be clear about my own intent: I am :100: OK with bending Hydra::Works or just using Hydra::PCDM, or another Hydra::PCDM-based library, for our needs here. I thought it prudent to have at least one option that is the naïve Hydra::Works model.
I think the overall model looks something like:
And the mapping for IIIF falls only on the Digital Object side of the line, with metadata copied across from the RWO side. The mapping would be:
Collection --> Collection
Object --> Manifest
(order of object) --> Sequence
Part --> Canvas
FileSet --> Annotation
File --> Content
TechMD --> info.json for images
And then FileSet for an Object or Collection would be rendering, rather than an Annotation.
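As a rough illustration of that mapping, here is a minimal Python sketch. It is plain Python, not Hydra::Works or IIIF library code; the names `PCDM_TO_IIIF` and `iiif_role` are made up for this example. It encodes the pairs from the mapping and the rule that a FileSet attached directly to an Object or Collection becomes a rendering rather than an Annotation.

```python
# Illustrative only: plain Python, not the Hydra::Works or IIIF API.
# The pairs below are the ones listed in the thread's mapping.
PCDM_TO_IIIF = {
    "Collection": "Collection",
    "Object": "Manifest",
    "order of object": "Sequence",
    "Part": "Canvas",
    "FileSet": "Annotation",
    "File": "Content",
    "TechMD": "info.json",  # for images
}


def iiif_role(resource_type, parent_type=None):
    """Return the IIIF construct a repository resource maps to.

    A FileSet whose parent is an Object or Collection represents the
    whole thing (for example a PDF of the book), so it maps to
    `rendering` instead of an Annotation.
    """
    if resource_type == "FileSet" and parent_type in ("Object", "Collection"):
        return "rendering"
    return PCDM_TO_IIIF[resource_type]
```

With this sketch, `iiif_role("FileSet", "Part")` gives `"Annotation"`, while `iiif_role("FileSet", "Object")` gives `"rendering"`.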
The provenance/history/versioning features of the repository likely wouldn't be put into the IIIF, but are important to capture for HyBox on the digital side.
:+1:. It's just determining when a Work is a Part, yeah? It may just be "if it's ordered". There's also an issue where width/height are required for the canvas, and if there's no FileSet then there's no width/height...probably
Yes, determining when a work is a part is key. Given that in IIIF all canvases must be part of a sequence, the ordered-ness is a great way to do that. :+1: from me.
The part should be able to record the h/w in the event that there isn't a fileset. It also provides a resource to hang other properties off, like the description/note that the back of a photograph has a signature on it, but there's no image that depicts it.
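A small Python sketch of the two rules discussed here: a Work counts as a Part when it is ordered, and the Canvas takes its width/height from a FileSet when there is one, falling back to values recorded on the Part. All class and field names are hypothetical, and reading the size from the first FileSet is an assumption of this sketch, not something the thread decided.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FileSet:
    width: int
    height: int


@dataclass
class Work:
    label: str
    ordered: bool = False          # hypothetical flag: appears in its parent's order
    width: Optional[int] = None    # dimensions recorded on the part itself
    height: Optional[int] = None
    file_sets: List[FileSet] = field(default_factory=list)


def is_part(work: Work) -> bool:
    # "It may just be 'if it's ordered'."
    return work.ordered


def canvas_size(work: Work) -> Optional[Tuple[int, int]]:
    # Prefer the image dimensions; assumption: the first FileSet is used.
    if work.file_sets:
        first = work.file_sets[0]
        return (first.width, first.height)
    # No FileSet: fall back to the h/w recorded on the part.
    if work.width is not None and work.height is not None:
        return (work.width, work.height)
    return None


# The back of a photograph: described, but never imaged.
verso = Work(label="verso", ordered=True, width=4000, height=3000)
```

Here `canvas_size(verso)` falls back to the dimensions recorded on the part, so the Canvas can still be built without any image.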
Thanks for cc'ing me, @tpendragon.
I like the overall model @azaroth42 just shared (+1 to option 4 from my viewpoint), and my questions are more about what relationship we use to connect HW:Work/PCDM:Object to RWO (or whatever other instances in other repositories/domain models we want to link to - WEMI Works, CHOs, etc.), and if that RWO/CHO/WEMI instance gets representation somehow in Fedora/PCDM.
FWIW in this discussion, we've got some edge cases currently where we want to make descriptive metadata assertions on a HW:Work/PCDM:Object that stands in for the Page (basically migrating legacy functionalities). @azaroth42 covers similar needs in his comments.
Here is a simple overview of what I'd hoped we would do (we've gone the route of conflating the dpla:SR instance/stand-in for descriptive metadata on the WEMI Work with the Book PCDM:Object). The red writing shows the very specific metadata needs that made me originally look at this idea.
Hope this helps, and that I didn't misunderstand the discussion so far. Thanks for all you guys do.
@cmh2166 -- our graphs look isomorphic, which is encouraging that independent analysis came to the same result :)
My thinking is that for HyBox we would want to include one level of RWO for Object and Collection to act as the resource that maintains the descriptive metadata. This would be a relatively simple resource that can be replaced or extended as desired on a case by case basis, without breaking the digital object model. If someone wants to make a module for BibFrame or CIDOC-CRM or BIBO or ... then there are clear anchoring points across that black line.
We should discuss next week at LDCX :D
@cmh2166 We should talk about the RWO vs intellectual work split at LDCX, because if you include "file on disk" I think there's three levels, there, versus the two that would be here.
@azaroth42 - Absolutely, and I admittedly am sharing here for a different context than what you've got for Hybox (and I know nothing about IIIF other than it seems pretty rad and I wish I had a reason to be involved with it).
@tpendragon - yes, I agree. From my perspective, it's a question of what metadata domain models we want to bring over to pcdm representation versus deciding pcdm represents digital objects/collections and building bridges to those other domain models as described/stored elsewhere.
So, I generally like where this is going with option 4, though I feel bad about the performance inefficiencies it can create, and I still have questions for Hydra::Works/PCDM more broadly. We can def discuss next week.
I'll let you guys get on with planning the hybox revolution.
@tpendragon Good point! pcdm:File is the file, so the pointer doesn't really exist in @cmh2166's diagram. I skipped over that. Also the OCR file currently on the AF:Book would live inside a FileSet. So maybe not entirely isomorphic yet
@azaroth42 OCR is a lingering question (where to put and how). The file pointers to AWS will be added in an upcoming test instance (right now, we're just storing files in fedora for sake of testing all this in a sandbox). Any advice on how to handle those pointers, +1.
We have the same issue at Stanford for very large files (video and web archives in particular) where we will need Fedora4 to somehow manage the metadata for content that isn't directly "in" Fedora. It's (IMO) an important question, as the distinction is between an RDFSource and a NonRDFSource (in LDP terms). I've been assuming that this will work itself out, but it needs to be scheduled (tag @cbeer @mjgiarlo @hannahfrost @anarchivist) as a not insignificant piece of work
@azaroth42's list of reasons to have a Page object separate from the FileSet seems reasonably convincing to me. I don't think all of those absolutely need a separate Page object, but it would certainly make them more elegant.
Is the intention to always use separate Page objects, or to use them for the 5% of cases that need them?
I would prefer to always use them, rather than have to have the developer / admin / whoever make a choice. And then have to test for the results of that choice all the time in the code.
From the SD HDC, if we can use batch ops to F4 to speed up some of these interactions -- both create and retrieve -- I would like to believe that the cost of the additional objects will be relatively low.
I would prefer to always use them, rather than have to have the developer / admin / whoever make a choice.
:+1: Branches are the death of productivity.
@tpendragon :speech_balloon:
Branches are the death of productivity.
:+1:
Though, AFAICT, the current F4 batch operations draft spec doesn't anticipate fewer HTTP requests -- it's really a refinement of the current transaction support to accept or abandon a set of changes, not a way of doing a bunch of operations in a single request.
@tpendragon :speech_balloon:
@mjgiarlo says we can do this using current codebases without making a single modification, but we have no UI or interaction for actually doing this.
Fair point. I was assuming a near-term future where these tickets blocking the Sufia 7.0.0 release were already done. Sorry, product owner blinders were on. :wink:
@azaroth42 :speech_balloon:
We have the same issue at Stanford for very large files (video and web archives in particular) where we will need Fedora4 to somehow manage the metadata for content that isn't directly "in" Fedora.
Is this a HyBox need or a different need? Which of our many needs are you referencing here, @azaroth42? :wink:
Not a need that has been identified for HyBox to my knowledge, but certainly one that has come up at Stanford. However, if the median repo size is on the order of 5TB, I do wonder whether it /is/ actually a HyBox need? We could push it to Business to decide?
Meh, I don't think it is a HyBox need at the moment so we should punt. Thanks for clarifying!
Confirmed the sketch in https://github.com/hybox/models/issues/17#issuecomment-196919583 with @hannahfrost as being acceptable, if it is to others :)
@azaroth42 No objections. The only discrepancy I see between that sketch and the Hydra::Works model is that the latter currently explicitly disallows Collections from aggregating FileSets (lines 9 and 23):
I'm at a loss for why we did that, tbh. Tagging some other folks who might remember why we shouldn't allow this: @jpstroop @jcoyne @tpendragon @escowles @elrayle @awead
I'm guessing this was more of an application concern for sufia/CurationConcerns where we didn't want FileSets to be objects that appear in search results, get transferred, etc.
The use case for FileSets associated with Collections is when you have a thumbnail or other representations of the collection itself, rather than simply selecting a primary representative resource from the collection's members. Imagine a collection of 50 image objects: a single image showing all 50 members and their labels should be associated with the collection, but not as a member object of the collection -- the count of members is still 50, not 51.
In PCDM this could be just a hasFile relationship, but in Works I believe we want to maintain consistency with the FileSet notion to have derivatives grouped together, and always look in the same place for use/role information.
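The 50-versus-51 point can be sketched in a few lines of Python. This is an illustration with made-up names, not the Hydra::Works API: the collection keeps its member objects and its own FileSets in separate lists, so a representative image never changes the member count.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Collection:
    members: List[str] = field(default_factory=list)    # member object ids
    file_sets: List[str] = field(default_factory=list)  # representations of the collection itself

    def member_count(self) -> int:
        # Only member objects are counted, never the collection's own FileSets.
        return len(self.members)


collection = Collection(members=[f"image-{i}" for i in range(1, 51)])
collection.file_sets.append("contact-sheet")  # hypothetical image showing all 50 members
```

After attaching the contact sheet, `collection.member_count()` is still 50.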
This is all looking good to me, and its treatment in #20 is helpful.
I want to point to the issue in #21 as a way to draw the line between the space of PCDM classes and that of "Real World Objects". The interpretation I would suggest is:
(There's a parallel to SKOS's handling of concepts: statements about a skos:Concept like :cat don't say anything about any real world class of cats, or any particular cat. A Concept may also be a Class, or an Instance, or whatever else, but then the user is on their own ensuring consistent semantics and not accidentally saying things like "Fluffy is narrower than the class of all cats").
So a pcdm:Object can have descriptive metadata without muddying up the RWO distinction, but the door is open for the RWO to have, e.g., a different title than the repository object ("Moby Dick (scanned 2016)" isn't an unmitigated disaster).
This is a long thread and I admit to not having fully digested everything. I will try to read it more thoroughly later today. The following is a larger example of a model that I put together before Hydra Connect to show collections, works, and filesets in context. From my quick skim of this thread, it seems in line with the direction of the conversation.
NOTES:
I think Tom's comment gets at a practical solution: we can acknowledge that there is a difference between the RWO book and our digital object book, but the vast majority of people can elide that difference and attach the few properties they have about the physical book (e.g., its size) to the digital object. People who want to do more sophisticated things will need to be careful about the semantics, but we'd expect people with specific use cases around describing RWOs to be careful and thoughtful about that anyway.
This is also my position on having a Page object separate from a FileSet: Of course there is an intellectual Page object (and a Page RWO for that matter). But in the vast majority of cases, we should just attach any description of those to the FileSet.
To me, this falls under the Freedom from Choice principle. The end users don't need to make a choice about whether there are two separate resources that manage the information, they're still going to get a combined view of the data and a combined form for data entry. Developers don't need to make a choice about whether to have one object or two, or worse to migrate from one to two when they find they made the wrong choice because they didn't fully understand the use cases. And they then don't need to implement the tests to determine which state each particular object is, as there's a consistent pattern. The costs of the decision are able to be mitigated through good engineering and enhancements to the underlying storage platform.
Also ... for regular Hydra shops with their own developers (at least one), there is the option to make changes like that to satisfy use cases. However for HyBox there isn't this opportunity. Whichever way we go is what everyone has to use... there's no option to be more sophisticated unless we make the decision to allow it now.
@escowles :speech_balloon:
I think Tom's comment gets at a practical solution: we can acknowledge that there is a difference between the RWO book and our digital object book, but the vast majority of people can elide that difference and attach the few properties they have about the physical book (e.g., its size) to the digital object.
:+1:. And the "RWO" need not be "physical" in the usual sense. It might be a multi-year run of a play, a particular musical performance, an ebook, etc... "physical" vs. "abstract" or "digital" isn't the distinction, so much as "thing" vs. "repository realization".
Many repository maintainers are going to be quite happy with a "repository realization" only; those that aren't get a nice clean break and a lot of flexibility in how they can connect their "real world" model/ontology back to the repository.
@escowles :speech_balloon:
This is also my position on having a Page object separate from a FileSet: Of course there is an intellectual Page object (and a Page RWO for that matter). But in the vast majority of cases, we should just attach any description of those to the FileSet.
I'm :+1: on this if and only if we have a way to ensure compatibility for the use cases where the Page object is required. @azaroth42's call for "Freedom of Choice" seems important, here.
I like the Freedom of Choice principle, but not if the logical outcome of that train of thought is to use the most complex option in every case.
Given the nature of the project and the timeframe, I've been assuming that one of the design principles was to go with the simpler solution that worked for the vast majority of cases.
I think we should keep the current Work -> FileSet hierarchy, and find a way to add descriptions of RWOs on top of it, rather than adding it into the model whether the user has a use for it or not.
"physical" vs. "abstract" or "digital" isn't the distinction, so much as "thing" vs. "repository realization".
:+1:
I like the Freedom of Choice principle, but not if the logical outcome of that train of thought is to use the most complex option in every case.
I'm not sure that one layer of separation is "the most complex option" :) Indeed, Lynette's model even has a "The Raven Pages" Work separate from "The Raven Work", which isn't present in mine. Also, the current simplicity of munging everything together makes life much harder down the line when richer models around the "thing" are available. You can't just swap out the resources for CIDOC or BibFrame, you'd need to go through and strip all the properties off of the repository resources.
That said, I'd very much like to understand the negatives, other than the performance overhead for F4, of including the extra layer? Could we brainstorm a list of pros and cons?
I'm not even thinking about the complex cases where the separation of intellectual from real world matters, really. Is there an easier to implement solution that solves the use case in this issue (a book with pages and a PDF of that book)?
@tpendragon I think you could do either:
@escowles The semantics for option 1 seem odd to me - mime doesn't feel like a good relationship designator. Option 2 might work, but you'd have to standardize again I think - parts go in a child object.
Re: the mimetype option (or similar metadata based approaches): How do we determine which mime type presents which behavior? This seems to hide structural information in a non-structural property.
@escowles :speech_balloon:
Have a FileSet for the PDF, and a child Object to hold the page image FileSets.
Is this genuinely easier to implement than a "part" object for each page? If so, is the reason mainly that it reduces the number of HTTP round trips?
It reduces both the number of Objects and number of HTTP requests. Particularly on objects with many pages, this could be hundreds of extra objects and thousands of extra HTTP requests.
Having an extra object to represent RWO or RW collections or to group the page image FileSets together adds a handful of extra objects. But adding an extra object for every file adds many, many more.
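To put rough numbers on that, here is a back-of-the-envelope Python sketch. It counts only Objects and FileSets for a book with a PDF and one image per page; files, ordering proxies, and HTTP requests are left out, and the 300-page figure is just an example.

```python
def resources_with_child_object(pages: int) -> int:
    # Book + PDF FileSet + one child Object + one FileSet per page
    return 1 + 1 + 1 + pages


def resources_with_page_objects(pages: int) -> int:
    # Book + PDF FileSet + for each page: one Page object and one FileSet
    return 1 + 1 + 2 * pages


pages = 300
extra = resources_with_page_objects(pages) - resources_with_child_object(pages)
print(extra)  # 299
```

Under these assumptions the per-page layout costs one extra resource per page, minus the single grouping object it no longer needs.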
To channel @cbeer from courtyard discussion ... the model can make the distinction, and the implementation can (even today) use # URIs to avoid the HTTP request overhead. Then when there's a technology solution for the overhead (e.g. something like LDP-batch), there's a trivial transition rather than an impossible one.
The request overhead issue is well known ... are there other concerns about the separation?
@azaroth42 it also seems superfluous to have a separate Page object, because we already have a FileSet which can be used to hold descriptive metadata about the page. But, I fully admit that's based on the use cases I've worked on and not on the prospective HyBox user input.
I agree that the model and the implementation can be different, and I don't want to get too deep into implementation, but # URIs won't work here (Hash URIs can't have contained resources, so you can't make the FileSet a # URI. They work great for leaf nodes.)
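For readers less familiar with why # URIs avoid request overhead: the fragment is not sent to the server, so every hash URI under the same base lives in one HTTP document. A minimal Python sketch, using made-up example.org URIs:

```python
from urllib.parse import urldefrag

# Hypothetical page resources identified by hash URIs on the book's document.
hash_uris = [f"http://example.org/book1#page{i}" for i in range(1, 4)]

# The same pages as separate resources, each with its own document.
slash_uris = [f"http://example.org/book1/page{i}" for i in range(1, 4)]

# Strip the fragment to find which documents actually have to be fetched.
hash_documents = {urldefrag(uri).url for uri in hash_uris}
slash_documents = {urldefrag(uri).url for uri in slash_uris}

print(len(hash_documents))   # 1
print(len(slash_documents))  # 3
```

This only shows the request-count side of the argument; it says nothing about containment, which is the limitation raised above for FileSets.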
@no-reply @mjgiarlo @cbeer @jpstroop
In Scenario 5, there is a digital object (a book) with its own files (a PDF and OCR'd text) as well as a set of components (pages) with their own files (TIFF and OCR).
There seem to be several options here: