Use Case: Page level representation (parts of Works)

azaroth42 commented 10 years ago

A multi-page text (such as a book, magazine, article, or similar) may be represented by a variety of different resources with different uses. Contributed towards refining #8

Given a simple object that could be rendered in a page turning interface:

The Work
- is-described-by Descriptive metadata about the Work
- has-representation PDF of the entire Work
- has-part-order [Page 1, Page 2, Page 3, ...]
- has-part Page 1
  - is-described-by Descriptive metadata about Page 1
  - has-representation JPG Access copy of digitized Page 1
  - has-representation JP2/Tiff Master copy of digitized Page 1
  - has-representation ALTO XML of Page 1's textual content
- has-part Page 2
  - ...
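
As a rough illustration, the outline above could be sketched in plain Ruby. The `Representation`, `Page`, and `Work` structs and all of their field names are hypothetical, chosen only for this sketch; they are not part of Hydra, Sufia, or any proposed model.

```ruby
# Hypothetical plain-Ruby sketch of the outline above; none of these
# names come from Hydra or Sufia.
Representation = Struct.new(:purpose, :media_type)
Page = Struct.new(:label, :metadata, :representations)
Work = Struct.new(:metadata, :representations, :parts, :part_order)

page1 = Page.new(
  'Page 1',
  { title: 'Page 1' },
  [
    Representation.new(:access, 'image/jpeg'),
    Representation.new(:master, 'image/tiff'),
    Representation.new(:text, 'application/xml') # ALTO XML
  ]
)
page2 = Page.new('Page 2', { title: 'Page 2' }, [])

work = Work.new(
  { title: 'Example Book' },
  [Representation.new(:full_text, 'application/pdf')],
  [page2, page1],       # has-part: membership only, deliberately out of order
  ['Page 1', 'Page 2']  # has-part-order: the explicit sequence of labels
)

# Resolve the explicit order into page objects, as a page-turning
# interface would need to do.
ordered_pages = work.part_order.map do |label|
  work.parts.find { |part| part.label == label }
end
```

Keeping membership (`parts`) separate from sequence (`part_order`) mirrors the has-part / has-part-order split in the outline.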

jcoyne commented 10 years ago

I think we have a few potential use cases which need to be fleshed out:

A work has descriptive metadata
A work has a representation
A work has multiple parts
The parts may have multiple representations each with different purposes
A work has multiple parts in a particular order

jpstroop commented 10 years ago

What about technical, provenance, etc. metadata about each of the 'representations'? At a minimum there are checksums and a media type for each, but what about software, hardware, events, etc.? If each representation were its own GenericFile, this wouldn't be a problem and the sort order (or at least the canonical one) could be a property of that.

Otherwise you're likely going to wind up hacking the property/stream relationships into their respective names, (e.g. jp2_md5_checksum) and...hopefully that speaks for itself :smile:.
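
A minimal sketch of the alternative, assuming each representation is its own object that carries its own technical metadata. The `FileRepresentation` class and its attributes are hypothetical and stand in for whatever GenericFile-like class is eventually used.

```ruby
require 'digest'

# Hypothetical sketch: each representation owns its technical metadata,
# so nothing has to be encoded in property names like jp2_md5_checksum.
class FileRepresentation
  attr_reader :name, :media_type, :content, :sort

  def initialize(name:, media_type:, content:, sort: nil)
    @name = name
    @media_type = media_type
    @content = content
    @sort = sort # canonical sort order, if this file has one
  end

  # The checksum is computed from this file's own bytes.
  def md5_checksum
    Digest::MD5.hexdigest(content)
  end
end

jp2 = FileRepresentation.new(
  name: 'page1.jp2', media_type: 'image/jp2', content: 'fake jp2 bytes', sort: 1
)
jpg = FileRepresentation.new(
  name: 'page1.jpg', media_type: 'image/jpeg', content: 'fake jpg bytes', sort: 1
)
```

Both files answer `md5_checksum` and `media_type` in the same way, so the format never has to appear in a property name.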

jcoyne commented 10 years ago

@jpstroop Sounds good, submit a use case for that?

jcoyne commented 10 years ago

@jpstroop I think I get what you are saying: Representations and Parts could all be the same entity. Right?

jpstroop commented 10 years ago

Yes, I think so. For example, a Representation of a Book has many Representations of a Page.

jcoyne commented 10 years ago

@escowles @mcritchlow is a "Part" as described above, what you call a "Component"?

jcoyne commented 10 years ago

@jpstroop I like that. I started thinking about RDF list and I could hear @no-reply sighing 1700 miles away.

jpstroop commented 10 years ago

(and, furthermore, Multi-Volume Booksets have Books that sort in a certain order, etc.).

@jpstroop is wary of unnecessary recursion, but I think it feels right in this case.

jpstroop commented 10 years ago

Will get a use case in tonight or tomorrow--should it live in this thread or start a new issue?

jcoyne commented 10 years ago

@jpstroop New issue please! Even better is to put one in ./use-cases as described in the CONTRIBUTING document.

mcritchlow commented 10 years ago

@jcoyne Yes, Part as described here aligns closely with our Components. We were envisioning that our Components in the Hydra:Works context will align pretty well with GenericFile.

mcritchlow commented 10 years ago

@jpstroop "If each representation were its own GenericFile, this wouldn't be a problem and the sort order (or at least the canonical one) could be a property of that."

:+1: In our currently proposed new model, we have partName and partNumber properties for this. At the moment they're slated as local UCSD properties, but perhaps there's a convention to agree on here.

azaroth42 commented 10 years ago

I disagree with @jpstroop (for once! mark your calendars, folks!) There needs to be a separation of Work, Representation-of-Work, Page and Representation-of-Page because X to Representation-of-X is a 1:N relationship, in the same way that Work to Page is 1:N.

In Pseudo-RDF:

_:book a h:Work;
    h:hasBitstream <PDF> ;
    h:hasComponentList ( _:page1 _:page2 _:page3 ) ;
    h:isDescribedBy <book-metadata> .

_:page1 a h:Component;
    h:hasBitstream <JPG> ;
    h:hasBitstream <ALTO-XML> ;
    h:hasBitstream <masterTiff> ;
    h:isDescribedBy <page-metadata> .  

<JPG> a h:Bitstream ;
    h:isDescribedBy <ImageTechnicalMetadata> .
...

In the above, <PDF> is a representation of the Work (Book), and <JPG>, <ALTO-XML> and <masterTiff> are all representations of the Component (Page). They can all have their own separate metadata.

Will graffle it up tomorrow.

escowles commented 10 years ago

This brings up two questions we've been grappling with in our data model:

Does a Work need to link directly to a datastream (the PDF in @azaroth42's example above), or can we live with always having a Component between Works and Bitstreams?
Since we have hierarchical relationships of Components (see #11), does Component need to contain other Components (in the LDP/Fedora4 sense), or is linking to them to encode the hierarchical relationships good enough?

jcoyne commented 10 years ago

@escowles I think you are getting into implementation detail at this point, which we might want to be wary of. However, you've enticed me, so in my opinion Components should be a completely optional part of the model. I think they make the simplest models overly complex. I'm thinking of a use case like: "There is a university publication and the only representation of it is a single PDF."

jpstroop commented 10 years ago

Coming from #8, here. I don't think @azaroth42 and I are in disagreement...only that we're thinking at different levels of detail. (Note that I'm keeping the name Hydra::Work here, but I really don't think that's the best name for it -- maybe DigitalObject?).

Consider this, in pseudo-Ruby

Hydra::Work
  has_many :files, class: Sufia::GenericFile
  has_many :things, class: Hydra::Work
  has_metadata ....

Or, in pseudo-RDF (based on @azaroth42's example):

# Book and Component are extensions of Work
:Book rdfs:subClassOf h:Work .
:Component rdfs:subClassOf h:Work .

# So an instance could look like:
_:book a :Book;
    h:hasGenericFile <PDF> ;
    h:hasComponent _:page1;
    h:hasComponent _:page2;
    h:hasMetadata <book-metadata> .

_:page1 a :Component;
    h:hasGenericFile <JPG> ;
    h:hasGenericFile <ALTO-XML> ;
    h:hasGenericFile <masterTiff> ;
    h:hasMetadata <page-metadata> ;
    x:sort 1 .

<PDF> a s:GenericFile ;
    h:hasMetadata <ImageTechnicalMetadata> .
<JPG> a s:GenericFile ;
    h:hasMetadata <ImageTechnicalMetadata> .

Again, I'm wary of recursion, but in this case Hydra::Work itself would be a somewhat dumb base class that only makes recursion possible, so that extensions of it can apply constraints that fit their domain; behaviours like sorting could be mixed in as needed, etc.

Is this too high level / abstract?
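
One way to picture the "dumb base class plus mixins" idea is the plain-Ruby sketch below. It is only an illustration: this `Work` is not the real Hydra::Work and does not use ActiveFedora, and the `SortableParts` module and `add_part` method are hypothetical names.

```ruby
# Hypothetical sketch: a recursive base class whose only job is to hold
# files and child Works, plus an optional sorting behaviour.
module SortableParts
  # Children ordered by their sort value; children without one go last.
  def sorted_parts
    parts.sort_by { |part| part.sort || Float::INFINITY }
  end
end

class Work
  attr_reader :label, :files, :parts, :sort

  def initialize(label, sort: nil)
    @label = label
    @sort = sort
    @files = []
    @parts = [] # parts are themselves Works, which is where recursion enters
  end

  def add_part(part)
    @parts << part
    self
  end
end

# Extensions decide which behaviours they need.
class Component < Work; end

class Book < Work
  include SortableParts
end

book = Book.new('Example Book')
book.add_part(Component.new('Page 2', sort: 2))
book.add_part(Component.new('Page 1', sort: 1))
```

Here `book.sorted_parts` returns the pages by their sort value, while a subclass that never includes `SortableParts` stays unordered.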

jcoyne commented 10 years ago

@jpstroop That's a wonderful example. This is very close to what I'm thinking. Only I hadn't considered making Component a subclass of Work. I'm wondering if anyone has a counter example for this? If this sort of model is adopted, I do think we need to urge restraint about using hierarchy. Just because it's there doesn't mean you should always use it.

escowles commented 10 years ago

I think my only issue with @jpstroop's example is that sufia:GenericFile seems overly heavy for the task at hand. I don't think the File class needs to have any descriptive metadata, but instead should just be a bitstream and associated technical metadata.

escowles commented 10 years ago

@jpstroop @jcoyne I think the inheritance might work out better the other way around: DigitalObject/Work could be a subclass of Component that adds links to Collections and whatever else is reserved for DigitalObject/Work alone.

jpstroop commented 10 years ago

Everyone is going to have parts of things that have more or less the same requirements as the things themselves.... Book sets, Books, Pages, Annotations, Archival Collections with Series/Files/Items.

@escowles, I agree, descriptive metadata about a GenericFile wouldn't Respect The Model :tm: in this case. Could it be constrained by defining a few different hasMetadata subproperties (descriptive, technical, rights, provenance are the ones that come to mind) with different domains? I don't know if I like the idea, but it's one way of solving the problem.
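
The "subproperties with different domains" idea could be sketched roughly as follows. The mapping of metadata kinds to owner types is an assumption made for illustration (the thread does not settle it), and `ALLOWED_METADATA` and `attach_metadata` are hypothetical names.

```ruby
# Hypothetical sketch: each kind of metadata is only valid on certain
# owner types. The exact mapping here is an assumption, not a decision
# made in this thread.
ALLOWED_METADATA = {
  work: %i[descriptive rights provenance],
  file: %i[technical rights provenance]
}.freeze

# Returns a new metadata hash, or raises if the kind is outside the
# domain allowed for this owner type.
def attach_metadata(store, owner_type, kind, value)
  allowed = ALLOWED_METADATA.fetch(owner_type)
  unless allowed.include?(kind)
    raise ArgumentError, "#{kind} metadata is not allowed on #{owner_type}"
  end

  store.merge(kind => value)
end

file_metadata = attach_metadata({}, :file, :technical, { media_type: 'image/jpeg' })
work_metadata = attach_metadata({}, :work, :descriptive, { title: 'Example Book' })
```

Attaching `:descriptive` metadata to a `:file` raises an `ArgumentError` under this mapping, which is the constraint being discussed.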

jcoyne commented 10 years ago

In a simpler use case, one without Components involved (e.g. self-deposit), it's vital for the individual files to be able to support descriptive metadata. Where else would it go?

escowles commented 10 years ago

@jcoyne In the simple case, you'd just have a DigitalObject (with descriptive metadata) and a single File (with technical metadata + bitstream). If you have more than one File, wouldn't you want to have a Component for each of them?

jpstroop commented 10 years ago

Yeah, what @escowles said. Or Hydra::GenericFile < Sufia::GenericFile, but that would just get confusing.

jcoyne commented 10 years ago

@escowles So you are saying if a Component/DigitalObject ever has more than one GenericFile, then it should make a Component for each file?

escowles commented 10 years ago

@jcoyne Yes, if the Files are different enough to need different descriptive metadata, then I think there should be separate Components to hold that.

jpstroop commented 10 years ago

@jcoyne sorry, missed your question before:

If this sort of model is adopted, I do think we need to urge restraint about using hierarchy. Just because it's there doesn't mean you should always use it.

Sure, I suppose so--either by demonstration (in exemplar subclasses) or documentation, but most people with any experience should know that. And there are cases (archives come to mind, or maybe annotations on annotations (@azaroth42, is that a thing?)) where theoretically infinite recursion is actually what you want. We're just giving you enough rope....

azaroth42 commented 10 years ago

An institutional repository equivalent for the book example would be a paper that has several figure images that are supplied along with the final PDF. The professor provides the data that was used to create the figure, and the repository creates and manages a thumbnail for display on the paper's splash page.

Thus:

_:Paper a h:Work ;
  h:hasGenericFile <PDF> ;
  h:hasComponent _:figure1, _:figure2 ;
  h:hasMetadata <Metadata> .

_:figure1 a h:Component ;
  h:hasGenericFile <Image-Figure1>, <Thumbnail-Figure1>, <Excel-Figure1> ;
  h:hasMetadata <Figure1Metadata> .

azaroth42 commented 10 years ago

Agree that files should not have descriptive metadata, that should be applied to the Work or Component. If there's significant different descriptive metadata, then it should be a new Work/Component.
Files should be able to have their own rights metadata and technical metadata.
Thus there's no need for a component for every file. Components should group files together (per @jpstroop and my examples)
Question: Are collections/lists/sets of Works in scope for this discussion?
Annotations on annotations are definitely a thing, but I'm not certain that an annotation is a Work in this sense. Archives are a better use case here, IMO.

jpstroop commented 10 years ago

Question: Are collections/lists/sets of Works in scope for this discussion?

My sense is yes, and this is my problem with "Work" as a label. A Set or Collection is just a (sub)type of Work, at least in terms of what I'm proposing.

jpstroop commented 10 years ago

Annotations on annotations are definitely a thing, but I'm not certain that an annotation is a Work in this sense. Archives are a better use case here, IMO.

Why not? They have creators, provenance, dates, potentially rights...

azaroth42 commented 10 years ago

Question: Are collections/lists/sets of Works in scope for this discussion? My sense is yes, and this is my problem with "Work" as a label. A Set or Collection is just a (sub)type of Work, at least in terms of what I'm proposing.

I'm fine with that being the case, but I suspect @jcoyne may push back in terms of simplicity so wanted to make sure we're on the same page. #8 also probably cares about this.

escowles commented 10 years ago

@azaroth42 We always use a Component when we have multiple files, in part to group derivatives. So for your paper-and-figures example, we would do:

_:Paper a h:Work ;
  h:hasComponent _:paper1, _:figure1, _:figure2 ;
  h:hasMetadata <Metadata> .

_:paper1 a h:Component ;
  h:hasGenericFile <PDF-Paper1>, <Thumbnail-Paper1>, <LaTeX-Paper1> ;
  h:hasMetadata <Paper1Metadata> .

_:figure1 a h:Component ;
  h:hasGenericFile <Image-Figure1>, <Thumbnail-Figure1>, <Excel-Figure1> ;
  h:hasMetadata <Figure1Metadata> .

azaroth42 commented 10 years ago

Annotations on annotations are definitely a thing, but I'm not certain that an annotation is a Work in this sense. Archives are a better use case here, IMO. Why not? They have creators, provenance, dates, potentially rights...

I'm not against annotations being considered, tagging @jcoyne and #8 again.

They don't necessarily have any associated bitstreams, or even content in the case of a bookmark or highlight
They have a model already
The hierarchy of annotations annotating other annotations isn't conceptually part-of or member-of, it's more related-to. So the infinite recursion is less likely to be an issue as they wouldn't be nested within the same Work. (Modulo the collections as Works notion, and the possibility of sets of annotations)

azaroth42 commented 10 years ago

@escowles Gotcha. My thoughts there would be to treat the Work as the component for grouping the full representations together. It would also need some other relationship to related files such as additional material that shouldn't be grouped together like that.

_:Paper a h:Work ;
  h:hasComponent _:figure1, _:figure2 ;
  h:hasGenericFile <PDF-Paper1>, <Thumbnail-Paper1>, <LaTeX-Paper1> ;
  h:hasRelatedFile <AdditionalMaterial1> ;
  h:hasMetadata <Metadata> .

_:figure1 a h:Component ;
  h:hasGenericFile <Image-Figure1>, <Thumbnail-Figure1>, <Excel-Figure1> ;
  h:hasMetadata <Figure1Metadata> .

But I can see the utility of having the additional level of abstraction. I would prefer in your model to have the Figure be part of the paper component, rather than flattening out into the top level Work.

jcoyne commented 10 years ago

Question: Are collections/lists/sets of Works in scope for this discussion?

Yes, for some definitions of "collections/lists/sets". ;) We should probably move that discussion somewhere else.

mjgiarlo commented 10 years ago

I'm catching up on numerous threads at the same time so the confusion may be only mine, but I'll ask: I've seen some folks suggest that GenericFiles and Components are sufficiently similar and others suggest that the model we're developing needs Components in addition. Could someone point me at an IR-like use case where the Collection -> Work -> GenericFile model, as currently implemented in Worthwhile, wouldn't work?

azaroth42 commented 10 years ago

Work is a paper. The work has a representation which is a PDF. The Work contains a Figure. The Figure has two representations, one of which is an Image, the other is a CSV file.

The Work should not contain the PDF, the Image and the CSV as they are not all representations of the same Resource.
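
A small sketch of the distinction, assuming a hypothetical `Resource` struct (not a Hydra class) in which each resource lists only its own representations:

```ruby
# Hypothetical sketch: files are attached to the resource they represent.
Resource = Struct.new(:label, :representations, :parts)

figure = Resource.new('Figure 1', ['figure1.png', 'figure1.csv'], [])
paper = Resource.new('Paper', ['paper.pdf'], [figure])

# A flat list of files can still be derived when a simple view is wanted,
# but the reverse is not possible: a flat pile does not record which
# resource each file represents.
def all_files(resource)
  resource.representations + resource.parts.flat_map { |part| all_files(part) }
end

flat = all_files(paper)
```

In this sketch the paper's own representations contain only the PDF, while the flattened view lists all three files.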

mjgiarlo commented 10 years ago

Thanks for that, @azaroth42. At a conceptual level, I agree with you. From a pragmatic standpoint, I don't think many of our (Penn State, ScholarSphere, other current Sufia implementations) users will care about that distinction; they'd be happy to upload a new work that contains three files: the PDF, the image, and the CSV file. (So it'd be a work with descMD plus three GenericFiles, each with one bitstream.)

Is this use case currently supported by Hydrus or any of our other products? (I'm not trying to suggest it's not a relevant concern, @azaroth42, but I do wonder: if this isn't already implemented in one of our products, can we punt this off to a future phase once we've made progress on our short-term goals?)

azaroth42 commented 10 years ago

Okay, then how about server generated derivatives?

I think what you're proposing is that the Work would contain a big ole pile of bitstreams with no differentiation between them, and hence the thumbnail of the figure has the same status as the PDF of the paper ... it's just another bitstream in the pile.

In terms of Hydrus, we create a Collection per deposit "project" in which multiple Items can be created. Each Item can have multiple Files. All of which can have their own metadata.

mjgiarlo commented 10 years ago

@azaroth42 It sounds like what you're doing in Hydrus is very much in line with what I have in mind.

I'm proposing more or less this domain model:

https://docs.google.com/drawings/d/1eir0xboKb2B-YOilI9YHAX5S8361ti14mJS3diRu3xw/edit?usp=sharing

I pulled that together from an older, outdated diagram, so I may have gotten some bits of it wrong. (And note that I'm only showing datastreams for the GenericFile; I haven't yet added which datastreams would be on the Collection or the Work, but I think for both it'd be: descMD, rightsMD, properties.)

In short, I don't see a Work as a big pile because the stuff already in GenericFile allows you to distinguish between a file and its thumbnail(s). If we have use cases that require more robust derivative generation -- not just 1..n thumbs but perhaps preservation masters, etc. -- I can see needing to build in more flexibility (e.g., a more atomistic object model for GFs and derivative files).

awead commented 10 years ago

:+1: to Mike's diagram. We need more pictures! Also, @mjgiarlo's pseudo-ER diagram captures my understanding of the conceptual layout. Is this what others are seeing in their minds as well?

mjgiarlo commented 10 years ago

I think we may be moving towards @escowles' model:

https://cloud.githubusercontent.com/assets/856924/4593133/4ef830c4-5083-11e4-9ec7-261a9483eb7a.png

awead commented 10 years ago

@mjgiarlo yes, it does seem that way. Although my latest comment to #11 regarded the question of order within a group (Work or Component). How would you do ordering of pages in a book model? A sort field or something?

escowles commented 10 years ago

+1 to sort fields -- easier than rdf:List, etc.

azaroth42 commented 10 years ago

+1 to the option for sorting fields for administrative or curated collections.

However, for use cases such as user-created reading lists, favourites, galleries and so forth, we would need arbitrary ordering, e.g. as discussed in #18 and #17.
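
The two kinds of ordering can be contrasted in a short sketch. `Item`, `reading_list`, and the ids are hypothetical; the point is only that a sort field lives on each part, whereas an arbitrary order has to be stored as a separate list.

```ruby
# Hypothetical sketch contrasting a per-part sort field with an
# externally stored, arbitrary order.
Item = Struct.new(:id, :sort)

pages = [Item.new('p3', 3), Item.new('p1', 1), Item.new('p2', 2)]

# Curated order: every part carries its own sort value.
curated = pages.sort_by(&:sort)

# Arbitrary order: the sequence is kept outside the items, so the same
# items can appear in many lists with different orders.
reading_list = %w[p2 p3 p1]
by_id = pages.to_h { |item| [item.id, item] }
user_order = reading_list.map { |id| by_id.fetch(id) }
```

A single sort field gives each part exactly one position, which is why it suits a canonical page order but not several user-defined lists over the same items.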

jpstroop commented 10 years ago

If interoperability/shared models is a goal here, I think we should at least have a recommended way of implementing sorting when that's something you need to do. +1 ∞ for constituent parts (components, whatever) having a sort field being the recommended pattern.

That said, there may be cases where components have multiple sort orders or curated hierarchical structures that are more complicated than whole/part and don't need to be fit into the model we're developing (e.g. chapters in a book), or details that are more complex than simple sorting (e.g. "skip this page in a page turner"). If that's your use case then you probably already have some special metadata scheme or data structure that's relevant to your domain (e.g. METS structMap, MODS relatedItem, EAD c*, IIIF sequence/range) for realizing those alternatives, and you should stick that in a stream on the Work.

awead commented 10 years ago

@azaroth42 @jpstroop that seems to be developing over at #11 and yes, definitely optional, even mixing and matching, like having an unsorted parent component or "work" containing a sorted component.

scherztc commented 9 years ago

If everyone is in agreement on the graphical representation of this use case (Multi Page Text) at https://docs.google.com/document/d/1o-Iq1oKN_W5NXXDQC81pxkhibOz_AhZlY7IShxPTR5M/edit#

then I would like to close this issue. Thanks.

samvera / hydra-works

Use Case: Page level representation (parts of Works) #9