Closed azaroth42 closed 9 years ago
I think we have a few potential use cases which need to be fleshed out:
What about technical, provenance, etc. metadata about each of the 'representations'? At a minimum there are checksums and a media type for each, but what about software, hardware, events, etc.? If each representation were its own GenericFile, this wouldn't be a problem and the sort order (or at least the canonical one) could be a property of that.
Otherwise you're likely going to wind up hacking the property/stream relationships into their respective names (e.g. jp2_md5_checksum) and... hopefully that speaks for itself :smile:.
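To make the contrast with names like jp2_md5_checksum concrete, here is a minimal Ruby sketch. It assumes a hypothetical FileRepresentation struct (not a Sufia or Hydra class) in which each representation carries its own media type, checksum, and sort position, so nothing has to be encoded into a property name on the parent.

```ruby
require "digest/md5"

# Hypothetical value object: one per representation (JP2, TIFF, ...).
# None of these names come from Sufia or Hydra; they only illustrate the idea.
FileRepresentation = Struct.new(:label, :media_type, :content, :sort_order) do
  # Technical metadata is derived from, and stored with, the representation
  # itself, so the parent needs no property such as jp2_md5_checksum.
  def md5_checksum
    Digest::MD5.hexdigest(content)
  end
end

representations = [
  FileRepresentation.new("access copy", "image/jp2", "jp2 bytes", 2),
  FileRepresentation.new("master", "image/tiff", "tiff bytes", 1)
]

# The canonical order is just a property of each representation.
representations.sort_by(&:sort_order).each do |rep|
  puts "#{rep.label} (#{rep.media_type}): #{rep.md5_checksum}"
end
```

Software, hardware, or event metadata could be added as further members of the same struct without touching the parent object.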
@jpstroop Sounds good, submit a use case for that?
@jpstroop I think I get what you are saying: Representations and Parts could all be the same entity. Right?
Yes, I think so. For example, a Representation of a Book has many Representations of a Page.
@escowles @mcritchlow is a "Part" as described above, what you call a "Component"?
@jpstroop I like that. I started thinking about RDF list and I could hear @no-reply sighing 1700 miles away.
(and, furthermore, Multi-Volume Booksets have Books that sort in a certain order, etc.).
@jpstroop is wary of unnecessary recursion, but I think it feels right in this case.
Will get a use case in tonight or tomorrow--should it live in this thread or start a new issue?
@jpstroop New issue please! Even better is to put one in ./use-cases as described in the CONTRIBUTING document.
@jcoyne Yes, Part as described here aligns closely with our Components. We were envisioning that our Components in the Hydra:Works context will align pretty well with GenericFile.
@jpstroop "If each representation were its own GenericFile, this wouldn't be a problem and the sort order (or at least the canonical one) could be a property of that."
:+1: In our currently proposed new model, we have partName and partNumber properties for this. At the moment they're slated as local ucsd properties, but perhaps there's a convention to agree on here.
I disagree with @jpstroop (for once! mark your calendars, folks!) There needs to be a separation of Work, Representation-of-Work, Page and Representation-of-Page because X to Representation-of-X is a 1:N relationship, in the same way that Work to Page is 1:N.
In Pseudo-RDF:
_:book a h:Work ;
    h:hasBitstream <PDF> ;
    h:hasComponentList [ _:page1, _:page2, _:page3] ;
    h:isDescribedBy <book-metadata> .
_:page1 a h:Component ;
    h:hasBitstream <JPG> ;
    h:hasBitstream <ALTO-XML> ;
    h:hasBitstream <masterTiff> ;
    h:isDescribedBy <page-metadata> .
<JPG> a h:Bitstream ;
    h:isDescribedBy <ImageTechnicalMetadata> .
...
In the above,
Will graffle it up tomorrow.
This brings up two questions we've been grappling with in our data model:
@escowles I think you are getting into implementation detail at this point, which we might want to be wary of. However, you've enticed me, so in my opinion Components should be a completely optional part of the model. I think they make the simplest models overly complex. I'm thinking of a use case like: "There is a university publication and the only representation of it is a single PDF"
Coming from #8, here. I don't think @azaroth42 and I are in disagreement...only that we're thinking at different levels of detail. (Note that I'm keeping the name Hydra::Work here, but I really don't think that's the best name for it -- maybe DigitalObject?).
Consider this, in pseudo-Ruby:
Hydra::Work
  has_many :files, class: Sufia::GenericFile
  has_many :things, class: Hydra::Work
  has_metadata ....
Or, in pseudo-RDF (based on @azaroth42's example):
# Book and Component are extensions of Work
:Book rdfs:subClassOf h:Work .
:Component rdfs:subClassOf h:Work .
# So an instance could look like:
_:book a :Book ;
    h:hasGenericFile <PDF> ;
    h:hasComponent _:page1 ;
    h:hasComponent _:page2 ;
    h:hasMetadata <book-metadata> .
_:page1 a :Component ;
    h:hasGenericFile <JPG> ;
    h:hasGenericFile <ALTO-XML> ;
    h:hasGenericFile <masterTiff> ;
    h:hasMetadata <page-metadata> ;
    x:sort 1 .
<PDF> a s:GenericFile ;
    h:hasMetadata <ImageTechnicalMetadata> .
<JPG> a s:GenericFile ;
    h:hasMetadata <ImageTechnicalMetadata> .
Again, I'm wary of recursion, but in this case Hydra::Work itself would be a somewhat dumb base class that only makes recursion possible for the sake of letting extensions of it apply constraints that fit their domain, and behaviours like sorting could be mixed in as needed, etc.
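The "dumb base class plus extensions" idea can be sketched in plain Ruby. This is only an illustration under assumptions: Work, Book, Page, and OrderedComponents are hypothetical stand-ins, not ActiveFedora or Hydra classes, and the "pages are leaves" rule is an invented example of a domain constraint.

```ruby
# Hypothetical stand-in for Hydra::Work: it only makes nesting possible.
class Work
  attr_reader :title, :files, :components

  def initialize(title)
    @title = title
    @files = []
    @components = []
  end

  # Returns self so calls can be chained.
  def add_component(work)
    @components << work
    self
  end

  # Walks the tree depth-first, yielding each nested Work and its depth.
  def each_descendant(depth = 1, &block)
    components.each do |component|
      block.call(component, depth)
      component.each_descendant(depth + 1, &block)
    end
  end
end

# Sorting is mixed in only where a domain needs it.
module OrderedComponents
  def sorted_components
    components.sort_by(&:sort)
  end
end

class Book < Work
  include OrderedComponents
end

class Page < Work
  attr_reader :sort

  def initialize(title, sort)
    super(title)
    @sort = sort
  end

  # Assumed domain constraint applied by the extension: pages are leaves.
  def add_component(_work)
    raise ArgumentError, "a Page cannot contain components"
  end
end

book = Book.new("Example book")
book.add_component(Page.new("page 2", 2)).add_component(Page.new("page 1", 1))
puts book.sorted_components.map(&:title).inspect
```

The base class stays free of policy; each subclass decides how much of the available recursion it actually permits.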
Is this too high level / abstract?
@jpstroop That's a wonderful example. This is very close to what I'm thinking. Only I hadn't considered making Component a subclass of Work. I'm wondering if anyone has a counter example for this? If this sort of model is adopted, I do think we need to urge restraint about using hierarchy. Just because it's there doesn't mean you should always use it.
I think my only issue with @jpstroop's example is that sufia:GenericFile seems overly heavy for the task at hand. I don't think the File class needs to have any descriptive metadata, but instead should just be a bitstream and associated technical metadata.
@jpstroop @jcoyne I think the inheritance might work out better the other way around: DigitalObject/Work could be a subclass of Component that adds links to Collections and whatever else is reserved for DigitalObject/Work alone.
Everyone is going to have parts of things that have more or less the same requirements as the things themselves.... Book sets, Books, Pages, Annotations, Archival Collections with Series/Files/Items.
@escowles, I agree, descriptive metadata about a GenericFile wouldn't Respect The Model :tm: in this case. Could it be constrained by defining a few different hasMetadata subproperties (descriptive, technical, rights, provenance are the ones that come to mind) with different domains? I don't know if I like the idea, but it's one way of solving the problem.
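As a rough illustration of constraining metadata by domain, here is a Ruby sketch. The four kinds come from the comment above; the METADATA_DOMAINS table, the resource type symbols, and the particular assignments are all assumptions made for the example, not an agreed model.

```ruby
# Hypothetical mapping from each hasMetadata subproperty to the resource
# types (its "domain") allowed to carry it. The assignments are assumed.
METADATA_DOMAINS = {
  descriptive: [:work, :component],
  technical:   [:generic_file],
  rights:      [:work, :component, :generic_file],
  provenance:  [:work, :component, :generic_file]
}.freeze

# True when the given kind of metadata may be attached to the resource type.
# Raises KeyError for a kind that is not in the table.
def metadata_allowed?(kind, resource_type)
  METADATA_DOMAINS.fetch(kind).include?(resource_type)
end

puts metadata_allowed?(:descriptive, :generic_file)
puts metadata_allowed?(:technical, :generic_file)
```

Under this assumed table, descriptive metadata on a GenericFile is rejected while technical metadata is accepted, which is the constraint the comment describes.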
In a simpler use case, one without Components involved (e.g. self-deposit), it's vital for the individual files to be able to support descriptive metadata. Where else would it go?
@jcoyne In the simple case, you'd just have a DigitalObject (with descriptive metadata) and a single File (with technical metadata + bitstream). If you have more than one File, wouldn't you want to have a Component for each of them?
Yeah, what @escowles said. Or Hydra::GenericFile < Sufia::GenericFile, but that would just get confusing.
@escowles So you are saying if a Component/DigitalObject ever has more than one GenericFile, then it should make a Component for each file?
@jcoyne Yes, if the Files are different enough to need different descriptive metadata, then I think there should be separate Components to hold that.
@jcoyne sorry, missed your question before:
"If this sort of model is adopted, I do think we need to urge restraint about using hierarchy. Just because it's there doesn't mean you should always use it."
Sure, I suppose so--either by demonstration (in exemplar subclasses) or documentation, but most people with any experience should know that. And there are cases (archives come to mind, or maybe annotations on annotations (@azaroth42, is that a thing?)) where theoretically infinite recursion is actually what you want. We're just giving you enough rope....
An institutional repository equivalent for the book example would be a paper that has several figure images that are supplied along with the final PDF. The professor provides the data that was used to create the figure, and the repository creates and manages a thumbnail for display on the paper's splash page.
Thus:
_:Paper a h:Work ;
    h:hasGenericFile <PDF> ;
    h:hasComponent _:figure1, _:figure2 ;
    h:hasMetadata <Metadata> .
_:figure1 a h:Component ;
    h:hasGenericFile <Image-Figure1>, <Thumbnail-Figure1>, <Excel-Figure1> ;
    h:hasMetadata <Figure1Metadata> .
Question: Are collections/lists/sets of Works in scope for this discussion?
My sense is yes, and this is my problem with "Work" as a label. A Set or Collection is just a (sub)type of Work, at least in terms of what I'm proposing.
Annotations on annotations are definitely a thing, but I'm not certain that an annotation is a Work in this sense. Archives are a better use case here, IMO.
Why not? They have creators, provenance, dates, potentially rights...
"Question: Are collections/lists/sets of Works in scope for this discussion? My sense is yes, and this is my problem with "Work" as a label. A Set or Collection is just a (sub)type of Work, at least in terms of what I'm proposing."
I'm fine with that being the case, but I suspect @jcoyne may push back in terms of simplicity so wanted to make sure we're on the same page. #8 also probably cares about this.
@azaroth42 We always use a Component when we have multiple files, in part to group derivatives. So for your paper-and-figures example, we would do:
_:Paper a h:Work ;
    h:hasComponent _:paper1, _:figure1, _:figure2 ;
    h:hasMetadata <Metadata> .
_:paper1 a h:Component ;
    h:hasGenericFile <PDF-Paper1>, <Thumbnail-Paper1>, <LaTeX-Paper1> ;
    h:hasMetadata <Paper1Metadata> .
_:figure1 a h:Component ;
    h:hasGenericFile <Image-Figure1>, <Thumbnail-Figure1>, <Excel-Figure1> ;
    h:hasMetadata <Figure1Metadata> .
"Annotations on annotations are definitely a thing, but I'm not certain that an annotation is a Work in this sense. Archives are a better use case here, IMO. Why not? They have creators, provenance, dates, potentially rights..."
I'm not against annotations being considered, tagging @jcoyne and #8 again.
@escowles Gotcha. My thoughts there would be to treat the Work as the component for grouping the full representations together. It would also need some other relationship to related files such as additional material that shouldn't be grouped together like that.
_:Paper a h:Work ;
    h:hasComponent _:figure1, _:figure2 ;
    h:hasGenericFile <PDF-Paper1>, <Thumbnail-Paper1>, <LaTeX-Paper1> ;
    h:hasRelatedFile <AdditionalMaterial1> ;
    h:hasMetadata <Metadata> .
_:figure1 a h:Component ;
    h:hasGenericFile <Image-Figure1>, <Thumbnail-Figure1>, <Excel-Figure1> ;
    h:hasMetadata <Figure1Metadata> .
But I can see the utility of having the additional level of abstraction. I would prefer in your model to have the Figure be part of the paper component, rather than flattening out into the top level Work.
"Question: Are collections/lists/sets of Works in scope for this discussion?"
Yes, for some definitions of "collections/lists/sets". ;) We should probably move that discussion somewhere else.
I'm catching up on numerous threads at the same time so the confusion may be only mine, but I'll ask: I've seen some folks suggest that GenericFiles and Components are sufficiently similar and others suggest that the model we're developing needs Components in addition. Could someone point me at an IR-like use case where the Collection -> Work -> GenericFile model, as currently implemented in Worthwhile, wouldn't work?
Work is a paper. The work has a representation which is a PDF. The Work contains a Figure. The Figure has two representations, one of which is an Image, the other is a CSV file.
The Work should not contain the PDF, the Image and the CSV as they are not all representations of the same Resource.
Thanks for that, @azaroth42. At a conceptual level, I agree with you. From a pragmatic standpoint, I don't think many of our (Penn State, ScholarSphere, other current Sufia implementations) users will care about that distinction; they'd be happy to upload a new work that contains three files: the PDF, the image, and the CSV file. (So it'd be a work with descMD plus three GenericFiles, each with one bitstream.)
Is this use case currently supported by Hydrus or any of our other products? (I'm not trying to suggest it's not a relevant concern, @azaroth42, but I do wonder: if this isn't already implemented in one of our products, can we punt this off to a future phase once we've made progress on our short-term goals?)
Okay, then how about server generated derivatives?
I think what you're proposing is that the Work would contain a big ole pile of bitstreams with no differentiation between them, and hence the thumbnail of the figure has the same status as the PDF of the paper ... it's just another bitstream in the pile.
In terms of Hydrus, we create a Collection per deposit "project" in which multiple Items can be created. Each Item can have multiple Files. All of which can have their own metadata.
@azaroth42 It sounds like what you're doing in Hydrus is very much in line with what I have in mind.
I'm proposing more or less this domain model:
https://docs.google.com/drawings/d/1eir0xboKb2B-YOilI9YHAX5S8361ti14mJS3diRu3xw/edit?usp=sharing
I pulled that together from an older, outdated diagram, so I may have gotten some bits of it wrong. (And note that I'm only showing datastreams for the GenericFile; I haven't yet added which datastreams would be on the Collection or the Work, but I think for both it'd be: descMD, rightsMD, properties.)
In short, I don't see a Work as a big pile because the stuff already in GenericFile allows you to distinguish between a file and its thumbnail(s). If we have use cases that require more robust derivative generation -- not just 1..n thumbs but perhaps preservation masters, etc. -- I can see needing to build in more flexibility (e.g., a more atomistic object model for GFs and derivative files).
:+1: to Mike's diagram. We need more pictures! Also, @mjgiarlo's pseudo-ER diagram captures my understanding of the conceptual layout. Is this what others are seeing in their minds as well?
I think we may be moving towards @escowles' model:
https://cloud.githubusercontent.com/assets/856924/4593133/4ef830c4-5083-11e4-9ec7-261a9483eb7a.png
@mjgiarlo yes, it does seem that way. Although my latest comment to #11 regarded the question of order within a group (Work or Component). How would you do ordering of pages in a book model? A sort field or something?
+1 to sort fields -- easier than rdf:List, etc.
+1 to the option for sorting fields for administrative or curated collections.
However, for use cases such as user-created reading lists, favourites, galleries and so forth, we would need arbitrary ordering, e.g. as discussed in #18 and #17.
If interoperability/shared models is a goal here, I think we should at least have a recommended way of implementing sorting when that's something you need to do. +1 ∞ for constituent parts (components, whatever) having a sort field being the recommended pattern.
That said, there may be cases where components have multiple sort orders or curated hierarchical structures that are more complicated than whole/part and don't need to be fit into the model we're developing (e.g. chapters in a book), or details that are more complex than simple sorting (e.g. "skip this page in a page turner"). If that's your use case then you probably already have some special metadata scheme or data structure that's relevant to your domain (e.g. METS structMap, MODS relatedItem, EAD c*, IIIF sequence/range) for realizing those alternatives, and you should stick that in a stream on the Work.
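For the simple case, the recommended sort-field pattern might look like the following Ruby sketch. Part and in_reading_order are hypothetical names, and placing parts without a sort value last, in their original relative order, is an assumed policy rather than anything agreed in this thread.

```ruby
# Hypothetical constituent part with an optional integer sort field.
Part = Struct.new(:label, :sort)

# Orders parts by their sort field. Parts without a sort value go last and
# keep their original relative order (an assumed policy).
def in_reading_order(parts)
  pairs = parts.each_with_index.sort_by do |part, index|
    [part.sort.nil? ? 1 : 0, part.sort || 0, index]
  end
  pairs.map(&:first)
end

pages = [
  Part.new("back cover", nil),
  Part.new("page 2", 2),
  Part.new("page 1", 1)
]

puts in_reading_order(pages).map(&:label).inspect
```

Anything richer than this, such as multiple orders or skipped pages, would still belong in a domain-specific structure as described above.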
@azaroth42 @jpstroop that seems to be developing over at #11 and yes, definitely optional, even mixing and matching, like having an unsorted parent component or "work" containing a sorted component.
If everyone is in agreement on the graphical representation of this use case (Multi Page Text) at https://docs.google.com/document/d/1o-Iq1oKN_W5NXXDQC81pxkhibOz_AhZlY7IShxPTR5M/edit# then I would like to close this issue. Thanks.
A multi-page text (such as a book, magazine, article, or similar) may be represented by a variety of different resources with different uses. Contributed towards refining #8
Given a simple object that could be rendered in a page turning interface: