samvera / hydra-works

A ruby gem implementation of the PCDM Works domain model based on the Samvera software stack

Use Case: Research Dataset #11

Closed escowles closed 9 years ago

escowles commented 10 years ago

A research dataset containing a set of files organized into top-level categories of preparatory materials, raw data files, statistics, and visualization images, with multiple files in each category. The visualization images are further organized in a hierarchy by type, and then by X/Y/Z axis.

jcoyne commented 10 years ago

Is "Organized by Hierarchy" an implementation detail? Can you provide the motivation for the organizing into this structure? Is this structure a standard for all research datasets? Can you elaborate about "X/Y/Z axis?" An example project structure would be enlightening.

escowles commented 10 years ago

Organized by hierarchy is an implementation detail, but one that needs support from the model. I don't think these conventions are common enough to warrant modeling -- I'm happy to have a generic Object/Component/File classes and use descriptive metadata to describe the hierarchy.

Here's an example component hierarchy:

Each of the lowest-level components would have a file attached.

awead commented 10 years ago

@escowles are you saying "X/Y/Z" in the sense that you have a 3D hierarchy with 1:N relationships to each so as to encompass all possible implementation needs? Did that question even make sense?

escowles commented 10 years ago

@awead Yes, the dataset in question has spatial data and visualizations of it in 3 axes -- we model that in the hierarchy above. But I want to note that I'm not suggesting that we create Ruby classes to model spatial data. We use generic Component classes with titles like "Visualizations of X-axis", "Visualizations of Y-axis", etc. to label the containers of files.

awead commented 10 years ago

@escowles So if I follow, each axis is an instance of the abstract Component class? I'm using the term "instance" vaguely here.

escowles commented 10 years ago

@awead Yes, each axis would be an instance of the Component class, and contain one or more files that were visualizations in that axis. So "Visualizations of X-axis" would be a Component containing the Components "X-axis file 1" and "X-axis file 2". The Component "X-axis file 1" would contain Files of source image (e.g., high-res TIFF), a thumbnail JPEG, etc.
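
The containment described in this comment can be sketched in plain Ruby. This is only an illustration under stated assumptions: `Component`, `FileNode`, `add_component`, `add_file`, and the file names are hypothetical stand-ins, not the hydra-works API.

```ruby
# Hypothetical plain-Ruby stand-ins; not the hydra-works API.
FileNode = Struct.new(:name)

class Component
  attr_reader :title, :components, :files

  def initialize(title)
    @title = title
    @components = []
    @files = []
  end

  # Add a child component and return it so it can be populated further.
  def add_component(child)
    @components << child
    child
  end

  # Attach a file to this component; returns self to allow chaining.
  def add_file(name)
    @files << FileNode.new(name)
    self
  end

  # Collect every file name at or below this component, depth first.
  def all_file_names
    files.map(&:name) + components.flat_map(&:all_file_names)
  end
end

# Labels come from titles, not from dedicated spatial-data classes.
x_axis = Component.new("Visualizations of X-axis")
file1 = x_axis.add_component(Component.new("X-axis file 1"))
file1.add_file("x1-source.tif").add_file("x1-thumbnail.jpg") # assumed file names
file2 = x_axis.add_component(Component.new("X-axis file 2"))
file2.add_file("x2-source.tif")

puts x_axis.all_file_names.inspect
```

Because every `Component` holds both `components` and `files`, the same class covers containers such as "Visualizations of X-axis" and leaf-level items such as "X-axis file 1".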

azaroth42 commented 10 years ago

+1 to this use case, and +1 to the overall model (give or take a 0.1 detail here and there) :+1:

mcritchlow commented 10 years ago

@awead if it's helpful, here's an example of what @escowles has described in our current system http://library.ucsd.edu/dc/object/bb2322141x

jeremyf commented 10 years ago

Each of the lowest-level components would have a file attached.

  • @escowles

Can we create a sub-component off of the lowest-level component that has a file? In other words, are we saying that we can have files attached at any level of the graph? Or are we stating that we have a Node and Leaf construct that once a Leaf you can't become a Node?

escowles commented 10 years ago

@jeremyf I think the model should support nodes with both child nodes and files. At UCSD we don't typically do that, but it seems like a good idea to support it. If we were modelling a filesystem, for example, that would allow both subdirectories and files.

azaroth42 commented 10 years ago

+1 to any level being able to have associated bitstreams

jeremyf commented 10 years ago

:+1: @escowles excellent, it's different than what we've been working from, but that is an implementation detail (it's easier to hide the ability to add a bitstream to a Curate::Work than it would be to add that functionality)

jeremyf commented 10 years ago

Can someone help translate @escowles's issue into a pull request? Someone with an ICLA?

mjgiarlo commented 10 years ago

I'm seeing some enthusiasm about @escowles 's model. And he said:

I'm happy to have a generic Object/Component/File classes and use descriptive metadata to describe the hierarchy

Does this mean we're good with the Work (holds descMD) -> GenericFile (holds bitstream, and optionally holds its own file-specific descMD) model, since I believe @escowles has said that maps well to his model? If so, that seems like it'd bring together Sufia, Worthwhile, Curate, and UCSD, and possibly a bunch more of us without introducing a bunch of new concerns and concepts.

escowles commented 10 years ago

I think this is existing functionality, but want to confirm: Sufia & Worthwhile GenericFiles can link to each other, right? That's the one thing we'd need to encode a hierarchy with a flat set of GenericFiles.
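
As a rough sketch of how links between otherwise flat records could encode a hierarchy, assuming each record carries a hypothetical `parent_id` link (the `LinkedFile` struct and `build_tree` helper are illustrative names, not Sufia or Worthwhile code):

```ruby
# Hypothetical sketch: `id` and `parent_id` are assumed link fields.
LinkedFile = Struct.new(:id, :title, :parent_id)

# Rebuild a nested { title => children } hash from a flat list of linked records.
def build_tree(records, parent_id = nil)
  records
    .select { |record| record.parent_id == parent_id }
    .to_h { |record| [record.title, build_tree(records, record.id)] }
end

flat = [
  LinkedFile.new(1, "Visualizations", nil),
  LinkedFile.new(2, "Visualizations of X-axis", 1),
  LinkedFile.new(3, "X-axis file 1", 2),
  LinkedFile.new(4, "Raw data files", nil)
]

tree = build_tree(flat)
puts tree.inspect
```

The storage stays a flat set; the hierarchy exists only in the links, which is why the ability to link records is the one requirement named here.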

jpstroop commented 10 years ago

Sorry, I'm catching up, but what does

Work (holds descMD) -> GenericFile

mean?

I'm happy to have a generic Object/Component/File classes and use descriptive metadata to describe the hierarchy

Are you saying that a Work couldn't contain a Work? If I'm correct, I'd be :disappointed: if after all this we find ourselves back at requiring METS/MODS/FOXML/EAD/whatever (including an RDF rendition thereof) to impose order rather than letting the model itself reflect relationships that are idiomatic to the constituent parts (streams, constituent models) that make up the object.

Other/alternative orders or hierarchies, sure, to me that's what descriptive metadata is for, but if there are relationships that are integral to the parts that comprise the whole, I think they should be reflected in the model itself.

Apologies if I'm misreading.

mjgiarlo commented 10 years ago

@escowles I'm not sure we've built out the capability to make those links -- unless this is what the recently excised Worthwhile LinkedResources do -- but IMO that is "a small ask" and a reasonable addition to the functionality we already have if that is the cost of making our current model UCSD-compliant!

mjgiarlo commented 10 years ago

@jpstroop I'm not saying that it's inconceivable that a Work contain another Work -- I'm just not sure how many of our IR-like use cases require this functionality currently. As far as the first phase of Hydra::Works goes, I'm in favor of restricting the scope to IR-like use cases and solving for the 80%. So if we have those use cases, and they seem like commonly needed use cases, let's include Works containing Works in the initial model. If not, I might suggest we defer to the next phase, once we've got a common model for the Sufia/Worthwhile/Curate apps out there.

mjgiarlo commented 10 years ago

(Alternatively, I think it may also be OK to allow this (Works containing Works) in the model we develop if we also provide some guidance on how implementations like ScholarSphere might avoid/ignore/hide/disallow this complexity.)

azaroth42 commented 10 years ago

If the intent is only to solve the very basic single list of files associated in an unordered set, please let's rename it far far away from Work or any of the other terms that imply there's a data model behind it.

I suggest Hydra::BasicGroupOfFiles

jpstroop commented 10 years ago

So if we have those use cases, and they seem like commonly needed use cases, let's include Works containing Works in the initial model. If not, I might suggest we defer to the next phase, once we've got a common model for the Sufia/Worthwhile/Curate apps out there.

Well...I have a PR in for one use case (or four, depending on how you look at it), all of which we've made a dog's dinner of w/ METS (valid XML != good modeling; I could show you but it would burn your eyes. :fire: :sunglasses:).

... I think it may also be OK to allow this (Works containing Works) in the model we develop if we also provide some guidance on how implementations like ScholarSphere might avoid/ignore/hide/disallow this complexity.

Absolutely! @jcoyne said the same thing here.

Maybe there's a Hydra::IRWork model that extends Hydra::Work to include validations (or whatever the best approach is) to keep it from ever including a Work.
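
One way such a constrained subclass might look, sketched in plain Ruby. The `Work`, `IRWork`, and `GenericFile` classes here are hypothetical and only borrow names from the discussion; `add_member` and `validate_member!` are assumed helpers, and the real models would be built on ActiveFedora rather than plain objects.

```ruby
# Hypothetical sketch; class names mirror the discussion, not a released API.
class Work
  attr_reader :title, :members

  def initialize(title)
    @title = title
    @members = []
  end

  # Validate, then append; returns self so calls can be chained.
  def add_member(member)
    validate_member!(member)
    @members << member
    self
  end

  private

  # The base model accepts anything, including other Works.
  def validate_member!(_member); end
end

class GenericFile
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

class IRWork < Work
  private

  # Constrained subclass: refuse to nest Works.
  def validate_member!(member)
    raise ArgumentError, "IRWork cannot contain a Work" if member.is_a?(Work)
  end
end

nested = Work.new("Parent").add_member(Work.new("Child"))
flat = IRWork.new("Deposit").add_member(GenericFile.new("paper.pdf"))

rejected = begin
  flat.add_member(Work.new("Child"))
  false
rescue ArgumentError
  true
end

puts rejected
```

The recursion lives in the base class, and an IR-style application opts into the restriction by using the subclass, which matches the "allow it in the model, let implementations disallow it" position taken above.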

escowles commented 10 years ago

I don't think DigitalObjectSlashWorkSlashWhatever -> GenericFile -> bitstream is just a single unordered set of files.

IMHO, this is not just simpler than having infinite recursion of GenericFiles/Components, it's also more flexible since it can express relationships other than containment.

mjgiarlo commented 10 years ago

@azaroth42 I was thinking the intent, based on what I was hearing at the Sufia Futures discussion on Friday, was to come up with a model that can underlie Hydrus-based, Sufia-based, Worthwhile-based, and Curate-based apps. Whether we call it a Work or a BasicGroupOfFiles or an IRWork, how good of a fit do you judge this for Hydrus's needs?

mjgiarlo commented 10 years ago

What @escowles said was more articulate and more succinct than what I was saying.

jpstroop commented 10 years ago

Would the DigitalObjectSlashWorkSlashWhatever -> GenericFile -> bitstream approach mean you couldn't use the AF API to manage those 'more flexible' relationships?

azaroth42 commented 10 years ago

I would need to defer to other Stanford folk on the appropriateness for Hydrus. Once there's a proposal, I'm happy to take it back and discuss with them :)

mjgiarlo commented 10 years ago

@azaroth42 Fair enough!

@jpstroop I would think those relationships would be manageable via AF but I defer to folks whose heads are in the code more frequently than mine. @escowles @jcoyne etc.

jpstroop commented 10 years ago

@mjgiarlo @escowles , I see what you're both saying; it just feels like I'd wind up with the mess I already have if relationships had to be managed in a separate stream rather than being integral to the model(s).

As I just said, but in a different way, why not allow the recursion at a higher level of abstraction, and have a subclass for those who want more constraints (e.g. Object -> File -> Stream)?

escowles commented 10 years ago

@jpstroop I think you should be able to extend the default model and add new relationships that are managed with the AF API. But @jcoyne and @no-reply would be the authorities on that. IMHO, it makes sense to keep the base model as simple as possible, and let people extend it to add new relationships, constraints, etc.

jpstroop commented 10 years ago

Simple, yes, absolutely :smile:. But I'd also like to see it be flexible enough to handle recursion out of the box without having to extend anything.

jcoyne commented 10 years ago

I think having recursive Components in the first version is probably a good idea. It seems like this would be hard to add later down the road. That said, I think some applications (like self-deposit) will not support recursion, and thus data from an application that does use recursion may not be interoperable with applications that don't. I guess this is a problem of some applications not caring about supporting the whole spec.

escowles commented 10 years ago

In my mind, there are three related modeling exercises I'm thinking about in the near term:

  1. This Works discussion
  2. Access control (where the focus is on finding a way forward we can all accept)
  3. A broader modeling discussion about relationships between Works and authority records and possibly other kinds of external records like PREMIS-style events, geo references, etc.

I had been thinking of Component recursion as part of that last one, but now that I lay it out, I think that it belongs here, so I'm changing my mind and I agree with @jpstroop that we should have Component recursion out of the box.

So I think the model looks something like this: [figure: coll-work-comp-file]

jcoyne commented 10 years ago

@escowles thanks for the illustration. Very helpful. In this figure can you explain how Work is different from Component?

escowles commented 10 years ago

@jcoyne The only difference at this level is the link with Collections. I suspect there are other differences that will crop up in applications, such as requiring specific metadata fields to be populated.

jpstroop commented 10 years ago

So could Work just be a refined Component that has (belongs_to) a Collection?

jcoyne commented 10 years ago

@jpstroop No, a work may have many Collections.

jpstroop commented 10 years ago

Gotcha. I agree but was afraid to suggest it. :wink:

mjgiarlo commented 10 years ago

OK, I'm down with the @escowles domain model, if I'm reading it correctly that adopters can choose to ignore Components entirely if they wish to do so.

jcoyne commented 10 years ago

@mjgiarlo If you look closely in that figure, GenericFiles don't have descriptiveMetadata, so for something like Sufia we'd want a 1:1 relationship between Files and Components, so there'd be a place to hang descriptions of files.
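
A minimal sketch of that 1:1 pairing, assuming files carry only a bitstream reference and descriptions live on a wrapping component. `BitstreamFile`, `DescribedComponent`, `wrap_files`, and the sample file names are hypothetical and are not taken from Sufia.

```ruby
# Hypothetical names; a sketch of the 1:1 pairing, not the Sufia implementation.
BitstreamFile = Struct.new(:filename)

class DescribedComponent
  attr_reader :metadata, :file

  def initialize(file, metadata = {})
    @file = file
    @metadata = metadata
  end
end

# Wrap each bare file in exactly one component that carries its description.
def wrap_files(files_with_metadata)
  files_with_metadata.map do |filename, metadata|
    DescribedComponent.new(BitstreamFile.new(filename), metadata)
  end
end

components = wrap_files({
  "raw-data.csv" => { title: "Raw data" },
  "stats.csv" => { title: "Summary statistics" }
})

components.each do |component|
  puts "#{component.file.filename}: #{component.metadata[:title]}"
end
```

The file object itself stays free of descriptive metadata, so the description always has exactly one place to live.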

azaroth42 commented 10 years ago

For Stanford, we're much happier with Esme's model than no components at all, or only a single level. There are some slight tweaks for consideration, but they can go in a separate thread if we can agree on this as a baseline.

mjgiarlo commented 10 years ago

@jcoyne That seems reasonable to me.

jpstroop commented 10 years ago

This looks like a promising convergence!

awead commented 10 years ago

@escowles this smacks of EAD. Not in a bad way, though. :smile: It might be good to use this model in different expressions, e.g. Hydrus, Worthwhile and Sufia. Also, there are implied requirements. If I want to deposit one file, I have to have a Work, but I don't have to have a collection? So, are these assertions correct:

azaroth42 commented 10 years ago

I think that all of @awead's assertions are correct :)

awead commented 10 years ago

Also, another potential :elephant: in the room, is anything in an order? I'm seeing discussion in #18. This would include:

mjgiarlo commented 10 years ago

:+1: to the @awead assertions.

I don't have any requirements about ordering but I am sure that others do. :)

mjgiarlo commented 10 years ago

@awead Whatever data model we wind up ratifying in the coming weeks, it would be good to sit down and map how we'll migrate our existing ScholarSphere and ArchiveSphere data into this model. It doesn't look like it'll be that hard (#famouslastwords) at the moment...

(Drat, there's no skunk emoji.)

escowles commented 10 years ago

:+1: to @awead's assertions, and I think all of those ordering scenarios should be supported. I think ordering components and files via a sort property is the easiest, lightest way to do that. Sorting works within collections is a different issue (because many-to-many), so there probably needs to be something like the Ordered List Ontology for handling ordered collections (playlist/etc.).
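
The sort-property approach for components and files might look roughly like this. It assumes a hypothetical integer `sort` attribute on each member; the `Member` struct and `ordered` helper are illustrative only, and the many-to-many collection case is deliberately left out.

```ruby
# Hypothetical sketch: `sort` is an assumed integer property.
Member = Struct.new(:title, :sort)

# Order members by their sort value; members without one go last,
# keeping their original relative order.
def ordered(members)
  members.each_with_index
         .sort_by { |member, index| [member.sort.nil? ? 1 : 0, member.sort || 0, index] }
         .map(&:first)
end

members = [
  Member.new("Visualizations", 4),
  Member.new("Raw data files", 2),
  Member.new("Unsorted notes", nil),
  Member.new("Preparatory materials", 1),
  Member.new("Statistics", 3)
]

puts ordered(members).map(&:title).inspect
```

An application that does not care about order can simply leave `sort` unset. A single property works here because each member has one parent; a Work that belongs to several Collections would need a separate position per Collection, which is the case the Ordered List Ontology is suggested for.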

awead commented 10 years ago

@escowles, :+1: to ordered list ontology. I'm also assuming that these aspects would be baked-in to the model but easy to ignore if you weren't worried about order. Also, might sort fields be implementation-specific?

@mjgiarlo yeah, this shouldn't be hard to map, although we'll mint a bunch more pids to create the additional "works" for each existing GenericFile.

mjgiarlo commented 10 years ago

@awead If we decide to make use of the Batch objects we already have in the system such that every Batch of GenericFiles is a Work -- not saying we should, but it's one migration decision we could make -- we may also need to create Components to hold descriptive metadata about Files. (Since in the @escowles model, a File object cannot hold descMD.) Still pretty easy to map.