ucsdlib / damspas-rd

A Digital Collections application based on Hyrax
MIT License

Finalize File Use modeling #111

Closed mcritchlow closed 6 years ago

mcritchlow commented 7 years ago

Descriptive summary

We need to reach agreement on our mapping of existing file use properties in DAMS4 to DAMS5. Longshou noted this is the last (major) mapping/modeling task that we've yet to finish.

If a meeting is needed let's get that setup.

Rationale

Because data models.

Related work

46 - dependency

mcritchlow commented 7 years ago

Going to initially assign @arwenhutt and @remerjohnson and then we can sort out ultimate responsibilities, next steps, etc., as needed.

arwenhutt commented 7 years ago

@mcritchlow @lsitu: @remerjohnson and I discussed this today, and he is taking the lead on this.

There will probably be some questions here for ITS at some point, should he @ both of you?

mcritchlow commented 7 years ago

Yep that sounds good to me. We could hash those out here if that works, and then of course meet if needed as well.

remerjohnson commented 7 years ago

@mcritchlow @gamontoya So a missing piece of this for me is how functions similar to our notion of "File Use" work when just using the Horton/Hyrax interface. I'd like to make recommendations that are basically similar to current functionality, just in batch form. Or are all derivatives inferred from file type/format?

mcritchlow commented 7 years ago

@remerjohnson, @lsitu has the most experience with this, so pinging him to look at your question.

remerjohnson commented 7 years ago

Thanks. Also would like to know if the PCDM Use is utilized by Hyrax or not.

lsitu commented 7 years ago

@remerjohnson As far as I know, the "File Use" work in Hyrax is still ongoing and won't be reflected in the Horton/Hyrax interface at this time. All uploaded files are marked as "original_file", the derivatives are bound to the file type/format (which determines the derivative extension) for retrieval, and no technical metadata is extracted and persisted in Fedora/the triplestore. I think we can apply the file use properties you recommended to the Batch Import form, so that we have full support in the critical Excel batch import function as a first step, and then see what we can get from upstream to make it available in the Horton/Hyrax UI. @mcritchlow Do you have any plans or thoughts on this?

remerjohnson commented 7 years ago

@lsitu So does that mean we can use what's in the existing Excel File Use vocab, or should I revise them?

lsitu commented 7 years ago

@remerjohnson We could, but we had better map it to PCDM Use if we can, especially since the related work is ongoing. If we can't stick with it, yes, we could implement our own set of file use properties. I have to mention that the file use in Hyrax is not a property but an rdf:type with a URI like http://pcdm.org/use#OriginalFile.
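
For illustration only, here is a minimal Python sketch of what "an rdf:type rather than a property" means in practice. This is not Hyrax code: the local labels, the `LOCAL_TO_PCDM` table, and both helper functions are hypothetical, and only the PCDM Use class names mentioned in this thread are used.

```python
# Namespace of the PCDM Use vocabulary and the standard rdf:type predicate.
PCDM_USE = "http://pcdm.org/use#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical local labels (assumption) mapped to PCDM Use class names.
LOCAL_TO_PCDM = {
    "original": "OriginalFile",
    "preservation": "PreservationMasterFile",
}


def use_type_uri(label):
    """Return the PCDM Use class URI for a local label, or None if unmapped."""
    name = LOCAL_TO_PCDM.get(label)
    return PCDM_USE + name if name else None


def use_triple(file_uri, label):
    """Express the file use as an rdf:type statement about the file."""
    type_uri = use_type_uri(label)
    if type_uri is None:
        raise ValueError(f"no PCDM Use mapping for label: {label!r}")
    return (file_uri, RDF_TYPE, type_uri)
```

Under this reading, a label with no PCDM Use equivalent would either need a locally defined class or stay outside the rdf:type mechanism.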

remerjohnson commented 7 years ago

So, sketching this out. @mcritchlow @lsitu. Keep in mind this is biased for DOMM's use... (e.g. our "original" use values will generate service derivatives, thumbnails etc)

  1. We can squash all the use#OriginalFile values into one value (original) if you are certain Hyrax will just handle all that based on the file type. Would it be fine with data, and offer up .zip and .tar files, and just the (binary) file if it doesn't recognize the format? Sort of like this?
  2. Feel free to comment on what we should move our current document-service use case to (mostly now that's what we use for PDFs). I've mapped it to #ServiceFile... If that's just easier as another #OriginalFile let me know
  3. I preserved alternate but there is no true mapping to PCDM Use. If we don't think we need that I'm sure it could map to something...
  4. Any other use cases we need @ucsdlib/domm?

| DAMS5 | DAMS4 | PCDM Use | Desired Behavior |
| --- | --- | --- | --- |
| preservation | source? | use#PreservationMasterFile | Do not generate derivatives, curator-only access/download |
| image-original | image-source | use#OriginalFile | Generate preview image, zoom/viewer, and thumbnail |
| audio-original | audio-source | use#OriginalFile | Generate .mp3 stream |
| video-original | video-source | use#OriginalFile | Generate video stream, thumbnail |
| data-original | data-service | use#OriginalFile | Allow download |
| document-extractedText | document-service? | use#ExtractedText | Fulltext indexing |
| document-service | document-service | use#ServiceFile | Generate preview, thumbnail, allow download |
| transcript | document-service? | use#Transcript | Fulltext indexing? thumbnail? |
| alternate | alternate | N/A? or use#PreservationMasterFile | Do not generate derivatives, curator-only download |

ghost commented 7 years ago

What about:

I'm ingesting objects now that have a .zip of several raw audio files that were edited down to the public version. The curator wants those source files to be archived in case a different edit might be desired in the future.

I guess these would be "preservation".

remerjohnson commented 7 years ago

@GregReser Was hoping those could all go in as preservation. The access controls I'm not so clear on. I was assuming these would be "curator-only" but as I mentioned yesterday, that role is not so clear and I could imagine cases where we want people to access this stuff...

But yeah, that's where I come back to us needing probably more than just 2 use values... could easily see the case where we'd need 3 once preservation is needed

ghost commented 7 years ago

Does "preservation" do anything other than control derivative creation and visibility? Do we only preserve (via Chronopolis) that one file and not originals or do we really preserve all files equally? I'm guessing all files are preserved. If that's true, then "preservation" also denotes that this is the most complete file we have. It may not be the best (in terms of color correction, etc.) but it has all the original content available.

remerjohnson commented 7 years ago

Basically yeah, it's just denoting derivative and visibility. The vocab says

> Best quality representation of the Object appropriate for long-term preservation.

So, yes it would seem (and I would hope!) the source "original files" that get derivatives (e.g. TIFFs we want derivatives of to display) also would be preserved, but this preservation designation is a way of saying, the system doesn't need to do anything special with this file(s), just preserve it as part of this file set.

ghost commented 7 years ago

And if those source files need to be available to users, they will just be linked to components and typed as original. For instance, an uncropped, unretouched image file could be added as another component with a title explaining what it is.

remerjohnson commented 7 years ago

True, although in that case, if marked original, it would generate derivatives. Is that fitting the use case?

ghost commented 7 years ago

Yes, because a preview image, zoom/viewer, and thumbnail would be needed for navigation and viewing. This assumes the file is meant to be viewed - say we thought the unretouched and retouched versions were of equal value, or at least, we want the user to see both versions and decide for themselves.

arwenhutt commented 7 years ago

If I'm understanding @lsitu correctly, is it necessary to include "audio" "text" etc. as part of the vocabulary or could the specific way a "file use" type is implemented (say for example what type of derivative is created for a file marked with "original") be driven by the file's format?

remerjohnson commented 7 years ago

@arwenhutt Right, if the system can handle that, we could just use original. As far as I can tell, that's the current state of Hyrax (if something's not recognized, it's just offered as binary).

Then it could look like:

| DAMS5 | DAMS4 | PCDM Use | Desired Behavior |
| --- | --- | --- | --- |
| preservation | source? | use#PreservationMasterFile | Do not generate derivatives, curator-only access/download |
| original | source | use#OriginalFile | Generate preview image, zoom/viewer, and thumbnail (depends on format) |
| document-extractedText | document-service? | use#ExtractedText | Fulltext indexing |
| service | service | use#ServiceFile | Generate preview, thumbnail, allow download? |
| transcript | document-service? | use#Transcript | Generate preview? fulltext indexing? thumbnail? |
| alternate | alternate | N/A? or use#PreservationMasterFile | Do not generate derivatives, curator-only download |

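
As a rough sketch of "original, with the derivatives driven by format", something like the following Python could express the idea. It is not how Hyrax implements derivatives: keying on the MIME type's top-level part is an assumption, and the function and table names are hypothetical. The derivative lists come from the earlier per-format table.

```python
# Derivatives per media family, taken from the earlier per-format table.
DERIVATIVES_BY_MEDIA = {
    "image": ["preview image", "zoom/viewer", "thumbnail"],
    "audio": [".mp3 stream"],
    "video": ["video stream", "thumbnail"],
}


def derivatives_for(file_use, mime_type):
    """List the derivatives to generate for a file.

    Only files marked "original" get derivatives; unrecognized formats
    get none and would simply be offered as the binary file.
    """
    if file_use != "original":
        return []
    media = mime_type.split("/", 1)[0]
    return list(DERIVATIVES_BY_MEDIA.get(media, []))
```

With this shape, the vocabulary no longer needs `image-original`, `audio-original`, and so on; the format carries that information.
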
arwenhutt commented 7 years ago

okay, thanks, I was going on the proposed values in the table.

The other use case that I can think of is where curators wanted to display a derivative version, but also allow download of the source file. This could apply to any file format, but would evoke the same actions as original PLUS make the file in question publicly downloadable. It could be labeled something like original-open.

It seems that preservation and alternate trigger the same functions. Are both necessary?

A bit nitpicky, but I don't really like the label for "preservation". In most cases the "original file" is also going to be the richest and most preservation-worthy, so using "preservation" implies something different. This may purely be semantics though, so unless someone else wants to make a case for it, probably not worth the trouble.

arwenhutt commented 7 years ago

@remerjohnson okay, so looking at the updated table you added to your previous comment, does service serve the function I mentioned?

(it just occurred to me, are we mixing apples and oranges? aren't the hyrax use terms internal use indicators, while the terms we need for ingest are more triggering a series of actions performed on ingest, creation of additional files in some cases, and assignment of the final set of files a hyrax use term.)

ghost commented 7 years ago

I had the same concern about the label "preservation", especially if it were ever used for some kind of automatic process, like archiving or exporting. Trying to determine "Best quality representation of the Object appropriate for long-term preservation" is open to interpretation and hard to apply consistently. I don't know what the answer is since we are trying to work within Hyrax. Maybe we just consider preservation to be what we used to call source. Source isn't necessarily better or more worthy of preservation than original; it's just, well, the source of the original. This seems to be what Ryan was leaning towards in the first table.

remerjohnson commented 7 years ago

@arwenhutt I don't see DOMM using service really, and the system would seem to generate those, same with thumbnail. I think this "source download" use case would need its own label/rules.

Keep in mind, anything not recognized would (I think) offer the "source" file as download. I have no idea if this kind of override functionality can be implemented. @mcritchlow @lsitu ?

> it just occurred to me, are we mixing apples and oranges? aren't the hyrax use terms internal use indicators, while the terms we need for ingest are more triggering a series of actions.

This seems to be the only way to indicate those series of actions. If there's another way to tell the system what to do, I'm all for it. As Longshou noted, these PCDM Use terms have not really been implemented, so everything currently in Hyrax is just marked as original. It's hard making a vocab for functions I do not know the possibility of implementation of :wink:

arwenhutt commented 7 years ago

So I'm wondering then of the value of letting the PCDM terms guide our approach?

remerjohnson commented 7 years ago

That's not something I could speak to unfortunately. Would need dev input.

arwenhutt commented 7 years ago

It just seems we're winding ourselves into a kind of knot, trying to resolve our local use cases, with the as yet unimplemented list of PCDM terms. Might be worth considering whether they'll make things more or less complicated.

Aside from that, I think it's also worth thinking about whether there's value in separating these two ideas:

  1. terms that indicate what the system should do with the file on ingest
  2. terms used for tracking the purpose of the file within the system

It could simplify things.
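
A minimal sketch of that separation, in Python and purely illustrative: the `IngestPlan` type, the table, and the helper are hypothetical names, and only two terms from the tables above are filled in.

```python
from typing import NamedTuple


class IngestPlan(NamedTuple):
    generate_derivatives: bool  # (1) what the system should do on ingest
    tracked_use: str            # (2) how the stored file is labeled afterwards


# Hypothetical ingest vocabulary; each term resolves to an action plus a label.
INGEST_TERMS = {
    "original": IngestPlan(generate_derivatives=True, tracked_use="OriginalFile"),
    "preservation": IngestPlan(generate_derivatives=False, tracked_use="PreservationMasterFile"),
}


def plan_for(term):
    """Look up the ingest plan for a term; raises KeyError for unknown terms."""
    return INGEST_TERMS[term]
```

The point is only that the ingest term and the tracked use can vary independently, so a local ingest term would not need a one-to-one PCDM equivalent.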

lsitu commented 7 years ago

@arwenhutt @remerjohnson: It seems like a good strategy to start with. And we may want to know about the following issues:

arwenhutt commented 7 years ago

@lsitu Thanks! I think the question of what derivatives we need will mostly come from your side of the house right? Based on IIIF requirements, streaming servers, display calls, etc.

Good to know the default for hyrax original is to make that downloadable. So then we would be adding a restriction, rather than a permission.

remerjohnson commented 7 years ago

Yes, taking into account that original will allow the download of the source file, I have introduced curated (name not important for now), which covers our usual use case, where derivative files are the download option.

| DAMS5 | DAMS4 | PCDM Use | Desired Behavior |
| --- | --- | --- | --- |
| preservation | source? | use#PreservationMasterFile | Do not generate derivatives, curator-only access and download |
| original | source | use#OriginalFile | Generate derivatives, allow public download of source file |
| curated | service | ? | Generate derivatives, allow public download of derivative file, curator-only download of source file |
| document-extractedText | document-service? | use#ExtractedText | Fulltext indexing |
| transcript | document-service? | use#Transcript | Generate preview? Public download |

lsitu commented 7 years ago

@arwenhutt @remerjohnson For the derivatives, I mean whether any kind of derivative needs to be created/stored in the derivative filestore/Fedora for any purpose. Regarding Hyrax allowing original to be downloadable, I think it doesn't matter what the file use is named, as Ryan said, but we need to know what will be restricted from public access, which is something that we need to implement.

arwenhutt commented 7 years ago

@remerjohnson can you work with @lsitu on resolving any other open issues for this ticket?

remerjohnson commented 7 years ago

@arwenhutt sure

@lsitu Are there any outstanding issues?

lsitu commented 7 years ago

@arwenhutt @remerjohnson No, I don't see any outstanding issues. I'm just confused about the "original" file that allows public access and the "presentation" derivative that is restricted to curator-only download. In dams4, I understand that the "original" source files are restricted to curator download instead. Also, do you expect any of the service files like "curated", "presentation", "transcript", etc. to be stored along with the source files in Fedora or not? I am assuming that we all agree on storing source files and "document-extractedText" in Fedora.

remerjohnson commented 7 years ago

> I'm just confused about the "original" file that allows public access and the "presentation" derivative that is restricted to curator-only download. In dams4, I understand that the "original" source files are restricted to curator download instead.

I know, it was confusing to me, too. You're correct that the current behavior in dams4 differs from the default Hyrax original function, which is why I came up with curated. curated would mirror what we have now in dams4 - generate derivatives, and the public can download the derivatives.

In original, curated, and presentation, there's public "access" in the sense they can view the derivative: they just treat the downloading of files (and which file to download) differently.

> Also, do you expect any of the service files like "curated", "presentation", "transcript", etc. to be stored along with the source files in Fedora or not? I am assuming that we all agree on storing source files and "document-extractedText" in Fedora.

True, those should probably be stored, but I must admit I am not knowledgeable in this area. I've only seen some sporadic mentions of this in the Hydra Slack. @mcritchlow ?

mcritchlow commented 7 years ago

So I think the distinction we ideally want to make, to follow the convention (and justification behind the convention) is to have anything source/original/master, whatever we call it, persisted in Fedora. And any derivatives/generated files be on disk for fast retrieval. Essentially, being treated as a cache.

By that I mean, anything we would make a derivative from would go in Fedora. Any derivative would not. I'm a little fuzzy on what curated and presentation mean, and I'm not sure that helps. My sense is our variety of use cases is definitely not supported upstream, so we're going to have to make this work locally.

remerjohnson commented 7 years ago

> So I think the distinction we ideally want to make, to follow the convention (and justification behind the convention) is to have anything source/original/master, whatever we call it, persisted in Fedora. And any derivatives/generated files be on disk for fast retrieval. Essentially, being treated as a cache.

That's definitely what I've heard about Fedora, as well as horror stories about what happens when you store "everything".

I think this could apply regardless of our File Uses above. curated is what I came up with to do roughly what we do now in dams4... that is, we ingest/store the original e.g. .wav file, curators can download that file, but the public can't. Where it gets tricky (and I don't know if this is possible), is if we can let the public download the derivative .mp3, while still allowing curators to download that source .wav.

Then the presentation scenario is exactly the same, except the public can't download anything, they just get the stream.
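
To make the differences concrete, here is a small Python sketch of the download rules as I understand them. It is only a sketch and not Hyrax's permission model: the role strings, the table, and the function are hypothetical, and letting curators also download whatever the public can is an assumption.

```python
# What the public may download for each file use, per the rules above.
PUBLIC_DOWNLOADS = {
    "preservation": set(),       # curator-only access and download
    "original": {"source"},      # public can download the source file
    "curated": {"derivative"},   # public gets the derivative (e.g. .mp3)
    "presentation": set(),       # public only gets the stream
}


def allowed_downloads(file_use, role):
    """Return which of {"source", "derivative"} the given role may download."""
    public = PUBLIC_DOWNLOADS[file_use]
    if role == "curator":
        # Assumption: curators can always download the source (e.g. .wav).
        return public | {"source"}
    return set(public)
```

For example, `allowed_downloads("curated", "public")` gives only the derivative, while the same call with `"curator"` also includes the source.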

mcritchlow commented 7 years ago

Gotcha. @lsitu can probably speak more authoritatively, but I don't see any reason why we couldn't support that same mp3 vs wav use case depending on the user role in Hyrax. Assuming it's a global rule, of course. I suspect the same about presentation.

remerjohnson commented 7 years ago

Also, should point out document-extractedText and transcript would be "new" use cases and are therefore probably not critical for implementation. Respectively, they would cover offering mine-able text ("fulltext indexing, such as a plaintext version of a document, or OCR text.") and transcripts/subtitles for movies/audio.

Both of these we don't really do now so this might involve a separate planning?

mcritchlow commented 7 years ago

Those might warrant their own tickets and be tagged as enhancements. @arwenhutt and @gamontoya might know more about those than I do, but if it's new functionality it might need to float through the Products Groups as well. Not sure.

arwenhutt commented 7 years ago

> Respectively, they would cover offering mine-able text ("fulltext indexing, such as a plaintext version of a document, or OCR text.") and transcripts/subtitles for movies/audio.

How would these new use cases intersect with our current use of fulltext indexing for some items?

mcritchlow commented 7 years ago

It sounds like the difference is they're supplied on ingest, rather than being generated from an existing file via our automated fulltext indexing. Is that right, @remerjohnson ?

Transcripts for A/V files would be totally new territory.

remerjohnson commented 7 years ago

@mcritchlow That sounds correct. Although I am not aware of everything that currently gets fulltext indexing, so :smile:

arwenhutt commented 7 years ago

> Transcripts for A/V files would be totally new territory.

Yep, we just might want to be aware when determining processes for the ingested fulltext that they play nicely with extracted fulltext, and ideally can be handled by many of the same processes (for simplicity). I admit I don't know much about those current processes either though, just an impression that it's usually from pdf's and the fulltext is stored in the index.

remerjohnson commented 7 years ago

Yes. It could be that our process will generate the extractedText and we never supply one, much like we never supply thumbnails.

mcritchlow commented 7 years ago

True. But going back to one of the first examples you mentioned, a supplied OCR file would be something that presumably wouldn't be extracted otherwise from an original file. At least not with our regular automated extraction process or at the desired level of quality/detail. So if that is something we need to support, we'll need to make sure that File type gets mapped to fulltext indexing, for example.

arwenhutt commented 7 years ago

Ignore this if it's too much of a distraction, but since we're talking about file behavior, do we want to make an effort to make extractedText retrievable in some form? Maybe the index already does this, or it could be something to think about for a once and future api spec, or maybe it's something that we need to determine as part of the fileUse spec.

It seems like that text would potentially be of interest to DH folks.

remerjohnson commented 7 years ago

Yes, that use case was more of a Collections as Data idea

gamontoya commented 7 years ago

@remerjohnson @mcritchlow @lsitu FYI, regarding letting the public download the .mp3 derivative while allowing curators to download the source .wav file: no user is currently allowed to download .mp3 derivatives.

We do have these use cases:

remerjohnson commented 7 years ago

@gamontoya Ah yes, those are more representative scenarios... I guess this is difficult because formats are tied to access and it's hard to anticipate all of that.

I think this makes it so that presentation isn't needed...

gamontoya commented 7 years ago

@remerjohnson Yeah, we should get away from tying rules to format types.