Closed mcritchlow closed 6 years ago
Going to initially assign @arwenhutt and @remerjohnson and then we can sort out ultimate responsibilities, next steps, etc., as needed.
@mcritchlow @lsitu @remerjohnson and I discussed this today and he is taking lead on this.
There will probably be some questions here for ITS at some point, should he @ both of you?
Yep that sounds good to me. We could hash those out here if that works, and then of course meet if needed as well.
@mcritchlow @gamontoya So a missing piece of this for me is how functions similar to our notion of "File Use" work when just using the Horton/Hyrax interface. I'd like to make recommendations that are basically similar to current functionality, just in batch form. Or are all derivatives inferred from file type/format?
@remerjohnson, @lsitu has the most experience in this, so pinging him to look at your question
Thanks. Also would like to know if the PCDM Use is utilized by Hyrax or not.
@remerjohnson As far as I know, the "File Use" work in Hyrax is still ongoing and won't be reflected in the Horton/Hyrax interface at this time. All uploaded files are marked as "original_file", and the derivatives are bound to file type/format (which determines the derivative extension) for retrieval; no technical metadata is extracted and persisted in Fedora/the triplestore. I think we can apply the file use properties you recommended to the Batch Import form, so that we have full support for the critical Excel batch import function as a first step, and then see what we can get from upstream to make it available in the Horton/Hyrax UI. @mcritchlow Do you have any plans or thoughts on this?
@lsitu So does that mean we can use what's in the existing Excel File Use vocab, or should I revise them?
@remerjohnson We could, but we had better map it to PCDM Use if we can, especially since the related work is ongoing. If we can't stick with it, yes, we could implement our own set of file use properties. I should mention that the file use in Hyrax is not a property but an rdf:type with a URL like http://pcdm.org/use#OriginalFile.
So, sketching this out. @mcritchlow @lsitu. Keep in mind this is biased for DOMM's use... (e.g. our "original" use values will generate service derivatives, thumbnails etc)
`original`) if you are certain Hyrax will just handle all that based on the file type. Would it be fine with data, and offer up .zip and .tar files, and just the (binary) file if it doesn't recognize the format? Sort of like this?
`document-service` use case to (mostly now that's what we use for PDFs). I've mapped it to #ServiceFile... If that's just easier as another #OriginalFile let me know
`alternate` but there is no true mapping to PCDM Use. If we don't think we need that I'm sure it could map to something...
DAMS5 | DAMS4 | PCDM Use | Desired Behavior |
---|---|---|---|
preservation | source? | use#PreservationMasterFile | Do not generate derivatives, curator-only access/download |
image-original | image-source | use#OriginalFile | Generate preview image, zoom/viewer, and thumbnail |
audio-original | audio-source | use#OriginalFile | Generate .mp3 stream |
video-original | video-source | use#OriginalFile | Generate video stream, thumbnail |
data-original | data-service | use#OriginalFile | Allow download |
document-extractedText | document-service? | use#ExtractedText | Fulltext indexing |
document-service | document-service | use#ServiceFile | Generate preview, thumbnail, allow download |
transcript | document-service? | use#Transcript | Fulltext indexing? thumbnail? |
alternate | alternate | N/A? or use#PreservationMasterFile | Do not generate derivatives, curator-only download |
What about:
I'm ingesting objects now that have a .zip of several raw audio files that were edited down to the public version. The curator wants those source files to be archived in case a different edit might be desired in the future.
I guess these would be "preservation".
@GregReser Was hoping those could all go in as `preservation`. The access controls I'm not so clear on. I was assuming these would be "curator-only" but as I mentioned yesterday, that role is not so clear and I could imagine cases where we want people to access this stuff...
But yeah, that's where I come back to us needing probably more than just 2 use values... could easily see the case where we'd need 3 once preservation is needed
Does "preservation" do anything other than control derivative creation and visibility? Do we only preserve (via Chronopolis) that one file and not originals or do we really preserve all files equally? I'm guessing all files are preserved. If that's true, then "preservation" also denotes that this is the most complete file we have. It may not be the best (in terms of color correction, etc.) but it has all the original content available.
Basically yeah, it's just denoting derivative and visibility. The vocab says
Best quality representation of the Object appropriate for long-term preservation.
So, yes it would seem (and I would hope!) the source "original files" that get derivatives (e.g. TIFFs we want derivatives of to display) also would be preserved, but this `preservation` designation is a way of saying, the system doesn't need to do anything special with this file(s), just preserve it as part of this file set.
And if those source files need to be available to users, they will just be linked to components and typed as original. For instance, an uncropped, unretouched image file could be added as another component with a title explaining what it is.
True, although in that case, if marked original, it would generate derivatives. Is that fitting the use case?
Yes, because a preview image, zoom/viewer, and thumbnail would be needed for navigation and viewing. This assumes the file is meant to be viewed - say we thought the unretouched and retouched versions were of equal value, or at least, we want the user to see both versions and decide for themselves.
If I'm understanding @lsitu correctly, is it necessary to include "audio" "text" etc. as part of the vocabulary or could the specific way a "file use" type is implemented (say for example what type of derivative is created for a file marked with "original") be driven by the file's format?
@arwenhutt Right, if the system can handle that, we could just use `original`. As far as I can tell, that's the current state of Hyrax (if something's not recognized, it's just offered as binary).
Then it could look like:
DAMS5 | DAMS4 | PCDM Use | Desired Behavior |
---|---|---|---|
preservation | source? | use#PreservationMasterFile | Do not generate derivatives, curator-only access/download |
original | source | use#OriginalFile | Generate preview image, zoom/viewer, and thumbnail (depends on format) |
document-extractedText | document-service? | use#ExtractedText | Fulltext indexing |
service | service | use#ServiceFile | Generate preview, thumbnail, allow download? |
transcript | document-service? | use#Transcript | Generate preview? fulltext indexing? thumbnail? |
alternate | alternate | N/A? or use#PreservationMasterFile | Do not generate derivatives, curator-only download |
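A rough illustration of the "driven by the file's format" idea, for anyone who finds it easier to read as code. This is only a sketch, not Hyrax code: the function name `derivatives_for` and the derivative labels are hypothetical, and the per-format behavior is simply copied from the "Desired Behavior" column of the first table in this thread.

```python
from typing import Dict, List

# Hypothetical sketch: pick derivatives for a file marked "original" from its
# MIME type instead of from a format-specific file use value.
# The labels mirror the "Desired Behavior" column proposed in this thread;
# none of these names come from Hyrax itself.
DERIVATIVES_BY_MEDIA_TYPE: Dict[str, List[str]] = {
    "image": ["preview", "zoom", "thumbnail"],
    "audio": ["mp3-stream"],
    "video": ["video-stream", "thumbnail"],
}


def derivatives_for(mime_type: str) -> List[str]:
    """Return the derivative labels to generate for this MIME type.

    Unrecognized formats get no derivatives, so the file would simply be
    offered as a binary download.
    """
    media_type = mime_type.split("/", 1)[0].lower()
    return list(DERIVATIVES_BY_MEDIA_TYPE.get(media_type, []))
```

With something like this, a .zip of data would fall through to the empty list and just be offered as the (binary) file, which matches what was described above.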
okay, thanks, I was going on the proposed values in the table.
The other use case that I can think of is where curators wanted to display a derivative version, but also allow download of the source file. This could apply to any file format, but would evoke the same actions as `original` PLUS make the file in question publicly downloadable. It could be labeled something like `original-open`.
It seems that `preservation` and `alternate` trigger the same functions. Are both necessary?
A bit nitpicky, but I don't really like the label for "preservation". In most cases the "original file" is also going to be the richest and most preservation worthy, using "preservation" implies something different. This may purely be semantics though, so unless someone else wants to make a case for it, probably not worth the trouble.
@remerjohnson okay, so looking at the updated table you added to your previous comment, does `service` serve the function I mentioned?
(it just occurred to me, are we mixing apples and oranges? aren't the hyrax use terms internal use indicators, while the terms we need for ingest are more triggering a series of actions performed on ingest, creation of additional files in some cases, and assignment of the final set of files a hyrax use term.)
I had the same concern about the label "preservation", especially if it were ever used for some kind of automatic process, like archiving or exporting. Trying to determine "Best quality representation of the Object appropriate for long-term preservation" is open to interpretation and hard to apply consistently. I don't know what the answer is since we are trying to work within Hyrax. Maybe we just consider preservation to be what we used to call source. Source isn't necessarily better or more worthy of preservation than original, it's just, well, the source of the original. This seems to be what Ryan was leaning towards in the first table.
@arwenhutt I don't see DOMM using `service` really, and the system would seem to generate those, same with thumbnail. I think this "source download" use case would need its own label/rules.
Keep in mind, anything not recognized would (I think) offer the "source" file as download. I have no idea if this kind of override functionality can be implemented. @mcritchlow @lsitu ?
it just occurred to me, are we mixing apples and oranges? aren't the hyrax use terms internal use indicators, while the terms we need for ingest are more triggering a series of actions.
This seems to be the only way to indicate those series of actions. If there's another way to tell the system what to do, I'm all for it. As Longshou noted, these PCDM Use terms have not really been implemented, so everything currently in Hyrax is just marked as `original`. It's hard making a vocab for functions I do not know the possibility of implementation of :wink:
So I'm wondering then of the value of letting the PCDM terms guide our approach?
That's not something I could speak to unfortunately. Would need dev input.
It just seems we're winding ourselves into a kind of knot, trying to resolve our local use cases, with the as yet unimplemented list of PCDM terms. Might be worth considering whether they'll make things more or less complicated.
Aside from that, I think it's also worth thinking about whether there's value in separating these two ideas:
It could simplify things.
@arwenhutt @remerjohnson: It seems like a good strategy to start with. And we may want to know about the following issues:
@lsitu Thanks! I think the question of what derivatives we need will mostly come from your side of the house right? Based on IIIF requirements, streaming servers, display calls, etc.
Good to know the default for Hyrax `original` is to make that downloadable. So then we would be adding a restriction, rather than a permission.
Yes, taking into account that `original` will allow the download of the source file, I have introduced `curated` (name not important for now) that is our usual use case, where derivative files are the download option.
DAMS5 | DAMS4 | PCDM Use | Desired Behavior |
---|---|---|---|
preservation | source? | use#PreservationMasterFile | Do not generate derivatives, curator-only access and download |
original | source | use#OriginalFile | Generate derivatives, allow public download of source file |
curated | service | ? | Generate derivatives, allow public download of derivative file, curator-only download of source file |
document-extractedText | document-service? | use#ExtractedText | Fulltext indexing |
transcript | document-service? | use#Transcript | Generate preview? Public download |
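Since the file use in Hyrax was described earlier as an rdf:type rather than a property, the table above could be sketched as a lookup that yields a type triple. This is a hedged illustration only: the names `FILE_USE_TO_PCDM` and `file_use_triple` are made up for this example, the PCDM Use terms are the ones written in the table (not checked against the published vocabulary), and `curated` has no mapping because the table leaves it as "?".

```python
from typing import Dict, Optional, Tuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PCDM_USE = "http://pcdm.org/use#"

# Hypothetical lookup built from the table above. None marks a file use
# for which no PCDM Use term has been agreed in this thread.
FILE_USE_TO_PCDM: Dict[str, Optional[str]] = {
    "preservation": PCDM_USE + "PreservationMasterFile",
    "original": PCDM_USE + "OriginalFile",
    "curated": None,
    "document-extractedText": PCDM_USE + "ExtractedText",
    "transcript": PCDM_USE + "Transcript",
}


def file_use_triple(file_uri: str, file_use: str) -> Optional[Tuple[str, str, str]]:
    """Return an (subject, rdf:type, PCDM Use term) triple for a batch-import row.

    Returns None when the file use has no PCDM mapping; raises KeyError for
    a value that is not in the vocabulary at all.
    """
    pcdm_term = FILE_USE_TO_PCDM[file_use]
    if pcdm_term is None:
        return None
    return (file_uri, RDF_TYPE, pcdm_term)
```

A local term such as `curated` would then need either its own local type or a fallback, which is the open question in the table.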
@arwenhutt @remerjohnson For the derivatives, I mean whether or not any kind of derivatives need to be created/stored in the derivative filestore/Fedora for any purpose.
Regarding Hyrax allowing `original` to be downloadable, I think it doesn't matter what file use `name` it is, as Ryan said, but we need to know what will be restricted for public access, which is something that we need to implement.
@remerjohnson can you work with @lsitu on resolving any other open issues for this ticket?
@arwenhutt sure
@lsitu Are there any outstanding issues?
@arwenhutt @remerjohnson No. I don't see any outstanding issues but just confusing regarding the "original" file that allow public access and the "presentation" derivative that restricts to curator-only download. In dams4, I understand that the "original" source files are restricted to curator download instead. Also, do you expect any of the service files like "curated", "presentation", "transcript" etc to be stored along with the source files on fedora or not? I am assuming that we all agree on storing source files and "document-extractedText" in Fedora.
but just confusing regarding the "original" file that allow public access and the "presentation" derivative that restricts to curator-only download. In dams4, I understand that the "original" source files are restricted to curator download instead.
I know, it was confusing to me, too. You're correct that the current behavior in dams4 differs from the default Hyrax `original` function, which is why I came up with `curated`. `curated` would mirror what we have now in dams4 - generate derivatives, and the public can download the derivatives.
In `original`, `curated`, and `presentation`, there's public "access" in the sense they can view the derivative: they just treat the downloading of files (and which file to download) differently.
Also, do you expect any of the service files like "curated", "presentation", "transcript" etc to be stored along with the source files on fedora or not? I am assuming that we all agree on storing source files and "document-extractedText" in Fedora.
True, those should probably be stored, but I must admit I am not knowledgeable in this area. I've only seen some sporadic mentions of this in the Hydra Slack. @mcritchlow ?
So I think the distinction we ideally want to make, to follow the convention (and justification behind the convention) is to have anything source/original/master, whatever we call it, persisted in Fedora. And any derivatives/generated files be on disk for fast retrieval. Essentially, being treated as a cache.
By that I mean, anything we would make a derivative from, would go in Fedora. Any derivative, would not. I'm a little fuzzy on what `curated` and `presentation` means, and I'm not sure that helps. My sense is our variety of use cases is definitely not supported upstream, so we're going to have to make this work locally.
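To make the "derivatives on disk, treated as a cache" convention concrete, here is a minimal sketch of what that could look like. It is an assumption-laden illustration, not how Hyrax stores derivatives: `fetch_derivative`, the directory layout, and the `generate` callback (standing in for whatever would build the derivative from the source persisted in Fedora) are all hypothetical.

```python
from pathlib import Path
from typing import Callable


def fetch_derivative(
    cache_dir: Path,
    file_id: str,
    name: str,
    generate: Callable[[], bytes],
) -> bytes:
    """Return a derivative from the on-disk cache, regenerating it if missing.

    The source file is assumed to live in Fedora; the derivative is never
    persisted there, so losing the cache only costs a regeneration.
    """
    path = cache_dir / file_id / name
    if path.exists():
        # Cache hit: serve straight from disk for fast retrieval.
        return path.read_bytes()
    # Cache miss: rebuild from the source and store it for next time.
    data = generate()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data
```

The point of the sketch is only that the derivative store can be wiped and rebuilt, while anything a derivative is made from cannot.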
So I think the distinction we ideally want to make, to follow the convention (and justification behind the convention) is to have anything source/original/master, whatever we call it, persisted in Fedora. And any derivatives/generated files be on disk for fast retrieval. Essentially, being treated as a cache.
That's definitely what I've heard about Fedora, as well as horror stories for what happens when you store "everything".
I think this could apply regardless of our File Uses above. `curated` is what I came up with to do roughly what we do now in dams4... that is, we ingest/store the original e.g. .wav file, curators can download that file, but the public can't. Where it gets tricky (and I don't know if this is possible), is if we can let the public download the derivative .mp3, while still allowing curators to download that source .wav.
Then the `presentation` scenario is exactly the same, except the public can't download anything, they just get the stream.
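Spelled out as a small rule table, the download behavior described so far might look like the sketch below. The names `PUBLIC_DOWNLOADS` and `downloadable_files` and the two role labels are hypothetical, and it assumes curators can download the source plus whatever the public can, which has not been confirmed anywhere in this thread.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical rule table: which files the public may download per file use,
# as described in this thread. Viewing/streaming derivatives is separate
# from downloading and is not modeled here.
PUBLIC_DOWNLOADS: Dict[str, FrozenSet[str]] = {
    "preservation": frozenset(),           # curator-only access and download
    "original": frozenset({"source"}),     # Hyrax default: source is downloadable
    "curated": frozenset({"derivative"}),  # e.g. public gets .mp3, not the .wav
    "presentation": frozenset(),           # public only gets the stream
}


def downloadable_files(file_use: str, role: str) -> Set[str]:
    """Return which kinds of file ("source", "derivative") a role may download.

    Assumption: a curator may download the source in every case, in addition
    to anything the public may download.
    """
    allowed = set(PUBLIC_DOWNLOADS[file_use])
    if role == "curator":
        allowed.add("source")
    return allowed
```

For the .wav/.mp3 example, `downloadable_files("curated", "public")` gives only the derivative, while a curator also gets the source.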
Gotcha. @lsitu can probably speak more authoritatively, but I don't see any reason why we couldn't support that same mp3 vs wav use case depending on the user role in Hyrax. Assuming it's a global rule, of course. I suspect the same about `presentation`.
Also, should point out `document-extractedText` and `transcript` would be "new" use cases and are therefore probably not critical for implementation. Respectively, they would cover offering mine-able text ("fulltext indexing, such as a plaintext version of a document, or OCR text.") and transcripts/subtitles for movies/audio.
Both of these we don't really do now so this might involve a separate planning?
Those might warrant their own tickets and be tagged as enhancements. @arwenhutt and @gamontoya might know more about those than I do, but if it's new functionality it might need to float through the Products Groups as well. Not sure.
Respectively, they would cover offering mine-able text ("fulltext indexing, such as a plaintext version of a document, or OCR text.") and transcripts/subtitles for movies/audio.
How would these new use cases intersect with our current use of fulltext indexing for some items?
It sounds like the difference is they're supplied on ingest, rather than being generated from an existing file via our automated fulltext indexing. Is that right, @remerjohnson ?
Transcripts for A/V files would be totally new territory.
@mcritchlow That sounds correct. Although I am not aware of everything that currently gets fulltext indexing so :smile:
Transcripts for A/V files would be totally new territory.
Yep, we just might want to be aware when determining processes for the ingested fulltext that they play nicely with extracted fulltext, and ideally can be handled by many of the same processes (for simplicity). I admit I don't know much about those current processes either though, just an impression that it's usually from pdf's and the fulltext is stored in the index.
Yes. It could be that our process will generate the `extractedText` and we never supply one, much like we never supply thumbnails.
True. But going back to one of the first examples you mentioned, a supplied OCR file would be something that presumably wouldn't be extracted otherwise from an original file. At least not with our regular automated extraction process or at the desired level of quality/detail. So if that is something we need to support, we'll need to make sure that File type gets mapped to fulltext indexing, for example.
Ignore this if it's too much of a distraction, but since we're talking about file behavior, do we want to make an effort to make extractedText retrievable in some form? Maybe the index already does this, or it could be something to think about for a once and future api spec, or maybe it's something that we need to determine as part of the fileUse spec.
It seems like that text would potentially be of interest to DH folks.
Yes, that use case was more of a Collections as Data idea
@remerjohnson @mcritchlow @lsitu FYI, regarding letting the public download the .mp3 derivative while allowing curators to download the source .wav file: no user is currently allowed to download .mp3 derivatives.
We do have these use cases:
@gamontoya Ah yes, those are more representative scenarios... I guess this is difficult because formats are tied to access and it's hard to anticipate all of that.
I think this makes it so that `presentation` isn't needed...
@remerjohnson Yeah, we should get away from tying rules to format types.
Descriptive summary
We need to reach agreement on our mapping of existing file use properties in DAMS4 to DAMS5. Longshou noted this is the last (major) mapping/modeling task that we've yet to finish.
If a meeting is needed let's get that set up.
Rationale
Because data models.
Related work
46 - dependency