scientist-softserv / utk-hyku

Spike: Conditional Derivative Generation #342

Closed jeremyf closed 1 year ago

jeremyf commented 1 year ago

Summary

For conditional derivative generation, I think the best approach will be to:

In the above implementation, the job will still perform some of its derivative logic. Skipping that work as well would require further adjustments.

Discussion and Notes

Samvera Gem Versions for UTK:

class Hyrax::DerivativeService
  class_attribute :services
  self.services = [Hyrax::FileSetDerivativesService]
  def self.for(file_set)
    services.map { |service| service.new(file_set) }.find(&:valid?) ||
      new(file_set)
  end
  attr_reader :file_set
  delegate :mime_type, :uri, to: :file_set
  def initialize(file_set)
    @file_set = file_set
  end

  def cleanup_derivatives; end

  def create_derivatives(_file_path); end

  # What should this return?
  def derivative_url(_destination_name)
    ""
  end

  def valid?
    true
  end
end

Hyrax::DerivativeService defines the interface for derivative services (and is itself a viable, albeit abstract, derivative service).

The key method is Hyrax::DerivativeService.for, which finds the first valid service and falls back to an instance of Hyrax::DerivativeService itself.

The IIIF Print gem builds on the above by further configuring the Hyrax::DerivativeService.services class_attribute as follows:

Hyrax::DerivativeService.services.unshift(
  IiifPrint::PluggableDerivativeService
)

This means IiifPrint::PluggableDerivativeService is the first service we check, ahead of Hyrax::FileSetDerivativesService and the fallback service.
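To make the lookup order concrete, here is a minimal plain-Ruby sketch. Everything prefixed with Sketch is a hypothetical stand-in (not Hyrax or IIIF Print code), and the valid? rules are invented purely for illustration; the real services decide validity differently.

```ruby
# Hypothetical stand-ins; not the real Hyrax or IIIF Print classes.
SketchFileSet = Struct.new(:mime_type)

class SketchDerivativeService
  class << self
    # Plain-Ruby substitute for ActiveSupport's class_attribute.
    attr_accessor :services
  end
  self.services = []

  # Same shape as Hyrax::DerivativeService.for: first valid service wins,
  # otherwise fall back to an instance of this (do-nothing) class.
  def self.for(file_set)
    services.map { |service| service.new(file_set) }.find(&:valid?) ||
      new(file_set)
  end

  attr_reader :file_set

  def initialize(file_set)
    @file_set = file_set
  end

  def valid?
    true
  end
end

class SketchFileSetService < SketchDerivativeService
  # Invented rule: only handles TIFFs.
  def valid?
    file_set.mime_type == "image/tiff"
  end
end

class SketchPluggableService < SketchDerivativeService
  # Invented rule: handles any image.
  def valid?
    file_set.mime_type.start_with?("image/")
  end
end

SketchDerivativeService.services = [SketchFileSetService]
# Mirrors the unshift above: the pluggable service is now checked first.
SketchDerivativeService.services.unshift(SketchPluggableService)

tiff_service = SketchDerivativeService.for(SketchFileSet.new("image/tiff"))
pdf_service  = SketchDerivativeService.for(SketchFileSet.new("application/pdf"))
```

Here tiff_service is a SketchPluggableService even though SketchFileSetService would also accept the file, because unshift placed the pluggable service at the front of the list. pdf_service is a plain SketchDerivativeService, since neither registered service is valid for it.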

Further Discussion

The Hyrax::CreateDerivativesJob#perform method (see below) leverages the create_derivatives functionality of the Hyrax::DerivativeService (via file_set.create_derivatives).

def perform(file_set, file_id, filepath = nil)
  return if file_set.video? && !Hyrax.config.enable_ffmpeg
  filename = Hyrax::WorkingDirectory.find_or_retrieve(file_id, file_set.id, filepath)

  file_set.create_derivatives(filename)

  # Reload from Fedora and reindex for thumbnail and extracted text
  file_set.reload
  file_set.update_index
  file_set.parent.update_index if parent_needs_reindex?(file_set)
end

Ideally, we would configure the application not to spawn the job when there is no #valid? concrete derivative service for the given file_set (see the Hyrax::FileSet::Derivatives module). However, we invoke the derivative job in a few different ways, which means we likely need to adjust the #perform method instead.

Further complicating this is that the fallback derivative service (i.e. Hyrax::DerivativeService.new) is always valid. In other words, as implemented, every file_set has a “valid” derivative service; it just so happens that the fallback does nothing.
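One possible way to tell the do-nothing fallback apart from a concrete service is to check the exact class of whatever .for returns. The sketch below is plain Ruby with hypothetical Demo stand-ins; demo_concrete_service_for and demo_perform are invented names, and this is an assumption about how a guard could look, not what Hyrax or this application does.

```ruby
# Hypothetical stand-ins; not the real Hyrax classes.
DemoFileSet = Struct.new(:mime_type)

class DemoDerivativeService
  class << self
    attr_accessor :services
  end
  self.services = []

  def self.for(file_set)
    services.map { |service| service.new(file_set) }.find(&:valid?) ||
      new(file_set)
  end

  attr_reader :file_set

  def initialize(file_set)
    @file_set = file_set
  end

  # The fallback is always "valid" and does nothing.
  def valid?
    true
  end

  def create_derivatives(_file_path); end
end

class DemoImageService < DemoDerivativeService
  def valid?
    file_set.mime_type.start_with?("image/")
  end

  def create_derivatives(file_path)
    "derivatives for #{file_path}"
  end
end

DemoDerivativeService.services = [DemoImageService]

# Returns the concrete service, or nil when only the fallback matched.
# instance_of? is an exact class check, so subclasses are not mistaken
# for the fallback.
def demo_concrete_service_for(file_set)
  service = DemoDerivativeService.for(file_set)
  service.instance_of?(DemoDerivativeService) ? nil : service
end

# A perform-like method that bails out before doing any file work.
def demo_perform(file_set, file_path)
  service = demo_concrete_service_for(file_set)
  return :skipped if service.nil?

  service.create_derivatives(file_path)
  :performed
end
```

With this guard, demo_perform(DemoFileSet.new("text/plain"), "/tmp/a.txt") returns :skipped without touching the file, while an image file set goes through create_derivatives.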

We’ll also want to consider how to change the custom override in Hyrax::CreateDerivativesJobDecorator#perform.

def perform(file_set, file_id, filepath = nil, time_to_live = 2)
  return if file_set.video? && !Hyrax.config.enable_ffmpeg
  # OVERRIDE HYRAX 3.4.1 to skip derivative job unless rdf_type is "pcdm-muse:IntermediateFile"
  if file_set.parent_works.blank?
    raise 'CreateDerivatesJob Failed: FileSet is missing its parent' if time_to_live.zero?

    reschedule(file_set, file_id, filepath, time_to_live - 1)
    return false
  end

  return unless file_set.rdf_type&.join&.downcase&.include?(INTERMEDIATE_FILE)

  # Ensure a fresh copy of the repo file's latest version is being worked on, if no filepath is directly provided
  unless filepath && File.exist?(filepath)
    filepath = Hyrax::WorkingDirectory.copy_repository_resource_to_working_directory(
      Hydra::PCDM::File.find(file_id), file_set.id
    )
  end

  file_set.create_derivatives(filepath)

  # Reload from Fedora and reindex for thumbnail and extracted text
  file_set.reload
  file_set.update_index
  file_set.parent.update_index if parent_needs_reindex?(file_set)
end

In the above implementation, the “ensure a fresh copy” step would be wonderful to have as a block for file_set.create_derivatives(filepath); however, most implementations of the derivative work do not accept a block.


file_set.create_derivatives(filepath) do
  unless filepath && File.exist?(filepath)
    filepath = Hyrax::WorkingDirectory.copy_repository_resource_to_working_directory(
      Hydra::PCDM::File.find(file_id), file_set.id
    )
  end
end

In the above implementation, we’d only call the block for non-null derivative functions.
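As a rough illustration of that idea, the sketch below shows a null service that never yields and a concrete service that yields to resolve the file path. The Lazy classes, the resolve_path lambda, and the "/tmp/working-copy" path are all hypothetical; the real create_derivatives methods do not currently take a block, so this only shows the shape such a change might take.

```ruby
# Hypothetical stand-ins; the real services do not accept a block today.
class LazyNullService
  # Does nothing and never yields, so the block's work is skipped.
  def create_derivatives(_file_path)
    nil
  end
end

class LazyImageService
  # Yields to let the caller resolve (or refresh) the path first.
  def create_derivatives(file_path)
    file_path = yield(file_path) if block_given?
    "thumbnail for #{file_path}"
  end
end

copies_made = 0

# Stand-in for the "ensure a fresh copy" step: reuse the path if the file
# exists, otherwise pretend to copy from the repository.
resolve_path = lambda do |file_path|
  next file_path if file_path && File.exist?(file_path)

  copies_made += 1
  "/tmp/working-copy" # hypothetical working-directory path
end

null_result  = LazyNullService.new.create_derivatives(nil, &resolve_path)
image_result = LazyImageService.new.create_derivatives(nil, &resolve_path)
```

After both calls, copies_made is 1: only the concrete service triggered the copy, and the null service returned without ever running the block.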

A final consideration is that we have ValkyrieCreateDerivativesJob#perform to consider. (See below)

def perform(_file_set_id, file_id, _filepath = nil)
  file_metadata = Hyrax.custom_queries.find_file_metadata_by(id: file_id)
  return if file_metadata.video? && !Hyrax.config.enable_ffmpeg
  # Get file into a local path.
  file = Hyrax.storage_adapter.find_by(id: file_metadata.file_identifier)
  # Call derivatives with the file_set.
  derivative_service = Hyrax::DerivativeService.for(file_metadata)
  derivative_service.create_derivatives(file.disk_path)
  # Trigger a reindex to get the thumbnail path.
  Hyrax.publisher.publish('file.metadata.updated', metadata: file_metadata, user: nil)
end

ShanaLMoore commented 1 year ago

What properties do derivatives get saved as? If they have their own, what would it look like to upload them and will OCR still work?

jeremyf commented 1 year ago

The IIIF Print gem does not presently have any logic regarding OCR of the files. Instead, OCR is something called within other derivative services (see https://github.com/samvera/hyrax/blob/64c0bbf0dc0d3e1b49f040b50ea70d177cc9d8f6/app/services/hyrax/file_set_derivatives_service.rb#L123-L127).

If we want to run OCR on an intermediate file (e.g. one that is already a derivative), we will need to revisit how we're making the IIIF Print plugin.

Doing so would mean amending IiifPrint::PluggableDerivativeService.

jillpe commented 1 year ago

closed in sprint 2/20/2023