samvera / hydra-works

A ruby gem implementation of the PCDM Works domain model based on the Samvera software stack
Other
24 stars 14 forks source link

add_external_file_to_file_set vs. add_file_to_file_set issues #332

Open atz opened 7 years ago

atz commented 7 years ago

Serious copy-pasta, 90+% similarity:

Also both are classes, but each defines two other classes inside the outer one. That is something we conventionally do in module namespace, but not in classes. The service classes obscure that that they are actually mostly about Updater classes. That's where much of the commonality is, and that should be broken out or otherwise reused.

Ideally, these should be refactored to use a common base class or establish an inheritance.

Other problems:

type param

YARD says:

@param [RDF::URI or String] type URI for the RDF.type that identifies the file's role within the file_set

but that eventually gets passed to type_to_uri that can raise ArgumentError with the message 'Invalid file type. You must submit a URI or a symbol.' String or symbol, which is it? RDF::URI doesn't care since it just wants #to_s, in fact always calls #to_s on the first param received. Why do we care more that it does? (but not enough to match docs and error message?)

Transactionality

In production these methods will routinely be called in parallel on the same fileset, with batches of hundreds or thousands of objects filling as many job workers as are available. So why would it work to do (in #persist):

if current_file.new_record?
  file_set.save

when the file_set being saved has no guarantee of not overwriting changes from the 11 other processors saving divergently modified versions of what was initially the same object? If there is already some magic in place to avoid this, I haven't found it.

Keep in mind, for our generated FileSet class, the number of ancestors distinct from regular ruby Object ancestors, (FileSet.ancestors - Object.ancestors).count, is 84. Would you be able to guess which of them received the .save in question? The odds are good you would be wrong. Turns out it is ActiveFedora::Associations::Builder::Orders::FixFirstLast, which gets dynamically patched onto our model here: https://github.com/samvera/active_fedora/blob/master/lib/active_fedora/associations/builder/orders.rb#L38

I can get that far, but of course it calls super (in what is sure to be a long chain), but which one of them maybe applies magic transactional pixiedust, I can't say, but it doesn't matter since it itself calls save again via another method it monkey-patched onto our class.

atz commented 7 years ago

The risk of FileSet transactionality for Hyrax IngestJob may be limited by filesets being created per upload, but this is still an architectural flaw. During migrations, for example, a dozen add_file_to_file_set calls could quite plausibly be entertained on the same fileset in parallel.

atz commented 7 years ago

The Updater classes do:

if type.instance_of? Symbol

in one place and in another:

case type ... when String

This confused sniffing leads to actual logical/behavioral differences.

atz commented 7 years ago

Note, in certain configurations, the #read required of the file param is actually #read(INTEGER, STRING)