ucsdlib / damsrepo

DAMS Repository
Other
4 stars 2 forks source link

All Fedora datastreams are marked as XML #85

Closed mcritchlow closed 5 years ago

mcritchlow commented 5 years ago

This issue came up with the Notch8 VRR work.

Essentially, the value of the Fedora datastream dsControlGroup for a given datastream is hard-coded in our fedora-datastream-profile as X which means XML/inline.

There is perhaps better documentation about the types of datastreams, but they're at least partly described here: https://wiki.duraspace.org/display/FEDORA38/REST+API#RESTAPI-addDatastream

We should:

This is causing the Rubydora code that is trying to check if an object's datastreams have changed to fail, because it's trying to compare XML datastreams for things like files, which it should not be doing.

    def content_changed?
      return false if ['E','R'].include? controlGroup
      return true if new? and !local_or_remote_content(false).blank? # new datastreams must have content

      if controlGroup == "X"
        if self.eager_load_datastream_content
          return !EquivalentXml.equivalent?(Nokogiri::XML(content), Nokogiri::XML(datastream_content))
        else
          return !EquivalentXml.equivalent?(Nokogiri::XML(content), Nokogiri::XML(@datastream_content))
        end
      else
        if self.eager_load_datastream_content
          return local_or_remote_content(false) != datastream_content
        else
          return local_or_remote_content(false) != @datastream_content
        end
      end
      super
    end
lsitu commented 5 years ago

@mcritchlow It seems like there are a couple of bigger issues that we may need to inspect and discuss before moving forward with a solution:

  1. What work_obj.datastreams should behave? Should it load any datastream contents or is it a hydra bug? I see damsrepo return the correct result for all datascreams in the test object bd24789869:

    <objectDatastreams xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dams="http://library.ucsd.edu/ontology/dams#" xmlns="http://www.fedora.info/definitions/1/0/access/" pid="bd24789869" baseURL="http://lib-hydratail-staging.ucsd.edu:8080/dams/fedora/" xsi:schemaLocation="http://www.fedora.info/definitions/1/0/access/ http://www.fedora-commons.org/definitions/1/0/listDatastreams.xsd">
    <datastream dsid="DC" label="Dublin Core Record for this object" mimeType="text/xml"/>
    <datastream dsid="RELS-EXT" label="Fedora Object-to-Object Relationship Metadata" mimeType="application/rdf+xml"/>
    <datastream dsid="rightsMetadata" label="" mimeType="text/xml"/>
    <datastream dsid="damsMetadata" label="DAMS RDF metadata" mimeType="application/rdf+xml"/>
    <datastream dsid="_rdf.xml" label="DAMS RDF serialized metadata" mimeType="application/rdf+xml"/>
    <datastream dsid="_1.mp4" label="" mimeType="video/mp4"/>
    <datastream dsid="_2.mp4" label="" mimeType="video/mp4"/>
    <datastream dsid="_3.jpg" label="" mimeType="image/jpeg"/>
    <datastream dsid="_4.jpg" label="" mimeType="image/jpeg"/>
    </objectDatastreams>
  2. What should method def content_changed? do? Does it necessary to compare the content since the datastream profile contains the size and checksum that should be good enough to determine any changes to the content file? Will the content file be empty with local_or_remote_content(false) that may cause other issues? How is the block if controlGroup == "X" triggered the error since all datastream profile should be the X dsControlGroup, which seems to be correct?

mcritchlow commented 5 years ago

@lsitu - So as for (1), I was just using .datastreams as an example. The issue really manifests itself in the VRR work when calling .save on an object. As far as I can tell reading through that method, it's trying to identify if changes were made and whether it actually needs to save them.

Here's where this is getting triggered https://github.com/samvera/rubydora/blob/v1.7.5/lib/rubydora/digital_object.rb#L284 which calls content_changed?

As for (2), it is hard to sort through. I think you're on to the real question, which to me is what are the potential side-effects of having file datastreams have a non-X control group. It seems like it would get us around this particular issue. But might cause others. Either way it's definitely wrong though.. The non-XML datastreams shouldn't be marked as XML.

I think for our purposes it seems like we want content_changed? to return false ALWAYS for any file datastreams. At least that's my initial impression, since we won't be adding an editor to actually manipulate file datastreams in the hydra app (at least i hope not.. :) ). It seems that if mark the file datastreams with the E control group, that would do the trick. I'm curious what you think.

mcritchlow commented 5 years ago

How is the block if controlGroup == "X" triggered the error since all datastream profile should be the X dsControlGroup, which seems to be correct?

I'm not sure I follow this part. I think the error is triggered specifically because every datastream has control group set to 'X'. But I may be misunderstanding you on this. I think this ties in to your question about whether this is a 'hydra bug'. I think in the way you framed it, I don't think so. I think the hydra code is trying to be 'smart' about only saving datastreams that have actually changed, and we're running into an issue where our repo is correctly returning all datastreams, but they're marked in a way that's perhaps not consistent with the Fedora3 spec? I'm not sure that's the best way to put it, but it seems like by marking all our object datastreams as 'X' control group, we're not following proper conventions/expectations that ActiveFedora has.

mcritchlow commented 5 years ago

It looks like if E or R is used as the control group, then the dsLocation must be set to the correct URL.

https://github.com/samvera/active_fedora/blob/6-7-stable/lib/active_fedora/datastream_collections.rb#L36

https://wiki.duraspace.org/display/FEDORA38/REST+API#RESTAPI-addDatastream

And it seems that we're doing that for the profile already https://github.com/ucsdlib/damsrepo/blob/develop/src/xsl/fedora-datastream-profile.xsl#L29

It does make me wonder why we fixed controlGroup in the first place to X, but that's probably too many years ago to sort out..

lsitu commented 5 years ago

@mcritchlow Yes, we could change it if that is the problem. But method .datastreams return the datastream content is the expected behavior? And if the method content_changed? just load the profile for the datastreams to comparing the change, then X is good to catch the data profile as RDF/XML to compare the content change in the file. If we need to load the datastream content, will the overhead be affordable for a huge tar ball?

mcritchlow commented 5 years ago

That's a good question, but my impression is that it's already loading the datastream content. When my terminal 'crashed' while doing initial debugging, it was clear that it was pulling down binary content. I'm wary of saying yes to your question of expected behavior, but we're dealing with such old versions of ActiveFedora and Rubydora that... who knows. I suspect the real, possibly only, answer to your question is we'd need to test.

lsitu commented 5 years ago

@mcritchlow Yes, then I think we can start from change the dsControlGroup to see how it goes. Or is it possible to update the method content_changed? to just compare the datastream profile for content file changes, which is in RDF/XML format that won't trigger the error?

mcritchlow commented 5 years ago

Hmm. I mean we could fork the gem and update that method. But I'd be nervous about taking on that responsibility. I suppose it's not obviously more risk than updating the dsControlGroup in all our file datastreams, but at least we own that. I don't know, what do you lean towards?

lsitu commented 5 years ago

@mcritchlow I've added PR https://github.com/ucsdlib/damsrepo/pull/86 for it. Let's test it to see how it goes for now. Thanks.

mcritchlow commented 5 years ago

Yeah I think this looks like a great test. Thanks @lsitu !