mjordan / riprap

A PREMIS-compliant fixity checking microservice.
MIT License

Create a fetchresourcelist plugin that queries Drupal for media to check #14

Closed mjordan closed 5 years ago

mjordan commented 6 years ago

Related to #6 and https://github.com/Islandora-CLAW/CLAW/issues/945.

We should have a fetchresourcelist plugin that queries Drupal for resources to check. The code below is a working proof of concept. It requires that the Drupal JSON API contrib module is enabled.

<?php
// The content type will need to be a Riprap admin option, as will the limit
// (the JSON API's maximum appears to be 50 items per page). In this example,
// we request 3 nodes starting at offset 2.
$page_url = "http://localhost:8000/jsonapi/node/islandora_object?page[offset]=2&page[limit]=3";

$page_output = file_get_contents($page_url);
$page_output = json_decode($page_output, true);

// Taxonomy terms to check will need to be a Riprap admin option.
$taxonomy_terms_to_check = array('/taxonomy/term/2'); // "Preservation master"

// At this point, we have a list of 3 nodes.
foreach ($page_output['data'] as $node) {
  $nid = $node['attributes']['nid']; 
  // Get the media associated with this node using the Islandora-supplied Manage Media View.
  $media_url = "http://admin:islandora@localhost:8000/node/" . $nid . "/media?_format=json";
  $media_data = file_get_contents($media_url);
  $media_data = json_decode($media_data);
  // Loop through all the media and pick the ones that
  // are tagged with terms in $taxonomy_terms_to_check.
  foreach ($media_data as $media) {
    if (count($media->field_tags)) {
      foreach ($media->field_tags as $term) {
        if (in_array($term->url, $taxonomy_terms_to_check)) {
          // @todo: Convert to the equivalent Fedora URL and add to the plugin's output.
          // @todo: Add option to not convert to Fedora URL if the site doesn't use Fedora.
          // In that case, we need to figure out how to get Drupal's checksum for the file over HTTP.
          var_dump($media->field_media_image[0]->url);
        }
      }
    }
  }
}

We will also need to persist the page number to request during the next scheduled job. This should probably go into a db table.
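A minimal sketch of how that persistence might look, using PDO with an in-memory SQLite database so it runs standalone. The table name `fetch_state`, its columns, the plugin key `from_drupal`, and the three helper functions are all hypothetical and not part of Riprap's schema. The sketch stores the JSON API offset rather than a page number, since that is what `page[offset]` expects.

```php
<?php
// Hypothetical sketch: table, column, and function names are illustrative only.

function initOffsetTable(PDO $db): void
{
    $db->exec(
        'CREATE TABLE IF NOT EXISTS fetch_state (
            plugin TEXT PRIMARY KEY,
            page_offset INTEGER NOT NULL
        )'
    );
}

function loadOffset(PDO $db, string $plugin): int
{
    $stmt = $db->prepare('SELECT page_offset FROM fetch_state WHERE plugin = :plugin');
    $stmt->execute([':plugin' => $plugin]);
    $value = $stmt->fetchColumn();
    // No row yet means this is the first run, so start at the beginning.
    return $value === false ? 0 : (int) $value;
}

function saveOffset(PDO $db, string $plugin, int $offset): void
{
    // Delete-then-insert keeps the sketch portable across SQL dialects.
    $db->beginTransaction();
    $del = $db->prepare('DELETE FROM fetch_state WHERE plugin = :plugin');
    $del->execute([':plugin' => $plugin]);
    $ins = $db->prepare('INSERT INTO fetch_state (plugin, page_offset) VALUES (:plugin, :offset)');
    $ins->execute([':plugin' => $plugin, ':offset' => $offset]);
    $db->commit();
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
initOffsetTable($db);

$limit = 3;
$offset = loadOffset($db, 'from_drupal');
// ... request page[offset]=$offset&page[limit]=$limit here ...
saveOffset($db, 'from_drupal', $offset + $limit);
```

A real implementation would presumably also reset the stored offset to 0 once a request returns fewer items than the limit, so that the next scheduled job starts over from the first page.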

mjordan commented 6 years ago

Once https://github.com/Islandora-Devops/migrate_7x_claw/pull/9 gets merged, the above code should look like:

<?php
// The content type will need to be a Riprap admin option, as will the limit
// (the JSON API's maximum appears to be 50 items per page). In this example,
// we request 3 nodes starting at offset 2.
$page_url = "http://localhost:8000/jsonapi/node/islandora_object?page[offset]=2&page[limit]=3";

$page_output = file_get_contents($page_url);
$page_output = json_decode($page_output, true);

// Taxonomy terms to check will need to be a Riprap admin option.
// "Original File" and "Preservation Master File"
$taxonomy_terms_to_check = array('/taxonomy/term/15', '/taxonomy/term/16');

// At this point, we have a list of 3 nodes.
foreach ($page_output['data'] as $node) {
  $nid = $node['attributes']['nid']; 
  // Get the media associated with this node using the Islandora-supplied Manage Media View.
  $media_url = "http://admin:islandora@localhost:8000/node/" . $nid . "/media?_format=json";
  $media_data = file_get_contents($media_url);
  $media_data = json_decode($media_data);
  // Loop through all the media and pick the ones that
  // are tagged with terms in $taxonomy_terms_to_check.
  foreach ($media_data as $media) {
    if (count($media->field_media_use)) {
      foreach ($media->field_media_use as $term) {
        if (in_array($term->url, $taxonomy_terms_to_check)) {
          // @todo: Convert to the equivalent Fedora URL by querying Gemini
          // using the value of $media->field_media_image[0]->target_uuid to get this type of response:
          // {
          //  "drupal":"http:\/\/localhost:8000\/_flysystem\/fedora\/masters\/testing_12_OBJ.jpg",
          //  "fedora":"http:\/\/localhost:8080\/fcrepo\/rest\/masters\/testing_12_OBJ.jpg"
          // }
          // The Fedora URL is the one Riprap needs to validate the fixity of.
          // @todo: Add option to not convert to Fedora URL if the site doesn't use Fedora.
          // In that case, we need to figure out how to get Drupal's checksum for the file over HTTP.
        }
      }
    }
  }
}

mjordan commented 5 years ago

According to https://www.drupal.org/docs/8/modules/jsonapi/sorting, we can:

mjordan commented 5 years ago

We'll also need to include Basic auth credentials in Riprap for the JSON API and Views REST.
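As a rough sketch, the credentials could be sent as an `Authorization` header instead of being embedded in the URL as in the proof of concept. The helper name `basicAuthContext` is hypothetical, and the `admin` / `islandora` values are just the ones used earlier in this thread; in Riprap they would come from configuration.

```php
<?php
// Hypothetical helper: builds a stream context that sends HTTP Basic credentials.
function basicAuthContext(string $user, string $password)
{
    $header = 'Authorization: Basic ' . base64_encode($user . ':' . $password);
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => $header,
        ],
    ]);
}

$context = basicAuthContext('admin', 'islandora');
// The same context can be passed to both kinds of request, for example
// (node ID 1 is a placeholder):
// $json = file_get_contents('http://localhost:8000/node/1/media?_format=json', false, $context);
```

Keeping the credentials out of the URL also avoids leaking them into logs that record request URLs.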

mjordan commented 5 years ago

Work in the issue-14 branch can now parse out the Drupal URLs of images attached to nodes:

php bin/console app:riprap:check_fixity
string(57) "http://localhost:8000/_flysystem/fedora/testing_8_OBJ.jpg"
string(57) "http://localhost:8000/_flysystem/fedora/testing_7_OBJ.jpg"
string(57) "http://localhost:8000/_flysystem/fedora/testing_6_OBJ.jpg"

This comes from each media entity's field_media_image field. We need to make sure that non-image files are also detected (i.e., what field do we use for non-image files?).

mjordan commented 5 years ago

Non-image files are in field_media_file.
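One way to handle both cases is to look in each field in turn. This is only a sketch: the helper `getMediaFileUrl` is hypothetical, and it assumes that entries in `field_media_file` expose a `url` property the same way `field_media_image` entries do in the Views REST output. The sample URL is made up for illustration.

```php
<?php
// Hypothetical helper: returns the first file URL found on a media entity,
// checking the image field first and then the generic file field.
function getMediaFileUrl(object $media): ?string
{
    foreach (['field_media_image', 'field_media_file'] as $field) {
        if (!empty($media->$field) && isset($media->{$field}[0]->url)) {
            return $media->{$field}[0]->url;
        }
    }
    // Neither field is populated on this media entity.
    return null;
}

// Illustrative input shaped like one decoded media entity.
$media = json_decode(
    '{"field_media_image": [], "field_media_file": [{"url": "http://localhost:8000/example.pdf"}]}'
);
var_dump(getMediaFileUrl($media));
```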

mjordan commented 5 years ago

The only thing not working yet is authenticating against Gemini using a JWT token.

mjordan commented 5 years ago

The app:riprap:plugin:fetchresourcelist:from:drupal plugin is complete, but I'm getting some strange behavior. When Riprap hits the last page of a JSON:API request, it throws a cURL error:

In CurlFactory.php line 186:

  cURL error 3: <url> malformed (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)  

However, the URL triggering this error works as expected (200 response code) when requested using curl on the command line, e.g., curl -v -uadmin:islandora "http://localhost:8000/jsonapi/node/islandora_object?page%5Blimit%5D=5&page%5Boffset%5D=10&sort=-changed".
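For reference, a percent-encoded URL of that form can be produced with PHP's `http_build_query()`. The function name `buildJsonApiUrl` and its parameters are hypothetical; this sketch only shows how the query string is assembled and does not by itself explain the error.

```php
<?php
// Hypothetical helper: builds a paged, sorted JSON:API URL for one content type.
function buildJsonApiUrl(string $baseUrl, string $contentType, int $limit, int $offset, string $sort): string
{
    // Nested arrays become page[limit] and page[offset], with the brackets
    // percent-encoded as %5B and %5D.
    $query = http_build_query([
        'page' => ['limit' => $limit, 'offset' => $offset],
        'sort' => $sort,
    ], '', '&');
    return rtrim($baseUrl, '/') . '/jsonapi/node/' . $contentType . '?' . $query;
}

echo buildJsonApiUrl('http://localhost:8000', 'islandora_object', 5, 10, '-changed'), PHP_EOL;
// Prints:
// http://localhost:8000/jsonapi/node/islandora_object?page%5Blimit%5D=5&page%5Boffset%5D=10&sort=-changed
```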

mjordan commented 5 years ago

OK, have tracked this down to an empty $media_list on a node.
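A defensive guard along these lines would skip nodes that have no media before any further requests are built from the list. This is a hypothetical sketch, not the fix that was actually committed; `decodeMediaList` is an invented helper name.

```php
<?php
// Hypothetical guard: always returns an array, so callers can safely test
// for "no media" before building any follow-up requests.
function decodeMediaList($json): array
{
    // file_get_contents() returns false on failure, so reject non-strings.
    if (!is_string($json) || $json === '') {
        return [];
    }
    $decoded = json_decode($json);
    return is_array($decoded) ? $decoded : [];
}

$media_list = decodeMediaList('[]');
if (count($media_list) === 0) {
    // Nothing is attached to this node: skip it rather than continuing
    // with missing data.
    echo 'No media for this node, skipping', PHP_EOL;
}
```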

mjordan commented 5 years ago

Closed with 342ba237f0448c82cbedf4b3d5ad78ac03697990.