Indexing object's filesize using Inventory data might be unreliable

giulioturetta commented 3 years ago

When a new object is created and its metadata are indexed in Solr, the corresponding object might not yet exist in the foxml.ds MongoDB collection. The Solr record for the object then lacks the "size" field. On a later iteration the Reposcan script will find the new object and insert it into MongoDB, but indexing is not performed again unless the object's metadata are updated or indexing is triggered manually.

Possible solution: look up the path in the Fedora DB and stat the file directly when indexing.
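
A minimal sketch of the "stat the file directly" part, assuming the path has already been resolved by some other lookup. The helper names stat_filesize and add_size_to_index are hypothetical and not part of phaidra-api; the sketch only shows reading the size from disk and leaving the "size" field unset when the file is not there.

```perl
use strict;
use warnings;

# Hypothetical helper: return the size in bytes of the regular file at
# $path, or undef when the path is undefined or no such file exists.
sub stat_filesize {
    my ($path) = @_;
    return undef unless defined $path && -f $path;
    # Element 7 of stat() is the size in bytes (0 for an empty file).
    return (stat($path))[7];
}

# Hypothetical helper: set the "size" field of the index hash only when
# a size could actually be read from disk.
sub add_size_to_index {
    my ($index, $path) = @_;
    my $size = stat_filesize($path);
    $index->{size} = $size if defined $size;
    return $index;
}
```

Using stat() rather than -s here is a deliberate choice: -s returns a false value for an empty file, which would be indistinguishable from a failed lookup.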

giulioturetta commented 3 years ago

Fixed in https://github.com/phaidra/phaidra-api/commit/b833f91f241554774d47cc096eaa9476240161f9.

giulioturetta commented 3 years ago

@RastislavHudak Could the same fix apply to all _get_dsinfo_filesize calls so that OCTETS.0 doesn't need to be hardcoded?

yurj commented 3 years ago

_get_dsinfo_xml in ./PhaidraAPI/Model/Search.pm and dsinfo.cgi also have OCTETS.0 hardcoded. Fixing _get_dsinfo_xml will also fix all the calls to _get_dsinfo_filesize in DC and Datacite (but not dsinfo.cgi).

The fix is useful because you can then replace a file using the fedora-admin.sh client -> OCTETS -> import (setting the policy to replace). This replaces the file, but the end part of the name changes to OCTETS.1. The fix could use the code in Model/Index.pm:

    my $octets_model = PhaidraAPI::Model::Octets->new;
    my $parthres     = $octets_model->_get_ds_path($c, $pid, 'OCTETS');
    if ($parthres->{status} == 200) {
      $index{size} = -s $parthres->{path};
    }

to extract the size. So _get_dsinfo_xml (which is actually used only to retrieve the size) could be something like:

    sub _get_dsinfo_xml {
        my ($self, $c, $pid, $cmodel) = @_;

        my $start_dsinfo = '<di:dsinfo xmlns:di="http://phaidra.univie.ac.at/XML/dsinfo/V1.0" xmlns:exif="http://phaidra.univie.ac.at/XML/exif/V1.0"><di:filesize>';
        # Close both elements so the returned XML is well-formed.
        my $end_dsinfo = '</di:filesize></di:dsinfo>';

        # Default to 0 when the path lookup fails or the file cannot be read.
        my $filesize     = 0;
        my $octets_model = PhaidraAPI::Model::Octets->new;
        my $parthres     = $octets_model->_get_ds_path($c, $pid, 'OCTETS');
        if ($parthres->{status} == 200) {
            # Resolve the current OCTETS path instead of hardcoding OCTETS.0.
            $filesize = (-s $parthres->{path}) || 0;
        }
        return $start_dsinfo . $filesize . $end_dsinfo;
    }
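
For callers that this change would not cover (dsinfo.cgi, for example), the hardcoded OCTETS.0 could be avoided by picking the newest version ID from a list. The following is only a sketch: latest_ds_version is a hypothetical helper, and it assumes that the version with the highest numeric suffix is the current one and that the list of version IDs is available from somewhere else.

```perl
use strict;
use warnings;

# Hypothetical helper: given a datastream ID such as 'OCTETS' and a list
# of version IDs, return the ID with the highest numeric suffix, or
# undef when none of the IDs belongs to that datastream.
sub latest_ds_version {
    my ($dsid, @version_ids) = @_;
    my ($best_id, $best_n);
    for my $id (@version_ids) {
        # Accept only IDs of the exact form "<dsid>.<number>".
        next unless $id =~ /\A\Q$dsid\E\.(\d+)\z/;
        my $n = $1;
        # Compare numerically so that OCTETS.10 ranks above OCTETS.9.
        ($best_id, $best_n) = ($id, $n) if !defined $best_n || $n > $best_n;
    }
    return $best_id;
}

print latest_ds_version('OCTETS', 'OCTETS.0', 'OCTETS.1', 'DC.3'), "\n";    # prints "OCTETS.1"
```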

phaidra / phaidra-api

Indexing object's filesize using Inventory data might be unreliable #60