icatproject / icat.oaipmh

OAI-PMH implementation for ICAT
Apache License 2.0

Leverage JPQL search expressions in the configuration #27

Open RKrahl opened 1 year ago

RKrahl commented 1 year ago

The current configuration is too complicated and inefficient. It might be simplified if we could leverage JPQL search expressions in the config files.

Due to the design of icat.oaipmh, the configuration must first specify which properties of which ICAT objects are to be considered for an object disseminated over OAI-PMH. From this, an internal XML representation of these objects is created. In a second step, this internal representation is transformed using XSLT.

Just to compile all the ICAT entity objects needed for the metadata of a data publication, the following configuration lines are needed:

# Identifiers for the configuration of metadata to be retrieved from ICAT
data.configurations = datapub

# Relevant data objects and properties for each data configuration
data.datapub.mainObject = DataPublication

data.datapub.stringProperties = pid title description subject
data.datapub.numericProperties = id
data.datapub.dateProperties = publicationDate
data.datapub.subPropertyLists = users dates relatedItems fundingReferences content

data.datapub.users.stringProperties = orderKey fullName givenName familyName contributorType email
data.datapub.users.subPropertyLists = user affiliations
data.datapub.users.user.stringProperties = orcidId
data.datapub.users.affiliations.stringProperties = name pid fullReference

data.datapub.dates.stringProperties = dateType date

data.datapub.relatedItems.stringProperties = identifier relationType fullReference relatedItemType title

data.datapub.fundingReferences.subPropertyLists = funding
data.datapub.fundingReferences.funding.stringProperties = funderIdentifier funderName awardNumber awardTitle

data.datapub.content.subPropertyLists = dataCollectionDatasets
data.datapub.content.dataCollectionDatasets.subPropertyLists = dataset
data.datapub.content.dataCollectionDatasets.dataset.numericProperties = fileSize
data.datapub.content.dataCollectionDatasets.dataset.subPropertyLists = datafiles
data.datapub.content.dataCollectionDatasets.dataset.datafiles.subPropertyLists = datafileFormat
data.datapub.content.dataCollectionDatasets.dataset.datafiles.datafileFormat.stringProperties = type

This seems to be too clumsy.

Roughly the same could be achieved with a single JPQL search expression:

SELECT dp FROM DataPublication dp INCLUDE dp.content AS dc, dc.dataCollectionDatafiles AS dcdf, dcdf.datafile AS df1, df1.datafileFormat, dc.dataCollectionDatasets AS dcds, dcds.dataset AS ds, ds.datafiles AS df2, df2.datafileFormat, dp.dates, dp.fundingReferences AS dpfun, dpfun.funding, dp.relatedItems, dp.users AS dpu, dpu.affiliations, dpu.user
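To make the correspondence concrete, here is a minimal Python sketch that reads only the subPropertyLists keys from the configuration quoted above and lists the relation paths they imply. The function name relation_paths and the parsing logic are illustrative assumptions, not part of icat.oaipmh.

```python
# subPropertyLists lines copied from the configuration quoted above.
CONFIG = """
data.datapub.subPropertyLists = users dates relatedItems fundingReferences content
data.datapub.users.subPropertyLists = user affiliations
data.datapub.fundingReferences.subPropertyLists = funding
data.datapub.content.subPropertyLists = dataCollectionDatasets
data.datapub.content.dataCollectionDatasets.subPropertyLists = dataset
data.datapub.content.dataCollectionDatasets.dataset.subPropertyLists = datafiles
data.datapub.content.dataCollectionDatasets.dataset.datafiles.subPropertyLists = datafileFormat
"""


def relation_paths(config_text, prefix="data.datapub"):
    """Collect the dotted relation paths implied by the subPropertyLists keys.

    Hypothetical helper for illustration only; it is not icat.oaipmh code.
    """
    suffix = ".subPropertyLists"
    paths = []
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        key = key.strip()
        if not (key.startswith(prefix) and key.endswith(suffix)):
            continue  # not a relation list for this data configuration
        # The part between prefix and suffix is the path of the parent object.
        parent = key[len(prefix):-len(suffix)].lstrip(".")
        for child in value.split():
            paths.append(f"{parent}.{child}" if parent else child)
    return paths


if __name__ == "__main__":
    for path in relation_paths(CONFIG):
        print(path)
```

Each printed path corresponds to one relation in the INCLUDE clause above. The INCLUDE clause additionally follows dataCollectionDatafiles, which the quoted configuration does not list, hence "roughly the same".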

Furthermore, the internal XML representation roughly corresponds one to one to the ICAT schema. This means that if we want to include the experimental techniques being used in an investigation, we need to include all datasets from that investigation in the internal representation, which might look something like:

<metadata>
  <datasets>
    <instance>
      <datasetTechniques>
        <instance>
          <technique>
            <name>neutron diffraction</name>
            <pid>PaNET:PaNET01217</pid>
          </technique>
        </instance>
      </datasetTechniques>
    </instance>
    <instance>
      <!-- ... -->
    </instance>
    <instance>
      <!-- ... -->
    </instance>
    <!-- ... -->
  </datasets>
  <!-- ... -->
</metadata>

Note that there may be hundreds of datasets in one investigation. Often they all have the same technique, but that is not guaranteed. The distinct techniques must then be extracted from this representation using XSLT, which is also somewhat involved.
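To illustrate the deduplication step, here is a rough Python sketch of what the XSLT has to accomplish: walk every dataset instance and keep each technique only once. The sample document repeats the technique from the example above in two datasets; the function name distinct_techniques is an assumption for illustration, and the real component does this in XSLT, not Python.

```python
import xml.etree.ElementTree as ET

# Sample internal representation shaped like the example above:
# two datasets that happen to share the same technique.
SAMPLE = """
<metadata>
  <datasets>
    <instance>
      <datasetTechniques>
        <instance>
          <technique>
            <name>neutron diffraction</name>
            <pid>PaNET:PaNET01217</pid>
          </technique>
        </instance>
      </datasetTechniques>
    </instance>
    <instance>
      <datasetTechniques>
        <instance>
          <technique>
            <name>neutron diffraction</name>
            <pid>PaNET:PaNET01217</pid>
          </technique>
        </instance>
      </datasetTechniques>
    </instance>
  </datasets>
</metadata>
"""


def distinct_techniques(xml_text):
    """Return the distinct (pid, name) pairs in document order."""
    root = ET.fromstring(xml_text.strip())
    seen = {}  # dict keys keep insertion order and drop duplicates
    path = "./datasets/instance/datasetTechniques/instance/technique"
    for tech in root.iterfind(path):
        seen.setdefault((tech.findtext("pid"), tech.findtext("name")), None)
    return list(seen)


if __name__ == "__main__":
    print(distinct_techniques(SAMPLE))
```

Even in this small form, every dataset has to be loaded and traversed just to arrive at a single technique entry.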

In principle, we could select the list of distinct techniques related to an investigation using one simple JPQL search statement like:

SELECT DISTINCT(t) FROM Technique t JOIN t.datasetTechniques AS dst JOIN dst.dataset AS ds JOIN ds.investigation AS i WHERE i.id = %d

(where the %d would need to be substituted with the internal id of that investigation.)
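The substitution could be sketched as follows. This is a Python illustration only: icat.oaipmh itself is a Java component, and the name build_query as well as the integer check are assumptions, not existing code. Restricting the placeholder to an integer keeps arbitrary text out of the search expression.

```python
# Query template from above; %d is the placeholder for the investigation id.
TECHNIQUE_QUERY = (
    "SELECT DISTINCT(t) FROM Technique t "
    "JOIN t.datasetTechniques AS dst "
    "JOIN dst.dataset AS ds "
    "JOIN ds.investigation AS i "
    "WHERE i.id = %d"
)


def build_query(template, object_id):
    """Fill the %d placeholder with the internal id of the main object.

    Hypothetical helper: only plain integers are accepted, so nothing
    other than a number can end up in the query.
    """
    if isinstance(object_id, bool) or not isinstance(object_id, int):
        raise TypeError("object id must be an integer")
    return template % object_id


if __name__ == "__main__":
    # 4711 is an arbitrary example id.
    print(build_query(TECHNIQUE_QUERY, 4711))
```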

So if we could compile the internal XML representation from a couple of JPQL searches configured in the config file, things might become significantly simpler.

RKrahl commented 1 year ago

I just noticed yet another benefit of this approach: at the moment, it is not possible to disseminate only a subset of the objects for a data configuration. In run.properties, one must specify for each data configuration the ICAT object which will be the main source of information when retrieving metadata from ICAT. The icat.oaipmh component will then unconditionally disseminate all objects of that type. There is no way to add a condition that filters the objects.