icatproject / icat.server

The ICAT server offering both SOAP and "RESTlike" interfaces to a metadata catalog.
Other
1 stars 5 forks source link

ICAT schema extensions for ICAT Server 5.0 release #256

Closed EmilJunker closed 2 years ago

EmilJunker commented 2 years ago

This PR collects (i.e. obsoletes) several existing PRs and provides a clean, conflict free way to merge them into master.

Obsoletes PR #232, closes #200 closes #249, includes parts of #248. Obsoletes PR #233, closes #211, closes #238. Obsoletes PR #234, closes #230. Obsoletes PR #235, closes #228.

EmilJunker commented 2 years ago

I've done some testing and it turned out that renaming investigationSize and datasetSize to size as proposed in https://github.com/icatproject/icat.server/pull/233#issuecomment-896870524 caused problems with Oracle because SIZE is apparently a registered keyword in PL/SQL. That is why I have now renamed the size attribute to fileSize for both Investigations and Datasets. This works well and is also in line with the fileCount attribute name.

RKrahl commented 2 years ago

I've done some testing and it turned out that renaming investigationSize and datasetSize to size as proposed in #233 (comment) caused problems with Oracle because SIZE is apparently a registered keyword in PL/SQL. That is why I have now renamed the size attribute to fileSize for both Investigations and Datasets. This works well and is also in line with the fileCount attribute name.

Note: as an alternative to renaming the attribute, one could also override the column name in the database using the Column annotation. E.g. something like:

public class Investigation extends EntityBaseBean implements Serializable {

    @Comment("The cumulated sizes of the datasets within this investigation")
    @Column(name = "SIZE_")
    private Long size;

    @Comment("The total number of datafiles within this investigation")
    private Long fileCount;
}

I'd be happy with either solution. To be discussed in the collaboration meeting.

RKrahl commented 2 years ago

I get an error from the upgrade_mysql_5_0.sql script trying to upgrade from icat.server 4.11.1:

$ mysql icat < upgrade_mysql_5_0.sql
ERROR 1025 (HY000) at line 117: Error on rename of './icat/#sql-76_54' to './icat/AFFILIATION' (errno: 150 "Foreign key constraint is incorrectly formed")
RKrahl commented 2 years ago

I get an error from the upgrade_mysql_5_0.sql script trying to upgrade from icat.server 4.11.1:

There is an apparently related issue when deploying icat.server during a fresh install. I get the following suspicious entry in the server.log:

[2021-11-18T11:34:04.093+0100] [Payara 4.1] [WARNING] [] [javax.org.glassfish.persistence.org.glassfish.persistence.common] [tid: _ThreadID=32 _ThreadName=admin-thread-pool::admin-listener(1)] [timeMillis: 1637231644093] [levelValue: 900] [[
  PER01000: Got SQLException executing statement "ALTER TABLE AFFILIATION ADD CONSTRAINT FK_AFFILIATION_DATAPUBLICATIONUSER_ID FOREIGN KEY (DATAPUBLICATIONUSER_ID) REFERENCES DATAPUBLICATIONUSER (ID)": java.sql.SQLException: Can't create table `icat`.`AFFILIATION` (errno: 150 "Foreign key constraint is incorrectly formed")]]

Apart from that log entry, the installation seem to succeed. I believe that icat.server simply ignores SQL errors during install. But if one inspects the SQL dump after installation, it turns out that the foreign key constraint is indeed missing from the AFFILIATION table:

CREATE TABLE `AFFILIATION` (
  `ID` bigint(20) NOT NULL AUTO_INCREMENT,
  `CREATE_ID` varchar(255) NOT NULL,
  `CREATE_TIME` datetime NOT NULL,
  `MOD_ID` varchar(255) NOT NULL,
  `MOD_TIME` datetime NOT NULL,
  `NAME` varchar(1023) NOT NULL,
  `PID` varchar(255) DEFAULT NULL,
  `DATAPUBLICATIONUSER_ID` bigint(20) NOT NULL,
  PRIMARY KEY (`ID`),
  UNIQUE KEY `UNQ_AFFILIATION_0` (`DATAPUBLICATIONUSER_ID`,`NAME`) USING HASH
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
RKrahl commented 2 years ago

I get an error from the upgrade_mysql_5_0.sql script trying to upgrade from icat.server 4.11.1:

Ok, I found it.

I got more details about that foreign key error running SHOW ENGINE INNODB STATUS in the database:

2021-11-18 11:21:13 0x7f38402a8700 Error in foreign key constraint of table icat/AFFILIATION:
there is no index in the table which would contain
the columns as the first columns, or the data types in the
table do not match the ones in the referenced table
or one of the ON ... SET NULL columns is declared NOT NULL. Constraint:
,
  CONSTRAINT `FK_AFFILIATION_DATAPUBLICATIONUSER_ID` FOREIGN KEY (`DATAPUBLICATIONUSER_ID`) REFERENCES `DATAPUBLICATIONUSER` (`ID`)

Actually, there is an index in the table which contains DATAPUBLICATIONUSER_ID as the first column, but it is declared USING HASH, see above. Apparently such an index is unusable for a foreign key, which is kind of logic. It seems that the reason for the USING HASH declaration was that the name attribute having 1023 characters is too wide to be used in an index directly. In any case, if one reduces the name attribute to 511 characters (which I guess, will still be enough), everything works.

RKrahl commented 2 years ago

This PR adds or modifies the following entity classes:

New classes

Affiliation

The home institute or other affiliation of a user in the context of a data publication

Uniqueness constraint: user, name

Relationships:

Card Class Field
1,1 DataPublicationUser user

Other fields:

Field Type Description
name String [255] NOT NULL An internal name for that affiliation entry, possibly the organization name
pid String [255] Identifier such as ROR or ISNI
fullReference String [1023] The full reference of the affiliation, optionally including street address and department, as it should appear in the publication

DataCollectionInvestigation

Represents a many-to-many relationship between a DataCollection and its Investigations.

Uniqueness constraint: dataCollection, investigation

Relationships:

Card Class Field
1,1 DataCollection dataCollection
1,1 Investigation investigation

DataPublication

A curated data publication

Uniqueness constraint: facility, pid

Relationships:

Card Class Field
1,1 Facility facility
1,1 DataCollection content
0,1 DataPublicationType type
0,* DataPublicationDate dates
0,* DataPublicationFunding fundingReferences
0,* DataPublicationUser users
0,* RelatedIdentifier relatedIdentifiers

Other fields:

Field Type Description
pid String [255] NOT NULL Persistent Identifier of the publication, such as a DOI
title String [255] NOT NULL Title of the publication
publicationDate Date Date when the data was made publicly available
description String [4000] Abstract
subject String [1023] List of keywords

DataPublicationDate

A date relevant for the publication

Uniqueness constraint: publication, dateType

Relationships:

Card Class Field
1,1 DataPublication publication

Other fields:

Field Type Description
dateType String [255] NOT NULL Type of the date, see DataCite property dateType for suggested values
date String [255] NOT NULL Use ISO 8601 format, may also be a range

DataPublicationFunding

Represents a many-to-many relationship between a data publication and a funding reference

Uniqueness constraint: dataPublication, funding

Relationships:

Card Class Field
1,1 FundingReference funding
1,1 DataPublication dataPublication

DataPublicationType

A type of data publication, for example, whole investigation, user-selected datasets/files. This is likely to be for facility internal purposes following their own classification scheme and allowing, for example, the front end to display them in different ways.

Uniqueness constraint: facility, name

Relationships:

Card Class Field
1,1 Facility facility
0,* DataPublication dataPublications

Other fields:

Field Type Description
name String [255] NOT NULL A short name identifying this data publication type within the facility
description String [255] A description of this data publicaion type

DataPublicationUser

Author, e.g. creator of a or contributor to a data publication

Uniqueness constraint: publication, user, contributorType

Relationships:

Card Class Field
1,1 DataPublication publication
1,1 User user
0,* Affiliation affiliations

Other fields:

Field Type Description
contributorType String [255] NOT NULL Role of that user in the publication, see DataCite property contributorType for suggested values or use "Creator"
orderKey String [255] Defines an order among the contributors
givenName String [255] The given name of the user
familyName String [255] The family name of the user
fullName String [255] May include title

DatasetInstrument

Represents a many-to-many relationship between an instrument and a dataset with data collected at that instrument

Uniqueness constraint: dataset, instrument

Relationships:

Card Class Field
1,1 Dataset dataset
1,1 Instrument instrument

DatasetTechnique

Represents a many-to-many relationship between a dataset and the experimental technique being used to create that Dataset

Uniqueness constraint: dataset, technique

Relationships:

Card Class Field
1,1 Dataset dataset
1,1 Technique technique

FundingReference

Information about financial support

Uniqueness constraint: funderName, awardNumber

Relationships:

Card Class Field
0,* DataPublicationFunding publications
0,* InvestigationFunding investigations

Other fields:

Field Type Description
funderName String [255] NOT NULL Name of the funding entity
funderIdentifier String [255] Unique identifier of the funding entity, such as a Crossref Funder ID
awardNumber String [255] NOT NULL Code assigned by the funder to identify the grant, suggest to use ":unas" if there is no such identifier
awardTitle String [255] Title or name of the grant

InvestigationFacilityCycle

Many to many relationship between investigation and facilityCycle. Allows investigations to belong to multiple cycles at once.

Uniqueness constraint: facilityCycle, investigation

Relationships:

Card Class Field
1,1 Investigation investigation
1,1 facilityCycle facilityCycle

InvestigationFunding

Represents a many-to-many relationship between an investigation and a funding reference

Uniqueness constraint: investigation, funding

Relationships:

Card Class Field
1,1 Investigation investigation
1,1 FundingReference funding

RelatedIdentifier

Identifier of a related resource to a data publication

Uniqueness constraint: publication, identifier

Relationships:

Card Class Field
1,1 DataPublication publication

Other fields:

Field Type Description
identifier String [255] NOT NULL The identifier of the related resource
relationType String [255] NOT NULL Description of the relationship with the related resource, see DataCite property relationType for suggested values
fullReference String [1023] The full reference for the related resource as it should be displayed on the landing page

Technique

Represents an experimental technique

Uniqueness constraint: name

Relationships:

Card Class Field
0,* DatasetTechnique datasetTechniques

Other fields:

Field Type Description
name String [255] NOT NULL A short name for the technique
pid String [255] A persistent identifier attributed to the technique, ideally referring to a vocabulary term
description String [255] An informal description for the technique

Modified classes

Not mentioned here: entity classes that are the target of a many-to-one relation from one of the new classes get the corresponding reverse one-to-many relation added.

Dataset

New fields:

Field Type Description
fileCount Long The total number of datafiles within this dataset
fileSize Long The cumulated sizes of the data files within this dataset

Investigation

New fields:

Field Type Description
fileCount Long The total number of datafiles within this investigation
fileSize Long The cumulated sizes of the datasets within this investigation

Edit: update Affiliation class as a result of the merge of #260 Edit: update after the merge of #271, #283, and #263

RKrahl commented 2 years ago

I mentioned in the meeting last Thursday that there might be an issue with new entity classes and the export and import calls in icat.server. I checked this and was proven wrong: icat.server turns out to be better than I thought, it just works, even for the new classes.

There is however one issue with the fileCount and fileSize attributes in Dataset and Investigation. If the database triggers are in places, the values will be wrong after an export / import cycle.

kevinphippsstfc commented 2 years ago

Thanks @RKrahl. That's good news about the export and import just working for the new classes.

It's a shame about the fileCount and fileSize fields. Can this be fixed or would it be too much work to be worth it?

My feeling is that if it can't be fixed easily then we have to live with that and probably just add a warning to any documentation that (hopefully!) exists on the export/import functionality. The workaround after doing an import would presumably be to either re-calculate all the values using the INITIALIZE_DS/INV_SIZE_COUNT procedures in the ICAT 5 migration scripts (assuming you have a fairly small ICAT), or to edit the procedure to only update the Investigations that you have modified if you have a large ICAT and it's too expensive to re-calculate all values in the whole ICAT.

RKrahl commented 2 years ago

It's a shame about the fileCount and fileSize fields. Can this be fixed or would it be too much work to be worth it?

I'd say, it can't easily be fixed. It is due to the design decisions we made for the fileCount / fileSize feature: we decided to implement the feature with triggers in the database backend and to make installing the triggers optional. That means, icat.server has no way to know whether the triggers are in place. The triggers do not recalculate the values but do only incremental updates when a Datafile is created, deleted or updated. Recalculating would have been too expensive performance-wise. And we decided not to enforce correct values on create or update of Database and Investigation objects. So it's always possible to create a Dataset or an Investigation with "wrong" values.

Now, what happens is: the export writes Investigations and Datasets with their (presumably correct) attribute values to the export file. The import reads that file and creates Investigations and Datasets with the (correct) values they have set in the file. Then the import reads the Datafiles from the file and creates them. This causes the database triggers to update fileCount and fileSize in the corresponding Dataset and Investigation objects. bummer: as result, the values are doubled.

My feeling is that if it can't be fixed easily then we have to live with that and probably just add a warning to any documentation that (hopefully!) exists on the export/import functionality.

Yes.

The workaround after doing an import would presumably be to either re-calculate all the values

That would be my advise. As an alternative, one could temporarily disable the triggers during the import.

[…] if you have a large ICAT and it's too expensive to re-calculate all values in the whole ICAT.

I don't believe import or export are ever used on a large production ICAT.

RKrahl commented 2 years ago

I now merged #260 into this branch and updated the comment describing the new classes above accordingly.

RKrahl commented 2 years ago

I now merged the other pending PRs requesting schema changes into this one. From my point of view, it's ready. I did not test this with Oracle, though. @kevinphippsstfc, could you please check again.

kevinphippsstfc commented 2 years ago

Thanks @RKrahl. I will test this branch on Oracle.

kevinphippsstfc commented 2 years ago

I've done a mvn install with the the test setup using an Oracle database and the tests all pass apart from: TestRS.testSearch:691->search:1034 expected:<4> but was:<5> which is the test that PR #275 aims to fix. Did you manage to take a look at that yet?

kevinphippsstfc commented 2 years ago

I've done a mvn install with the the test setup using an Oracle database and the tests all pass apart from: TestRS.testSearch:691->search:1034 expected:<4> but was:<5> which is the test that PR #275 aims to fix. Did you manage to take a look at that yet?

RKrahl commented 2 years ago

I've done a mvn install with the the test setup using an Oracle database and the tests all pass apart from: TestRS.testSearch:691->search:1034 expected:<4> but was:<5> which is the test that PR #275 aims to fix. Did you manage to take a look at that yet?

Not yet, that will be my next action tomorrow.