Add new entity classes UserInfo and Affiliation

RKrahl commented 3 years ago

Please note: for the moment, this is more an idea that would need thorough discussion, rather then a fully elaborated proposal. In particular because it would be disruptive.

In #200, I proposed schema extensions to support data publication workflows. This proposal was addressing a particular use case of making curated data publications. It included a new entity DataPublicationUser that was needed because “affiliations and names of users may change over time. That's is why we need to keep track of affiliation and name at the time of the publication. Relying on the corresponding attributes of User only may not be enough.”

There was a longer discussion in that issue whether a DataPublication should be related to a DataCollection as proposed or to an Investigation. There was no conclusion, but I believe what @dfq16044 actually wanted to achieve was to get certain features out of this proposal to be applicable also to the different use case of simply minting DOIs for investigations instead. One of these features she wanted to retain for that other use case was that ‘freezing’ of user attributes at the time of minting the DOI.

Now, if one contemplates more deeply about this, I believe the core of the problem is the double nature of the User class in our current schema. At one hand, we need it to represent a person, in order to provide a means to authenticate that person, to manage access privileges and to record relations to other entities such as investigations. On the other hand, we use it to store attributes like givenName, familyName, email, and affiliation of that person that are to be used in a particular context. While the person always remains the same and we need a persistent record of that person, the attributes are subject to change and may need to be different according to the context. Until now, we were only interested in the current state and the latest values for these attributes, so this did not become a problem. The data publication was the first use case that needed to keep a record of the attributes at some particular moment in time. The discussion revealed that this is not the only case where this is needed and I suspect more use cases will come in the future. So it seems, we need to address this issue fundamentally.

The idea for a solution would be to separate the two purposes of the User class: keep User to represent the person and add another class UserInfo for the potentially volatile attributes of that person. More then one UserInfo instances may exist for the same User, but each UserInfo only relates to one User. At the same time, there is another issue with affiliation: an author of a publication may need to have more then one affiliation in that publication. So the proposal suggests the following classes:

`User`

A user of the facility

Constraint: name

Relationships:

Card	Class	Field
0,*	UserInfo	infos
0,*	UserGroup	userGroups

Other fields:

Field	Type	Description
name	String[255] NOT NULL	The name of the user to match that provided by the authentication mechanism

`UserInfo`

Attributes of a user in a context

Constraint: user, preference

Relationships:

Card	Class	Field
1,1	User	user
0,*	Affiliation	affiliations
0,*	InvestigationUser	investigationUsers
0,*	Study	studies
0,*	InstrumentScientist	instrumentScientists

Other fields:

Field	Type	Description
preference	Integer NOT NULL	An arbitrary number to indicate whether this is the preferred UserInfo record related to the user
email	String [255]	An email address for the user
familyName	String [255]	The family name of the user
fullName	String [255]	May include title
givenName	String [255]	The given name of the user
orcidId	String [255]	An ORCID iD for the user

`Affiliation`

The home institute or other affiliation of a user

Constraint: user, name

Relationships:

Card	Class	Field
1,1	UserInfo	user

Other fields:

Field	Type	Description
name	String[1023] NOT NULL	Name and optionally address of the affiliation
pid	String [255]	Identifier such as ROR or ISNI

The user relations in InvestigationUser, Study, and InstrumentScientist would be changed to point to a UserInfo accordingly. Obviously, the DataPublicationUser implemented in #232 would need to be amended as follows:

`DataPublicationUser`

Author, e.g. creator of a or contributor to a data publication

Constraint: publication, user, contributorType

Relationships:

Card	Class	Field
1,1	DataPublication	publication
1,1	UserInfo	user

Other fields:

Field	Type	Description
contributorType	String[255] NOT NULL	See DataCite property contributorType or use "Creator"
orderKey	String[255]	Defines an order among the contributors

RKrahl commented 3 years ago

It might be useful to give some examples for common search cases to illustrate the changes that would result from this proposal.

Search for investigation related to the current user (e.g. populate the "My Data" tab in TopCAT):

SELECT i FROM Investigation i JOIN i.investigationUsers AS iu JOIN iu.user AS u WHERE u.name = :user

With the proposed schema change, this would become:

SELECT i FROM Investigation i JOIN i.investigationUsers AS iu JOIN iu.userInfo AS ui JOIN ui.user AS u WHERE u.name = :user

E.g. there is one additional class in the join of related objects.

Search for user name and role related to an investigation:

SELECT u.fullName, iu.role FROM User u JOIN u.investigationUsers AS iu JOIN iu.investigation i WHERE i.id = 7645

With the proposed schema change, this would become:

SELECT ui.fullName, iu.role FROM UserInfo ui JOIN ui.investigationUsers AS iu JOIN iu.investigation i WHERE i.id = 7645

E.g. there is almost no difference in the search expression, but the search would operate on UserInfo rather then on User.

Get the e-mail address of the current user:
```
SELECT u.email FROM User u WHERE u.name = :user
```
With the proposed schema change, this would become:
```
SELECT ui.email FROM UserInfo ui JOIN ui.user u WHERE ui.email IS NOT NULL AND u.name = :user ORDER BY ui.preference DESC LIMIT 0, 1
```
Since there may be more then one UserInfo objects for the same user, we need to pick the preferred one. This example demonstrates the use of the preference attribute in UserInfo.

kevinphippsstfc commented 3 years ago

I would have a concern about the UserInfo entity relating to the rules system.

Currently there are two fundamental rules that most ICATs are likely to have in place: Firstly for an InvestigationUser's access to Datafiles: SELECT df FROM Datafile df JOIN df.dataset d JOIN d.investigation i JOIN i.investigationUsers iu JOIN iu.user u WHERE u.name = :user Secondly for an InstrumentScientist's access to Datafiles: SELECT df FROM Datafile df JOIN df.dataset d JOIN d.investigation i JOIN i.investigationInstruments ii JOIN ii.instrument inst JOIN inst.instrumentScientists instSci JOIN instSci.user u WHERE u.name = :user

When a user (InvUser or InstSci) is requesting access to Datafiles (typically either by browsing in TopCAT or when the IDS is processing a retrieval request on their behalf) both of these rules are added to the query that finally gets executed on the database making that query potentially very time consuming to process. I am in the process of fixing an issue that we have on the Diamond IDS caused by just this problem, and I would be concerned that if we insert another entity (UserInfo) between InstrumentScientist/InvestigationUser and User then it makes these queries even worse.

The potential impact would need testing but this in itself is not easy. I think it only becomes apparent on an ICAT where you have billions of Datafiles and for users who have access to a lot of data (as InvUsers or InstSci - not as an admin user). This makes it difficult to test and is further complicated in the Diamond situation where we do not have a full database on the test environment due to the size of it.

kevinphippsstfc commented 3 years ago

I don't quite understand the preference field in UserInfo. Does it get populated with an autogenerated number from a sequence so that the UserInfos for a user can be ordered by the order they were inserted into the database? Or is the ingest software that is taking care of inserting and updating user details supposed to manage these numbers? Or is it more of a boolean flag to say "this is the current UserInfo for the user", in which case, should it be a boolean value?

RKrahl commented 3 years ago

@kevinphippsstfc, the concerns about performance are certainly an important point to take into account. I'm not convinced that this proposal would have a severe impact, though. It is true that it adds one table more to search in the query, but it essentially replaces one many to one relation from InvestigationUser to User by two, from InvestigationUser to UserInfo and from UserInfo to User. Many to one relations are straight forward to evaluate and most importantly, this schema change does not add more complexity to the query.

So as a bottom line, yes, you are right, we need to test this and evaluate the impact on performance. But this is not necessarily a show stopper and I would not be surprised if it turns out that it does not have a significant impact.

And finally, for the issue with the long running queries in ids.server, you already provided a fix that looks like a valid work around.

RKrahl commented 3 years ago

On your second question about the preference, the idea is the following: there may be several UserInfo instances with a different set of attributes related to a single User, maybe because that user was involved in multiple data publications. In that case, we would need to know which set of attributes to use. In the context of an investigation or a data publication this is defined by the relation between UserInfo and Investigation or DataPublicationUser respectively. But in the general case, we would need some means to know, which UserInfo is the preferred one. For instance, the web user interface may need to display the user's name and thus needs to pick one out of the possibly multiple name variants or if it needs to send an e-mail to the user, it needs to know which address to use, if there is more then one. The preference attribute is a very simple way to tag the preferred UserInfo instance: pick the one with the highest preference value. How to select this one is shown in the third query example above. The preference value is arbitrary and has no other meaning then to make this distinction.

How to manage that preference attribute is left to the local site to decide. Different policies are conceivable. I provide just two examples in Python syntax. Let's assume you define a function new_user_info() to set a new set of user's attributes. (To simplify things, I only consider email and fullName.) Using the current schema, it might look like:

def new_user_info(client, user, email=None, fullName=None):
    user.email = email
    user.fullName = fullName
    user.update()

One option to manage the preference might be based on the assumption to always prefer the set of attributes that is added last. In that case, we would need to create new UserInfo instances with a preference higher then any existing ones:

def new_user_info(client, user, email=None, fullName=None):
    try:
        query = Query(client, "UserInfo",
                  conditions={"user.id": "= %d" % user.id},
                  attribute='preference',
                  order=[('preference', 'DESC')],
              limit=(0, 1))
        prev_pref = client.assertedSearch(query)[0]
    except icat.SearchAssertionError:
        prev_pref = 0
    info = client.new("userInfo", user=user, preference=prev_pref+1,
                      email=email, fullName=fullName)
    info.create()
    return info

Another option might be to keep one dedicated preferred UserInfo instance that should always remain the same. That instance would be created once for each user with a high preference, lets say 1000. Adding a new UserInfo not interfering with that existing instance only requires setting an arbitrary preference that is unique and smaller then 1000:

def new_user_info(client, user, email=None, fullName=None):
    try:
        query = Query(client, "UserInfo",
                  conditions={"user.id": "= %d" % user.id,
                          "preference": "< 1000"},
                  attribute='preference',
                  order=[('preference', 'DESC')],
              limit=(0, 1))
        prev_pref = client.assertedSearch(query)[0]
    except icat.SearchAssertionError:
        prev_pref = 0
    info = client.new("userInfo", user=user, preference=prev_pref+1,
                      email=email, fullName=fullName)
    info.create()
    return info

agbeltran commented 3 years ago

Minor comment on this issue - in line with other models/schemas, I would name Affiliation as Organisation as the PID will come from ROR or other similar service.

RKrahl commented 3 years ago

Not sure if I agree. The Affiliation table is meant to represent the affiliation string for a particular user in a particular context, such as in a given publication. It will often not be the top level organization, but rather include the department and street address.

To illustrate, we have at present the following distinct affiliation entries for HZB in our data publications:

Helmholtz-Zentrum Berlin für Materialien und Energie, Hahn-Meitner-Platz 1, 14109 Berlin, Germany
Helmholtz-Zentrum Berlin für Materialien und Energie, Albert-Einstein-Str. 15, 12489 Berlin, Germany
Institute of Applied Materials, Helmholtz-Zentrum Berlin für Materialien und Energie, Hahn-Meitner-Platz 1, 14109 Berlin, Germany
Optics for Solar Energy, Helmholtz-Zentrum Berlin für Materialien und Energie, Albert-Einstein-Straße 16, 12489 Berlin

Each of these entries are about the same organisation, HZB and would have the same, HZB's ROR. Still the name attribute would differ. So, this is not a table of organisations, but rather a table of affiliation strings being used in a given context.

But I do agree that the discussion on that table's name is a minor issue.

RKrahl commented 6 months ago

As pointed out above, this has been proposed to address concerns that have been raised in the discussion of introducing DataPublication to the schema three years ago. Since then, these concerns have found a different solution. The underlying issue that we need to keep a snapshot of some user attributes that may potentially change with time, while still keeping the identity of the user, has been solved for the special case of data publications.

There is no current need for the UserInfo that was proposed here. And since this would be a rather disruptive change, we should not do it without a good reason. I thus propose to close this issue as wontfix for the time being. Still, we may revive the issue should we ever need to face that underlying issue in the general case.

icatproject / icat.server