Closed RKrahl closed 6 months ago
It might be useful to give some examples for common search cases to illustrate the changes that would result from this proposal.
Search for investigations related to the current user (e.g. to populate the "My Data" tab in TopCAT):
SELECT i FROM Investigation i JOIN i.investigationUsers AS iu JOIN iu.user AS u WHERE u.name = :user
With the proposed schema change, this would become:
SELECT i FROM Investigation i JOIN i.investigationUsers AS iu JOIN iu.userInfo AS ui JOIN ui.user AS u WHERE u.name = :user
I.e. there is one additional class in the join of related objects.
Search for user name and role related to an investigation:
SELECT u.fullName, iu.role FROM User u JOIN u.investigationUsers AS iu JOIN iu.investigation i WHERE i.id = 7645
With the proposed schema change, this would become:
SELECT ui.fullName, iu.role FROM UserInfo ui JOIN ui.investigationUsers AS iu JOIN iu.investigation i WHERE i.id = 7645
I.e. there is almost no difference in the search expression, but the search would operate on UserInfo rather than on User.
Get the e-mail address of the current user:
SELECT u.email FROM User u WHERE u.name = :user
With the proposed schema change, this would become:
SELECT ui.email FROM UserInfo ui JOIN ui.user u WHERE ui.email IS NOT NULL AND u.name = :user ORDER BY ui.preference DESC LIMIT 0, 1
Since there may be more than one UserInfo object for the same user, we need to pick the preferred one. This example demonstrates the use of the preference attribute in UserInfo.
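To make the selection logic of that last query explicit, here is a minimal sketch in plain Python that mirrors it on in-memory records. The UserInfoRecord class, the preferred_email() helper and the sample values are hypothetical and only serve as an illustration; they are not part of the ICAT schema or of any client library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserInfoRecord:
    """Hypothetical stand-in for one UserInfo row (not the ICAT entity)."""
    user_name: str
    preference: int
    email: Optional[str] = None


def preferred_email(infos, user_name):
    """Return the email of the highest-preference record that has one.

    Mirrors: WHERE ui.email IS NOT NULL AND u.name = :user
             ORDER BY ui.preference DESC LIMIT 0, 1
    """
    candidates = [
        ui for ui in infos
        if ui.user_name == user_name and ui.email is not None
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda ui: ui.preference).email


# Made-up sample data for illustration only.
infos = [
    UserInfoRecord("db/jdoe", preference=1, email="old@example.org"),
    UserInfoRecord("db/jdoe", preference=3, email=None),
    UserInfoRecord("db/jdoe", preference=2, email="new@example.org"),
]
print(preferred_email(infos, "db/jdoe"))
```

Note that the record with the highest preference is skipped here because it has no email, which is the reason for the IS NOT NULL condition in the query.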
I would have a concern about the UserInfo entity relating to the rules system.
Currently there are two fundamental rules that most ICATs are likely to have in place:
Firstly for an InvestigationUser's access to Datafiles:
SELECT df FROM Datafile df JOIN df.dataset d JOIN d.investigation i JOIN i.investigationUsers iu JOIN iu.user u WHERE u.name = :user
Secondly for an InstrumentScientist's access to Datafiles:
SELECT df FROM Datafile df JOIN df.dataset d JOIN d.investigation i JOIN i.investigationInstruments ii JOIN ii.instrument inst JOIN inst.instrumentScientists instSci JOIN instSci.user u WHERE u.name = :user
When a user (InvUser or InstSci) is requesting access to Datafiles (typically either by browsing in TopCAT or when the IDS is processing a retrieval request on their behalf), both of these rules are added to the query that finally gets executed on the database, making that query potentially very time-consuming to process. I am in the process of fixing an issue that we have on the Diamond IDS caused by just this problem, and I would be concerned that inserting another entity (UserInfo) between InstrumentScientist/InvestigationUser and User would make these queries even worse.
The potential impact would need testing but this in itself is not easy. I think it only becomes apparent on an ICAT where you have billions of Datafiles and for users who have access to a lot of data (as InvUsers or InstSci - not as an admin user). This makes it difficult to test and is further complicated in the Diamond situation where we do not have a full database on the test environment due to the size of it.
I don't quite understand the preference field in UserInfo. Does it get populated with an autogenerated number from a sequence so that the UserInfos for a user can be ordered by the order they were inserted into the database? Or is the ingest software that is taking care of inserting and updating user details supposed to manage these numbers? Or is it more of a boolean flag to say "this is the current UserInfo for the user", in which case, should it be a boolean value?
@kevinphippsstfc, the concerns about performance are certainly an important point to take into account. I'm not convinced that this proposal would have a severe impact, though. It is true that it adds one more table to search in the query, but it essentially replaces one many-to-one relation from InvestigationUser to User by two, from InvestigationUser to UserInfo and from UserInfo to User. Many-to-one relations are straightforward to evaluate and, most importantly, this schema change does not add more complexity to the query.
So as a bottom line, yes, you are right, we need to test this and evaluate the impact on performance. But this is not necessarily a show stopper and I would not be surprised if it turns out that it does not have a significant impact.
And finally, for the issue with the long running queries in ids.server, you already provided a fix that looks like a valid workaround.
On your second question about the preference, the idea is the following: there may be several UserInfo instances with a different set of attributes related to a single User, maybe because that user was involved in multiple data publications. In that case, we would need to know which set of attributes to use. In the context of an investigation or a data publication this is defined by the relation between UserInfo and Investigation or DataPublicationUser respectively. But in the general case, we would need some means to know which UserInfo is the preferred one. For instance, the web user interface may need to display the user's name and thus needs to pick one out of the possibly multiple name variants, or if it needs to send an e-mail to the user, it needs to know which address to use if there is more than one. The preference attribute is a very simple way to tag the preferred UserInfo instance: pick the one with the highest preference value. How to select this one is shown in the third query example above. The preference value is arbitrary and has no other meaning than to make this distinction.
How to manage that preference attribute is left to the local site to decide. Different policies are conceivable. I provide just two examples in Python syntax. Let's assume you define a function new_user_info() to set a new set of user's attributes. (To simplify things, I only consider email and fullName.) Using the current schema, it might look like:
def new_user_info(client, user, email=None, fullName=None):
    # Current schema: the attributes live directly on the User object.
    user.email = email
    user.fullName = fullName
    user.update()
One option to manage the preference might be based on the assumption that the set of attributes added last is always the preferred one. In that case, we would need to create new UserInfo instances with a preference higher than any existing ones:
import icat
from icat.query import Query

def new_user_info(client, user, email=None, fullName=None):
    # Find the highest preference among the user's existing UserInfo objects.
    try:
        query = Query(client, "UserInfo",
                      conditions={"user.id": "= %d" % user.id},
                      attribute='preference',
                      order=[('preference', 'DESC')],
                      limit=(0, 1))
        prev_pref = client.assertedSearch(query)[0]
    except icat.SearchAssertionError:
        # No UserInfo exists yet for this user.
        prev_pref = 0
    info = client.new("userInfo", user=user, preference=prev_pref+1,
                      email=email, fullName=fullName)
    info.create()
    return info
Another option might be to keep one dedicated preferred UserInfo instance that should always remain the same. That instance would be created once for each user with a high preference, let's say 1000. Adding a new UserInfo that does not interfere with that existing instance only requires setting an arbitrary preference that is unique and smaller than 1000:
import icat
from icat.query import Query

def new_user_info(client, user, email=None, fullName=None):
    # Only consider preferences below the reserved value 1000.
    try:
        query = Query(client, "UserInfo",
                      conditions={"user.id": "= %d" % user.id,
                                  "preference": "< 1000"},
                      attribute='preference',
                      order=[('preference', 'DESC')],
                      limit=(0, 1))
        prev_pref = client.assertedSearch(query)[0]
    except icat.SearchAssertionError:
        # No UserInfo below the reserved preference exists yet.
        prev_pref = 0
    info = client.new("userInfo", user=user, preference=prev_pref+1,
                      email=email, fullName=fullName)
    info.create()
    return info
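The two policies differ only in how the next preference value is computed, so that part can be sketched and tested without an ICAT server. The following is a rough illustration in plain Python operating on a list of existing preference values. The function names and the guard against reaching the reserved value are my own assumptions, not part of the proposal.

```python
RESERVED_PREFERENCE = 1000  # value assumed for the dedicated preferred UserInfo


def next_pref_latest_wins(existing):
    """Policy 1: the newest UserInfo gets a preference above all others."""
    return max(existing, default=0) + 1


def next_pref_below_reserved(existing, reserved=RESERVED_PREFERENCE):
    """Policy 2: stay below the reserved preference of the fixed instance."""
    below = [p for p in existing if p < reserved]
    candidate = max(below, default=0) + 1
    if candidate >= reserved:
        # Hypothetical guard: the range below the reserved value is used up.
        raise ValueError("no free preference value below %d" % reserved)
    return candidate


print(next_pref_latest_wins([1, 2, 1000]))
print(next_pref_below_reserved([1, 2, 1000]))
```

With the second policy, the instance holding the reserved preference keeps winning an ORDER BY preference DESC selection, no matter how many other UserInfo objects are added later.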
Minor comment on this issue: in line with other models/schemas, I would name Affiliation as Organisation, as the PID will come from ROR or another similar service.
Not sure if I agree. The Affiliation table is meant to represent the affiliation string for a particular user in a particular context, such as in a given publication. It will often not be the top level organization, but rather include the department and street address.
To illustrate, we have at present the following distinct affiliation entries for HZB in our data publications:
Helmholtz-Zentrum Berlin für Materialien und Energie, Hahn-Meitner-Platz 1, 14109 Berlin, Germany
Helmholtz-Zentrum Berlin für Materialien und Energie, Albert-Einstein-Str. 15, 12489 Berlin, Germany
Institute of Applied Materials, Helmholtz-Zentrum Berlin für Materialien und Energie, Hahn-Meitner-Platz 1, 14109 Berlin, Germany
Optics for Solar Energy, Helmholtz-Zentrum Berlin für Materialien und Energie, Albert-Einstein-Straße 16, 12489 Berlin
Each of these entries is about the same organisation, HZB, and would have the same ROR, namely HZB's. Still, the name attribute would differ. So, this is not a table of organisations, but rather a table of affiliation strings being used in a given context.
But I do agree that the discussion on that table's name is a minor issue.
As pointed out above, this has been proposed to address concerns that were raised in the discussion of introducing DataPublication to the schema three years ago. Since then, these concerns have found a different solution. The underlying issue, that we need to keep a snapshot of some user attributes that may potentially change with time while still keeping the identity of the user, has been solved for the special case of data publications.
There is no current need for the UserInfo that was proposed here. And since this would be a rather disruptive change, we should not do it without a good reason. I thus propose to close this issue as wontfix for the time being. Still, we may revive the issue should we ever need to face that underlying issue in the general case.
Please note: for the moment, this is more an idea that would need thorough discussion, rather than a fully elaborated proposal. In particular because it would be disruptive.
In #200, I proposed schema extensions to support data publication workflows. This proposal addressed the particular use case of making curated data publications. It included a new entity DataPublicationUser that was needed because “affiliations and names of users may change over time. That is why we need to keep track of affiliation and name at the time of the publication. Relying on the corresponding attributes of User only may not be enough.”
There was a longer discussion in that issue whether a DataPublication should be related to a DataCollection as proposed or to an Investigation. There was no conclusion, but I believe what @dfq16044 actually wanted to achieve was to get certain features out of this proposal to be applicable also to the different use case of simply minting DOIs for investigations instead. One of the features she wanted to retain for that other use case was the ‘freezing’ of user attributes at the time of minting the DOI.
Now, if one thinks about this more deeply, I believe the core of the problem is the double nature of the User class in our current schema. On the one hand, we need it to represent a person, in order to provide a means to authenticate that person, to manage access privileges and to record relations to other entities such as investigations. On the other hand, we use it to store attributes like givenName, familyName, email, and affiliation of that person that are to be used in a particular context. While the person always remains the same and we need a persistent record of that person, the attributes are subject to change and may need to be different according to the context. Until now, we were only interested in the current state and the latest values for these attributes, so this did not become a problem. The data publication was the first use case that needed to keep a record of the attributes at some particular moment in time. The discussion revealed that this is not the only case where this is needed and I suspect more use cases will come in the future. So it seems we need to address this issue fundamentally.
The idea for a solution would be to separate the two purposes of the User class: keep User to represent the person and add another class UserInfo for the potentially volatile attributes of that person. More than one UserInfo instance may exist for the same User, but each UserInfo only relates to one User. At the same time, there is another issue with affiliation: an author of a publication may need to have more than one affiliation in that publication. So the proposal suggests the following classes:
User
A user of the facility
Constraint: name
Relationships:
Other fields:
UserInfo
Attributes of a user in a context
Constraint: user, preference
Relationships:
Other fields:
Affiliation
The home institute or other affiliation of a user
Constraint: user, name
Relationships:
Other fields:
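As a rough illustration of the relation between User and UserInfo and of the (user, preference) constraint, here is a minimal in-memory sketch in Python. It is not the ICAT entity definition: the add_info() helper is hypothetical, and the attribute set of UserInfo is reduced to fullName and email as an assumption, since the complete field lists are not reproduced here. Affiliation is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class User:
    """The person: identity only, no volatile attributes."""
    name: str
    infos: List["UserInfo"] = field(default_factory=list)

    def add_info(self, preference, **attrs):
        """Attach a new UserInfo, enforcing uniqueness of (user, preference)."""
        if any(ui.preference == preference for ui in self.infos):
            raise ValueError("preference %d already used for %s"
                             % (preference, self.name))
        info = UserInfo(user=self, preference=preference, **attrs)
        self.infos.append(info)
        return info


@dataclass
class UserInfo:
    """Attributes of a user in a context; many UserInfo per User."""
    # Excluded from repr and comparison to avoid following the back reference.
    user: User = field(repr=False, compare=False)
    preference: int = 0
    fullName: Optional[str] = None  # assumed attribute, taken from the examples
    email: Optional[str] = None     # assumed attribute, taken from the examples


jdoe = User("db/jdoe")
jdoe.add_info(1, fullName="J. Doe", email="jdoe@example.org")
jdoe.add_info(2, fullName="Jane Doe")
print(len(jdoe.infos))
```

Each UserInfo points back to exactly one User, while a User may collect any number of UserInfo objects as long as their preference values differ.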
The user relations in InvestigationUser, Study, and InstrumentScientist would be changed to point to a UserInfo accordingly. Obviously, the DataPublicationUser implemented in #232 would need to be amended as follows:
DataPublicationUser
Author, e.g. creator of or contributor to a data publication
Constraint: publication, user, contributorType
Relationships:
Other fields: