Make the relationship between Sample and Investigation many-to-many

RKrahl commented 4 years ago

We have the Sample class in our schema to represent "a sample to be used in an investigation". It has many-to-one relationship with Investigation, e.g. each sample must be related to one and only one investigation.

While this is certainly suitable for most situations, it may be problematic in some cases: if one individual sample has been the subject of more then one investigation, there is currently no way to properly represent this in our schema. I'd like to remind that we occasionally also have rather prominent samples such as archeological artifacts or pieces of art and it is not so improbable that the same item might be investigated more then once by independent groups using different techniques trying to answer unrelated questions.

I therefore suggest to add the following class to the schema:

`InvestigationSample`

Represents a many-to-many relationship between an investigation and a sample that has been used in that investigation

Constraint: investigation, sample

Relationships:

Card	Class	Field
1,1	Investigation	investigation
1,1	Sample	sample

As a result, the samples relationship in Investigation would be changed to have the type InvestigationSample rather then Sample. The Sample class would need to be modified as follows:

`Sample`

A sample to be used in one or more investigations

Constraint: facility, name

Relationships:

Card	Class	Field
1,1	Facility	facility
0,*	Dataset	datasets
0,1	SampleType	type
0,*	SampleParameter	parameters
0,*	InvestigationSample	investigationSamples

Other fields:

Field	Type	Description
name	String[255] NOT NULL
pid	String[255]	A persistent identifier attributed to this sample

E.g. the many-to-one relation with Investigation would be replaced by a one-to-many relation with InvestigationSample and I suggest to add a relationship with Facility just for the sake of consistency with the current schema where most things are related directly or indirectly to the facility on top. The uniqueness constraint would need to be adapted as well. The attributes of Sample would remain unchanged.

kevinphippsstfc commented 4 years ago

I agree with your proposed changes except in the area of constraints. I don't think a Sample needs to have any constraints. Regarding a new constraint on Facility, I agree that most other ICAT entities have this but this is (mostly!) because that makes sense. With an archaeological sample, for example, wouldn't it make sense that the sample might be taken to a number of facilities for analysis and is therefore not tied to any of them? I can see that this is unlikely to ever happen in a single ICAT, but nonetheless the data model should be right and we should allow for that possibility. That just leaves name as a constraint and there are likely to be duplicates of the names that people give to their samples (which is fine) so that constraint is not required either.

RKrahl commented 4 years ago

Interesting point!

In fact, I suggested this constraint not because it would be particularly sensible, but in the lack of any better solution. With the relation to investigation changed to be many-to-many, we can't have investigation any more in the constraint, so I removed it from the existing constraint and that's it.

In general, it is beneficial to have a constraint in all entity types, because it allows to reliably refer to an object in ICAT based on attribute values. That makes many things easier, including serializing and deserializing ICAT content.

Regarding the facility, that is a strange thing anyway. The schema imposes to have one facility object in each ICAT, but it does not serve any particular purpose. I assume there is no production ICAT having more then one facility object. Most entity classes directly or indirectly depend on the facility and those that relate directly to it also have it in the constraint. So I also kept the facility here, just for the sake of consistency with the remaining schema.

But I do appreciate your argument. And I admit that I'm not very happy with that constraint either.

RKrahl commented 4 years ago

Maybe an alternative solution would be to take the opportunity to make a clear cut: set the constraint to be just pid. After all, reliably identify things, even across repositories, that is what PIDs have been invented for. We might just start to use them seriously.

The only drawback I can see is that we would need to change the pid attribute to be NOT NULL and force every site to set some unique value. Of course it would be best to register real PIDs such as an IGSN in the case of samples. But if the site prefers not to do that, any unique value will do. An upgrade script could just set something like local:<id> (e.g. reusing the id attribute). That would work for the time being and could gradually be replaced by real PIDs later on.

stuartpullinger commented 4 years ago

I agree with @kevinphippsstfc. I was also thinking along the same lines as @RKrahl to use the pid. This seems sensible and to fall back on using the id attribute is a good idea.

kevinphippsstfc commented 4 years ago

I agree that if you are using PIDs or want to start using PIDs for Samples then this is a sensible approach. I can also see that getting an upgrade script to assign the ID attribute to the PID will work.

However, what happens for creating Samples once the upgrade has been done? Every piece of software creating samples is forced to implement something to create this (unique) field. This can of course be done and could be as simple as just calling a library method to generate a UUID, but it is not as simple as doing nothing, where your software continues to work because PID is an optional field that you don't bother setting. Or am I missing something obvious?

RKrahl commented 4 years ago

No I don't think you are missing something. That is indeed a drawback of using this constraint. The benefit is that we have a way to reliably refer to sample objects in the ICAT later on.

RKrahl commented 4 years ago

… in other words: either way you will break some piece of software. Either you break the software creating samples that is relying upon not having to bother setting the pid attribute, or you break software that is relying upon samples to have a constraint that can be used to reliably identify them. I'd argue, the former case is easier to work around then the latter.

dfq16044 commented 4 years ago

We may have the following case:

some samples may have a defined PID
other may not have any PID. If we need to fill the PID column, then we will not to create an 'artificial' PID.

Then how do you distinguish between the two? If we can set PID to null then the second case will be quite easy to identify as it is empty.

RKrahl commented 4 years ago

Then how do you distinguish between the two? If we can set PID to null then the second case will be quite easy to identify as it is empty.

It would make sense in any case to prefix a PID with a scheme, e.g. to set values like doi:10.1234/abcd, handle:20.1000/100, igsn:ESNFZDVHICBD, or uuid:d4d712a3-7444-4e4c-a14d-edcb833bdaa0. That would make it much easier to resolve them, even for official, well defined PIDs. If you do this systematically, for all pid values, it does not create any ambiguity, you can easily distinguish scheme prefix and proper pid value with

pid_scheme, pid_value = sample.pid.split(':', maxsplit=1)

If you do this, you only need to define a dedicated scheme prefix for local, 'artificial' PIDs, such as local:4711 and you are done.

dfq16044 commented 4 years ago

In this case, does this make sense to have a field called PIDScheme or similar for example like IdentifierScheme used in DataCite (for Person PID, organization PID). This will be may be clearer that the we need to specify which scheme we are using.

RKrahl commented 4 years ago

In this case, does this make sense to have a field called PIDScheme or similar for example like IdentifierScheme used in DataCite (for Person PID, organization PID). This will be may be clearer that the we need to specify which scheme we are using.

I don't think so. It would make things more complicated and I don't see an advantage over my suggestion to use a prefix.

dfq16044 commented 4 years ago

Well, from my experience free text field is very problematic in production systems. Having a way to specify how this field should be used would help. For example, the PID may not be spelled correctly and it will become very difficult to find the issue.

dfq16044 commented 4 years ago

I think this schema change will need more internal discussion within DLS. We are now reviewing the sample information workflow within DLS and we were already thinking on defining a PID for sample to be used as a reference across our local systems. My concern about PID is that by definition it represents a persistent identifier and not an unique identifier. Let's give an example: a User may have a OrcidID, a ResearcherID and a ScopusID that all points to the same user. Here we may have multiple PIDs referring to the sample: a local PID, a PID from a specific sample database. While we may have control over the local PID, for external PIDs this is more difficult. As mentioned in our discussion yesterday, we may have multiple experiments within DLS with the sample but locally this might have different PIDs (for example in the UAS sample might be assigned to the proposal). Later a user decide to use an external sample database and we want to be able to link those samples to the this PID. This will result on merging records but we will loose the reference to the original local PID.

RKrahl commented 5 months ago

Update: the current proposal as implemented in #294 is:

Add the following class:

`InvestigationSample`

Represents a many-to-many relationship between an investigation and a sample that has been used in that investigation

Constraint: investigation, sample

Relationships:

Card	Class	Field
1,1	Investigation	investigation
1,1	Sample	sample

Change Sample to:

`Sample`

A sample to be used in one or more investigations

Constraint: pid

Relationships:

Card	Class	Field
0,*	Dataset	datasets
0,1	SampleType	type
0,*	SampleParameter	parameters
0,*	InvestigationSample	investigationSamples

Other fields:

Field	Type	Description
name	String[255] NOT NULL
pid	String[255] NOT NULL	A persistent identifier attributed to this sample

Obviously: change the one-to-many relationship in Investigation from Sample to InvestigationSample.

RKrahl commented 4 months ago

As discussed in the ICAT Schema Discussion on April 2nd and in the collaboration meeting today, this change should be in a version 7.0 release that we aim to make in the second half of this year.

icatproject / icat.server