Always return `CampaignID` and make it unique across database

peterdesmet commented 2 years ago

I'm not sure this is currently an issue in the database, but it might become one:

Identifier requirements

campaignID: should only be unique within a DataRightsHolder
sampleID: should only be unique within a campaignID
positionID: should only be unique within a sampleID
observationID: should only be unique within a positionID

Note that even within a single file submissions, identifiers are not expected to be unique, only within their respective parent. @nicholasvanermen @cmspinto can you confirm this?

Therefore:

dataRightsHolder,campaignID,sampleID,positionID,observationID
A               ,C1        ,S1      ,P1        ,O1
B               ,C1        ,S1      ,P1        ,O1
B               ,C1        ,S2      ,P1        ,O1
B               ,C2        ,S1      ,P1        ,O1

Are all different records.

The issue

The CSV files that are returned, only list the direct parent identifier. So, if one would download the above data:

# campaigns.csv
A,C1
B,C1
B,C2

# samples.csv
C1,S1
C1,S1 <- looks like a duplicate, but is from a different Data Rights Holder
C1,S2
C2,S1

## positions.csv
S1,P1
S1,P1 <- looks like a duplicate, but is from a different Data Rights Holder
S2,P1
S1,P1 <- looks like a duplicate, but is from a different campaign

## observations.csv
P1,O1
P1,O1 <- looks like a duplicate, but is from a different Data Rights Holder
P1,O1 <- looks like a duplicate, but is from a different campaign and sample
P1,O1 <- looks like a duplicate, but is from a different sample

If the user would create joins between the file, they would get incorrect data. There is no way for the user to prevent this.

Solution 1 (does not work)

It is tempting to think that adding a fileSubmissionID (below 230 and 240) to each CSV file solves the issue, but it does not:

# campaigns.csv
239,A,C1
240,B,C1
240,B,C2

# samples.csv
239,C1,S1
240,C1,S1 <- no longer a duplicate
240,C1,S2
240,C2,S1

## positions.csv
239,S1,P1
240,S1,P1 <- no longer a duplicate
240,S2,P1
240,S1,P1 <- still looks like a duplicate, but is from a different campaign

## observations.csv
239,P1,O1
240,P1,O1 <- no longer a duplicate
240,P1,O1 <- still looks like a duplicate, but is from a different campaign and sample
240,P1,O1 <- still looks like a duplicate, but is from a different sample

Solution 2

I think the only way to solve it is to add the data rightsholder and all the parent identifiers in all the CSV files. This is quite verbose, but at least straightforward and it solves incorrect joins:

# campaigns.csv
A,C1
B,C1
B,C2

# samples.csv
A,C1,S1
B,C1,S1 <- no longer a duplicate
B,C1,S2
B,C2,S1

## positions.csv
A,C1,S1,P1
B,C1,S1,P1 <- no longer a duplicate
B,C1,S2,P1
B,C2,S1,P1 <- no longer a duplicate

## observations.csv
A,C1,S1,P1,O1
B,C1,S1,P1,O1 <- no longer a duplicate
B,C1,S2,P1,O1 <- no longer a duplicate
B,C2,S1,P1,O1 <- no longer a duplicate

API

In the API, there is already a way to prevent incorrect joins with:

"tblUploadID": 234,
"tblCampaignID": 209487,
...

However, I would also add the dataRightsHolder and all parent identifiers to the API, so the same joins can be used irrespective of CSV or API use.

@nicolasvanermen @cmspinto thoughts?

cmspinto commented 2 years ago

I'm a bit lost here! I just checked and it seems to me there are no duplicates in the database. Is there any output (download or API) that has duplicated records?

I'm having a bit of trouble to see what is the issue here? When we output the DatarightsHolder we put it using the character '~'

Sorry, if I'm missing the obvious.

peterdesmet commented 2 years ago

I just checked and it seems to me there are no duplicates in the database.

Good, I'll investigate #135 further

This is not about duplicate records. It is about avoiding unexpected results when users are left joining data:

campaigns AS c
  LEFT JOIN samples AS s
    ON c.campaignID = s.campaignID
  LEFT JOIN positions AS p
    ON s.sampleID = p.sampleID
  LEFT JOIN observations AS o
    ON p.positionID = o.positionID

One would expect to create a full denormalized (flat) table with all data this way. However, it is very likely more rows will be created than expected, because identifiers are only unique within a certain set. Note that the current data in the database might not reflect this issue.

So a better query would be:

campaigns AS c
  LEFT JOIN samples AS s
    ON c.dataRightsHolder = s.dataRightsHolder
    AND c.campaignID = s.campaignID
  LEFT JOIN positions AS p
    ON s.dataRightsHolder = p.dataRightsHolder
    AND s.campaignID = p.campaignID
    AND s.sampleID = p.sampleID
  LEFT JOIN observations AS o
    ON p.dataRightsHolder = o.dataRightsHolder
    AND p.campaignID = o.campaignID
    AND p.sampleID = o.sampleID
    ON p.positionID = o.positionID

Please correct me if I'm wrong.

nicolasvanermen commented 2 years ago

I am not against adding all parent identifiers to the download tables... But correct me if I am wrong, even then users can still make the same incorrect joins, no? (when only joining on one identifier) I guess the point is that they will be less inclined to do so?

peterdesmet commented 2 years ago

But correct me if I am wrong, even then users can still make the same incorrect joins, no? (when only joining on one identifier)

Indeed. The issue is that currently the cannot might the correct joins, because not all identifiers are provided.

cmspinto commented 2 years ago

@peterdesmet I agree with you. The download is incomplete! I can add the internal ID's of the database to all the files. That should make it clear and enough? So make it closer to what is the API.

https://esas.ices.dk/API/Download?DownloadID=973d90f7-6f33-4452-bb6b-5801768e89e7

the problem currently is that if two countries decide to have the same Campaign then there is a problem to join the files!

peterdesmet commented 2 years ago

I need to think about this a bit more. One of the issues with the internal IDs is that one cannot query for them.

Also, in the API you would expect:

all_campaigns = getCampaigns
all_obs = empty
for (campaign in all_campaigns) {
  obs = getObservations?campaignID=campaign # This can return observations from multiple campaigns
  all_obs = all_obs + obs
)
obs # This can contain duplicates

Maybe we need stricter rules on the identifiers, e.g. a data rights holder cannot reuse a campaignID and all sampleID, positionID and observationID within a campaign need to be unique. I'll have a look in the data and see what is feasable.

peterdesmet commented 2 years ago

@nicolasvanermen how comfortable are you with requiring people to upload data with a campaignID that is unique for the whole database? This was previously done by reserving "blocks" of identifiers, but could also be done by prepending identifiers (in future submissions) e.g. INBO:123.

nicolasvanermen commented 2 years ago

I think this makes perfect sense, as this is the case in the database already. An acronym prefix makes sense as well.

peterdesmet commented 2 years ago

Current situation in the database

campaigns: 24888 records, all campaignID are unique
samples: 25309 records, all sampleID are unique
positions: 1656728 records, all positionID are unique
observations: 2957702 records, not all observationID are unique (291,225 duplicates, such as 100001)

Requirements

Given that, I would suggest (@nicolasvanermen can you confirm if you agree):

[x] That campaignID is unique across the whole database (see https://github.com/ices-tools-dev/esas/issues/136#issuecomment-1305790507)
[x] That sampleID, positionID and observationID are unique within a campaignID even though most are currently unique across a wider scope. The alternative is to ask that sampleID and positionID are also unique across the whole database (currently the case), which would make it easier for users, but that's maybe too big of an ask for providers?

The result is that people can safely join tables, as long as they include campaignID in their join:

campaigns AS c
  LEFT JOIN samples AS S
    ON s.campaignID = c.campaignID
  LEFT JOIN positions AS p
    ON p.sampleID = s.sampleID
    AND p.campaignID = c.campaignID
  LEFT JOIN observations AS o
    ON o.positionID = p.positionID
    AND o.campaignID = c.campaignID

These changes require no migration of data (i.e. all current data already meet the requirements).

Implementation (@Osanna123 @cmspinto)

[x] When uploading, QA should return an error if a campaignID is submitted that already exists in the database in combination with a different dataRightsHolder. Note campaignID in combination with the same dataRightsHolder still indicates a resubmission.
[x] When uploading, QA should return an error if sampleID, positionID or observationID are not unique in combination with the campaignID. Note that this can be verified within the file (does not depend on data in the database).
[x] API:
- [x] getCampaignRecords: no changes (campaignID)
- [x] getSampleRecords: no changes (campaignID, sampleID)
- [x] getPositionRecords: add campaignID to results, right before sampleID (campaignID, sampleID, positionID). It is already a query parameter.
- [x] getObservationRecords: add campaignID (and for good measure sampleID) to results, right before positionID (campaignID, sampleID, positionID, observationID). Both are already query parameters.
[x] CSV download:
- [x] campaigns.csv: no changes (campaignID)
- [x] samples.csv: no changes (campaignID, sampleID)
- [x] positions.csv: add campaignID to results, right before sampleID (campaignID, sampleID, positionID)
- [x] observations.csv: add campaignID (and for good measure sampleID) to results, right before positionID (campaignID, sampleID, positionID, observationID)

peterdesmet commented 2 years ago

@nicolasvanermen I noticed you checked of some things in my issue. Does that mean that you prefer sampleID and positionID to only be unique within a campaign? Or do you think we can ask these to be unique across the whole database (currently the case) as well (which would simplify it for users)?

cmspinto commented 2 years ago

@peterdesmet sorry to solve this issue only now. A quick comment, in the webAPI there is no issue, because the records have the identifier:

so if two organizations submit the same campaignID, that will not be an issue (for people that use the API) Do you still want us to add CampaignID to the results?

About the download, changes are done:

peterdesmet commented 2 years ago

@cmspinto yes, I would still like the changes to the API 1) so users or implementations don't have to rely on the tbl fields and 2) so the API result is almost interchangeable with CSV result

so if two organizations submit the same campaignID, that will not be an issue (for people that use the API)

Note that we won't allow this at submission.

nicolasvanermen commented 2 years ago

@nicolasvanermen I noticed you checked of some things in my issue. Does that mean that you prefer sampleID and positionID to only be unique within a campaign? Or do you think we can ask these to be unique across the whole database (currently the case) as well (which would simplify it for users)?

This is not a preference, rather a pragmatic approval. Of course it would be ideal that all keys are unique within the database... But in that case we should propose a specific 'system', for example through a data provider specific number or letter combination as prefix for each key...?

cmspinto commented 2 years ago

@nicolasvanermen, the platform is not affected by the users uploading the same CampaignID in distinct downloads. I think we are doing this only to make sure that when users download data they link distinct files correctly.

The QC check is here: That CampaingID was already used by another organization, please use another one.

All changes are in place now.

peterdesmet commented 2 years ago

@cmspinto great to see that the check is in place and that CampaignIDs are now required to be unique across the database.

@nicolasvanermen as discussed between us, let's use the pragmatic approach and only require CampaignIDs to be unique across the database. Other IDs have to be unique within a CampaignIDs. All joins will have to take CampaignID into account and should therefore be included in all CSV and API results.

nicolasvanermen commented 2 years ago

OK, I will check this, by uploading a record of JNCC with a campaignID from INBO...

nicolasvanermen commented 2 years ago

The duplicate campaignID is filtered out OK, but the message says: Cross field check (condition not met)

This seems to uninformative... Can the message be aligned with the check description ("That CampaingID was already used by another organization, please use another one.")?

cmspinto commented 2 years ago

Done: http://datsu.ices.dk/web/chkDetails.aspx?DatasetID=148&RecordID=103&groupFilterID=1981

nicolasvanermen commented 2 years ago

All keys are now returned in the download, can this issue be closed @peterdesmet ?

peterdesmet commented 2 years ago

Yes, they are also all returned by the API and all issues are resolved at https://github.com/ices-tools-dev/esas/issues/136#issuecomment-1307106286. I will update the documentation and close this issue.

peterdesmet commented 2 years ago

Now clarified in the documentation: 80cc9a7

@cmspinto or @Osanna123 is it possible to update some of the descriptions in http://datsu.ices.dk/web/selRep.aspx?Dataset=148 to reflect the changes in 80cc9a7?

nicolasvanermen commented 1 year ago

Typo in the screening message: "That CampaignID..." (instead of CampaingID)

And while your are at it... "This CampaignID..." sounds better :-)

cmspinto commented 1 year ago

Done: http://datsu.ices.dk/web/rptChk.aspx?Dataset=148

peterdesmet commented 1 year ago

Thanks, only remaining item is the documentation fix mentioned in https://github.com/ices-tools-dev/esas/issues/136#issuecomment-1326314607

cmspinto commented 1 year ago

Can we close this one?

peterdesmet commented 1 year ago

No, there’s a remaining action item for you. The DATSU definitions should be updated. See https://github.com/ices-tools-dev/esas/issues/136#issuecomment-1326314607

ices-tools-dev / esas