Open peterdesmet opened 2 years ago
I'm a bit lost here! I just checked and it seems to me there are no duplicates in the database. Is there any output (download or API) that has duplicated records?
I'm having a bit of trouble to see what is the issue here? When we output the DatarightsHolder we put it using the character '~'
Sorry, if I'm missing the obvious.
I just checked and it seems to me there are no duplicates in the database.
Good, I'll investigate #135 further
This is not about duplicate records. It is about avoiding unexpected results when users are left joining data:
campaigns AS c
LEFT JOIN samples AS s
ON c.campaignID = s.campaignID
LEFT JOIN positions AS p
ON s.sampleID = p.sampleID
LEFT JOIN observations AS o
ON p.positionID = o.positionID
One would expect to create a full denormalized (flat) table with all data this way. However, it is very likely more rows will be created than expected, because identifiers are only unique within a certain set. Note that the current data in the database might not reflect this issue.
So a better query would be:
campaigns AS c
LEFT JOIN samples AS s
ON c.dataRightsHolder = s.dataRightsHolder
AND c.campaignID = s.campaignID
LEFT JOIN positions AS p
ON s.dataRightsHolder = p.dataRightsHolder
AND s.campaignID = p.campaignID
AND s.sampleID = p.sampleID
LEFT JOIN observations AS o
ON p.dataRightsHolder = o.dataRightsHolder
AND p.campaignID = o.campaignID
AND p.sampleID = o.sampleID
ON p.positionID = o.positionID
Please correct me if I'm wrong.
I am not against adding all parent identifiers to the download tables... But correct me if I am wrong, even then users can still make the same incorrect joins, no? (when only joining on one identifier) I guess the point is that they will be less inclined to do so?
But correct me if I am wrong, even then users can still make the same incorrect joins, no? (when only joining on one identifier)
Indeed. The issue is that currently the cannot might the correct joins, because not all identifiers are provided.
@peterdesmet I agree with you. The download is incomplete! I can add the internal ID's of the database to all the files. That should make it clear and enough? So make it closer to what is the API.
https://esas.ices.dk/API/Download?DownloadID=973d90f7-6f33-4452-bb6b-5801768e89e7
the problem currently is that if two countries decide to have the same Campaign then there is a problem to join the files!
I need to think about this a bit more. One of the issues with the internal IDs is that one cannot query for them.
Also, in the API you would expect:
all_campaigns = getCampaigns
all_obs = empty
for (campaign in all_campaigns) {
obs = getObservations?campaignID=campaign # This can return observations from multiple campaigns
all_obs = all_obs + obs
)
obs # This can contain duplicates
Maybe we need stricter rules on the identifiers, e.g. a data rights holder cannot reuse a campaignID
and all sampleID
, positionID
and observationID
within a campaign need to be unique. I'll have a look in the data and see what is feasable.
@nicolasvanermen how comfortable are you with requiring people to upload data with a campaignID
that is unique for the whole database? This was previously done by reserving "blocks" of identifiers, but could also be done by prepending identifiers (in future submissions) e.g. INBO:123
.
I think this makes perfect sense, as this is the case in the database already. An acronym prefix makes sense as well.
campaignID
are uniquesampleID
are uniquepositionID
are uniqueobservationID
are unique (291,225 duplicates, such as 100001
)Given that, I would suggest (@nicolasvanermen can you confirm if you agree):
campaignID
is unique across the whole database (see https://github.com/ices-tools-dev/esas/issues/136#issuecomment-1305790507)sampleID
, positionID
and observationID
are unique within a campaignID even though most are currently unique across a wider scope. The alternative is to ask that sampleID
and positionID
are also unique across the whole database (currently the case), which would make it easier for users, but that's maybe too big of an ask for providers?The result is that people can safely join tables, as long as they include campaignID
in their join:
campaigns AS c
LEFT JOIN samples AS S
ON s.campaignID = c.campaignID
LEFT JOIN positions AS p
ON p.sampleID = s.sampleID
AND p.campaignID = c.campaignID
LEFT JOIN observations AS o
ON o.positionID = p.positionID
AND o.campaignID = c.campaignID
These changes require no migration of data (i.e. all current data already meet the requirements).
campaignID
is submitted that already exists in the database in combination with a different dataRightsHolder. Note campaignID
in combination with the same dataRightsHolder still indicates a resubmission.sampleID
, positionID
or observationID
are not unique in combination with the campaignID
. Note that this can be verified within the file (does not depend on data in the database).campaignID
)campaignID, sampleID
)campaignID
to results, right before sampleID
(campaignID, sampleID, positionID
). It is already a query parameter.campaignID
(and for good measure sampleID
) to results, right before positionID
(campaignID, sampleID, positionID, observationID
). Both are already query parameters.campaignID
)campaignID, sampleID
)campaignID
to results, right before sampleID
(campaignID, sampleID, positionID
)campaignID
(and for good measure sampleID
) to results, right before positionID
(campaignID, sampleID, positionID, observationID
)@nicolasvanermen I noticed you checked of some things in my issue. Does that mean that you prefer sampleID
and positionID
to only be unique within a campaign? Or do you think we can ask these to be unique across the whole database (currently the case) as well (which would simplify it for users)?
@peterdesmet sorry to solve this issue only now. A quick comment, in the webAPI there is no issue, because the records have the identifier:
so if two organizations submit the same campaignID, that will not be an issue (for people that use the API) Do you still want us to add CampaignID to the results?
About the download, changes are done:
@cmspinto yes, I would still like the changes to the API 1) so users or implementations don't have to rely on the tbl
fields and 2) so the API result is almost interchangeable with CSV result
so if two organizations submit the same campaignID, that will not be an issue (for people that use the API)
Note that we won't allow this at submission.
@nicolasvanermen I noticed you checked of some things in my issue. Does that mean that you prefer
sampleID
andpositionID
to only be unique within a campaign? Or do you think we can ask these to be unique across the whole database (currently the case) as well (which would simplify it for users)?
This is not a preference, rather a pragmatic approval. Of course it would be ideal that all keys are unique within the database... But in that case we should propose a specific 'system', for example through a data provider specific number or letter combination as prefix for each key...?
@nicolasvanermen, the platform is not affected by the users uploading the same CampaignID in distinct downloads. I think we are doing this only to make sure that when users download data they link distinct files correctly.
The QC check is here: That CampaingID was already used by another organization, please use another one.
All changes are in place now.
@cmspinto great to see that the check is in place and that CampaignID
s are now required to be unique across the database.
@nicolasvanermen as discussed between us, let's use the pragmatic approach and only require CampaignID
s to be unique across the database. Other IDs have to be unique within a CampaignID
s. All joins will have to take CampaignID
into account and should therefore be included in all CSV and API results.
OK, I will check this, by uploading a record of JNCC with a campaignID from INBO...
The duplicate campaignID is filtered out OK, but the message says: Cross field check (condition not met)
This seems to uninformative... Can the message be aligned with the check description ("That CampaingID was already used by another organization, please use another one.")?
All keys are now returned in the download, can this issue be closed @peterdesmet ?
Yes, they are also all returned by the API and all issues are resolved at https://github.com/ices-tools-dev/esas/issues/136#issuecomment-1307106286. I will update the documentation and close this issue.
Now clarified in the documentation: 80cc9a7
@cmspinto or @Osanna123 is it possible to update some of the descriptions in http://datsu.ices.dk/web/selRep.aspx?Dataset=148 to reflect the changes in 80cc9a7?
Typo in the screening message: "That CampaignID..." (instead of CampaingID)
And while your are at it... "This CampaignID..." sounds better :-)
Thanks, only remaining item is the documentation fix mentioned in https://github.com/ices-tools-dev/esas/issues/136#issuecomment-1326314607
Can we close this one?
No, there’s a remaining action item for you. The DATSU definitions should be updated. See https://github.com/ices-tools-dev/esas/issues/136#issuecomment-1326314607
I'm not sure this is currently an issue in the database, but it might become one:
Identifier requirements
campaignID
: should only be unique within aDataRightsHolder
sampleID
: should only be unique within acampaignID
positionID
: should only be unique within asampleID
observationID
: should only be unique within apositionID
Note that even within a single file submissions, identifiers are not expected to be unique, only within their respective parent. @nicholasvanermen @cmspinto can you confirm this?
Therefore:
Are all different records.
The issue
The CSV files that are returned, only list the direct parent identifier. So, if one would download the above data:
If the user would create joins between the file, they would get incorrect data. There is no way for the user to prevent this.
Solution 1 (does not work)
It is tempting to think that adding a
fileSubmissionID
(below230
and240
) to each CSV file solves the issue, but it does not:Solution 2
I think the only way to solve it is to add the data rightsholder and all the parent identifiers in all the CSV files. This is quite verbose, but at least straightforward and it solves incorrect joins:
API
In the API, there is already a way to prevent incorrect joins with:
However, I would also add the dataRightsHolder and all parent identifiers to the API, so the same joins can be used irrespective of CSV or API use.
@nicolasvanermen @cmspinto thoughts?