humlab-sead / sead_change_control

Sane SEAD change control using Sqitch.
1 stars 0 forks source link

Major error detected for reference column in tbl_datasets #79

Closed Loer0022 closed 9 months ago

Loer0022 commented 4 years ago

When going through tbl_biblio to check that publications are present and correct it was detected that the connections made for Dendro to dataset references in tbl_datasets contains a major error. This has resulted in that the Dendro data often having MAL or sometimes a Bugs references as well and Dendro references. Considering that the dataset is newer that the MAL dataset it shouldn’t contain older data.

If going through e.g. Site 2043 (Kalmar 93:2 - https://supersead.humlab.umu.se/site/2043), all data contained is correct but when looking at biblio_id in tbl_datasets it has a MAL reference. I'm not quite sure how to explain it but it appears that starting from site 2034 that maybe there was an issue that occured when reading primary keys when the data was ingested that resulted in some sites using low biblio_id in the 1-100s instead of the actuall ID's that should be somewhere between 6100-300.

Here's an querry that looks at Dendro sites and what dataset they are using. You will see a clear cut starting with site 2034 where MAL and Bugs references start being used.

SELECT *
FROM tbl_datasets d
    JOIN tbl_analysis_entities ae ON ae.dataset_id = d.dataset_id
    JOIN tbl_physical_samples ps ON ps.physical_sample_id = ae.physical_sample_id
    JOIN tbl_sample_groups sg ON sg.sample_group_id= ps.sample_group_id
WHERE sg.site_id BETWEEN 1980 AND 3400
ORDER BY sg.site_id, d.biblio_id

image

MattiasSealander commented 4 years ago

This is likely a miss in the numerous re-edits of the original excel-file and only for the archaeological data (hence the clear cut off). Need to double check whether my local file is the same as the one used in XML generation, but the local file has the previous local system_id for the bibliography where new references started from 1 and iterated upwards. Hence, why MAL references are used.

Biblio was later changed to be imported with all lookup data before the remaining data giving them a SEAD id, and the change in IDs wasn't transferred by myself to the dataset table in the excel-file.

This can be solved by looking up their new IDs and then making a SQL update. I will set aside time to do this.

MattiasSealander commented 4 years ago

UPDATE_biblio_id_dendro_datasets.txt

Attached is a text file with UPDATE SQL. Since the issue was simply a failure in updating the biblio_id to their new SEAD ID it can be fixed by using an UPDATE with a WHERE clause which targets based on master_set_id and current biblio_id.

master_set_id ensures that only dendro data is affected, and each biblio_id is then corrected with one UPDATE each.

The issue was found only in the archaeological data, so there shouldn't be a risk of possible confusion between the building/archaeo datasets since only archaeological data should have a biblio_id between 1-55.

MattiasSealander commented 9 months ago

With the new import of Dendro data this issue is resolved and quality check of new data will simply need to double check any errors of this nature