humlab-sead / sead_change_control

Sane SEAD change control using Sqitch.
1 stars 0 forks source link

SQL update of references after DDL refactor #3

Closed roger-mahler closed 1 year ago

roger-mahler commented 5 years ago

Update is currently missing in refactor DDL. Need SQL update of new fields authors, title, full_ reference based on old fields.

visead commented 5 years ago

More explanation needed please.

roger-mahler commented 5 years ago

See updates (commented out) in 20170101_DDL_BIBLIO_REFACTOR_MODEL.

roger-mahler commented 4 years ago

The linked script refactors the old bibliographic reference model. Three new fields are added to tbl_biblio, and a number of other fields and tables are deprecated.

We need to update the new fields using the deprecated fields and tables are removed. These updates are commented out in the script, since they are not validated.

roger-mahler commented 4 years ago

Related to (or same as) #18.

visead commented 4 years ago

Love and I can look at this, I'd like to prioritize the bibliography for online in the next few months as this is critical for traceability.

Loer0022 commented 4 years ago

After scanning through tbl_biblio I discovered that *2126 posts were missing data or containing wrong year in the column “year_published”, however, the correct year are given in other columns (could this be an import issue?). This can be solved using a quick update query. This particular issue doesn’t concern the end-user who will access either authors or full reference which both contain the correct year, this is more a back-end documentation issue for the Bugs dataset references.

There is a clear difference between the Bugs and SEAD ingested references in terms of their format which cause some issue in terms of title and full reference. For example, the first 400 posts lack information about authors to the published texts, I believe some of these are MAL reports. This could therefore easily be corrected if asking the MAL staff for a list of publications that they have occasionally published online in an excel format.

A few posts have misplaced DOI and website information into the wrong columns.

visead commented 4 years ago

OK, there is a significant risk that the biblio records have not been attached to the correct datasets and sites during the import. We need to check this first, and possibly re-import all the bibliography.

There is a list of all MAL reports on the server at \MAL-data1\MAL Rapporter

Loer0022 commented 4 years ago

There are currently 469 biblio_id's without any connection to datasets, six of these are test data that remains within the database from an earlier stage I assume.

384 of these are Bugs references and 79 are SEAD references.

EDIT It appears that some of the SEAD posts that are lacking references belong to the ceramic dataset - check in with Mattias about these

Bilbio_id with no connection.xlsx


I ran this querry to check connections between posts and other tables.


--select string_agg(Chr(10), 'select ''' || table_name || ''' as table_name, biblio_id from  ' || table_name || ' UNION ')
--from clearing_house.view_foreign_keys
--where schema_name = 'public'
--  and column_name = 'biblio_id'

with all_biblio_references as (
                          select 'tbl_dataset_masters' as table_name, biblio_id from  tbl_dataset_masters UNION
                          select 'tbl_datasets' as table_name, biblio_id from  tbl_datasets UNION
                          select 'tbl_ecocode_systems' as table_name, biblio_id from  tbl_ecocode_systems UNION
                          select 'tbl_geochron_refs' as table_name, biblio_id from  tbl_geochron_refs UNION
                          select 'tbl_methods' as table_name, biblio_id from  tbl_methods UNION
                          select 'tbl_rdb_systems' as table_name, biblio_id from  tbl_rdb_systems UNION
                          select 'tbl_relative_age_refs' as table_name, biblio_id from  tbl_relative_age_refs UNION
                          select 'tbl_sample_group_references' as table_name, biblio_id from  tbl_sample_group_references UNION
                          select 'tbl_site_other_records' as table_name, biblio_id from  tbl_site_other_records UNION
                          select 'tbl_site_references' as table_name, biblio_id from  tbl_site_references UNION
                          select 'tbl_species_associations' as table_name, biblio_id from  tbl_species_associations UNION
                          select 'tbl_taxa_synonyms' as table_name, biblio_id from  tbl_taxa_synonyms UNION
                          select 'tbl_taxonomic_order_biblio' as table_name, biblio_id from  tbl_taxonomic_order_biblio UNION
                          select 'tbl_taxonomy_notes' as table_name, biblio_id from  tbl_taxonomy_notes UNION
                          select 'tbl_tephra_refs' as table_name, biblio_id from  tbl_tephra_refs UNION
                          select 'tbl_text_biology' as table_name, biblio_id from  tbl_text_biology UNION
                          select 'tbl_text_distribution' as table_name, biblio_id from  tbl_text_distribution UNION
                          select 'tbl_text_identification_keys' as table_name, biblio_id from  tbl_text_identification_keys
) select *
  from tbl_biblio b
  left join all_biblio_references a using (biblio_id)
  where a.biblio_id is null
  ORDER BY biblio_id
Loer0022 commented 4 years ago

I'm adding a document to clarify what I mean with my previous comment that when rereading was a bit confusing. But I will check a sample set of biblio records and their connections to some other datasets and sites and get back to you @visead

Tbl_biblio gitbub.xlsx

visead commented 4 years ago

Biblio test data from Bugsdata 20200302 TCountsheet.CountsheetName = Dataset name in SEAD

SELECT TBiblio.*, TSite.SiteCODE, TSite.SiteName, TSite.Country, TCountsheet.CountsheetCODE, TCountsheet.CountsheetName FROM (TSite INNER JOIN TCountsheet ON TSite.SiteCODE = TCountsheet.SiteCODE) INNER JOIN (TBiblio INNER JOIN TSiteRef ON TBiblio.REFERENCE = TSiteRef.Ref) ON TSite.SiteCODE = TSiteRef.SiteCODE; QSiteDataRefs.xlsx

SELECT TBiblio.*, TBiology.CODE, TBiology.Data FROM TBiblio INNER JOIN TBiology ON TBiblio.REFERENCE = TBiology.Ref; QRefsBiology.xlsx

SELECT TBiblio.*, TDistrib.CODE, TDistrib.Data FROM TBiblio INNER JOIN TDistrib ON TBiblio.REFERENCE = TDistrib.Ref; QRefsDistrib.xlsx

Loer0022 commented 4 years ago

I have checked all of the Bugs bibliography and they seem to check out, however, I have not been able to verify some of the newer additions but I know that you and Roger have been handling those. Unfortunately I did not have time to go through the MAL section of the bibliography or any other additions from the newly ingested datasets - these need to be a subject for QA on a later date.

I sent to Roger an update which concerns mainly minor things in tbl_biblio where the wrong year is given or where DOI or URL is contained in the title column instead of their respective columns.

This update regards ca. 1800 posts.

roger-mahler commented 1 year ago

Loves updates handled in 20221205_DML_QUALITY_CONTROL_BIBLIO_UPDATES.