Closed roger-mahler closed 1 year ago
More explanation needed please.
See updates (commented out) in 20170101_DDL_BIBLIO_REFACTOR_MODEL.
The linked script refactors the old bibliographic reference model. Three new fields are added to tbl_biblio, and a number of other fields and tables are deprecated.
We need to update the new fields using the deprecated fields and tables are removed. These updates are commented out in the script, since they are not validated.
Related to (or same as) #18.
Love and I can look at this, I'd like to prioritize the bibliography for online in the next few months as this is critical for traceability.
After scanning through tbl_biblio I discovered that *2126 posts were missing data or containing wrong year in the column “year_published”, however, the correct year are given in other columns (could this be an import issue?). This can be solved using a quick update query. This particular issue doesn’t concern the end-user who will access either authors or full reference which both contain the correct year, this is more a back-end documentation issue for the Bugs dataset references.
There is a clear difference between the Bugs and SEAD ingested references in terms of their format which cause some issue in terms of title and full reference. For example, the first 400 posts lack information about authors to the published texts, I believe some of these are MAL reports. This could therefore easily be corrected if asking the MAL staff for a list of publications that they have occasionally published online in an excel format.
A few posts have misplaced DOI and website information into the wrong columns.
OK, there is a significant risk that the biblio records have not been attached to the correct datasets and sites during the import. We need to check this first, and possibly re-import all the bibliography.
There is a list of all MAL reports on the server at \MAL-data1\MAL Rapporter
There are currently 469 biblio_id's without any connection to datasets, six of these are test data that remains within the database from an earlier stage I assume.
384 of these are Bugs references and 79 are SEAD references.
EDIT It appears that some of the SEAD posts that are lacking references belong to the ceramic dataset - check in with Mattias about these
Bilbio_id with no connection.xlsx
I ran this querry to check connections between posts and other tables.
--select string_agg(Chr(10), 'select ''' || table_name || ''' as table_name, biblio_id from ' || table_name || ' UNION ')
--from clearing_house.view_foreign_keys
--where schema_name = 'public'
-- and column_name = 'biblio_id'
with all_biblio_references as (
select 'tbl_dataset_masters' as table_name, biblio_id from tbl_dataset_masters UNION
select 'tbl_datasets' as table_name, biblio_id from tbl_datasets UNION
select 'tbl_ecocode_systems' as table_name, biblio_id from tbl_ecocode_systems UNION
select 'tbl_geochron_refs' as table_name, biblio_id from tbl_geochron_refs UNION
select 'tbl_methods' as table_name, biblio_id from tbl_methods UNION
select 'tbl_rdb_systems' as table_name, biblio_id from tbl_rdb_systems UNION
select 'tbl_relative_age_refs' as table_name, biblio_id from tbl_relative_age_refs UNION
select 'tbl_sample_group_references' as table_name, biblio_id from tbl_sample_group_references UNION
select 'tbl_site_other_records' as table_name, biblio_id from tbl_site_other_records UNION
select 'tbl_site_references' as table_name, biblio_id from tbl_site_references UNION
select 'tbl_species_associations' as table_name, biblio_id from tbl_species_associations UNION
select 'tbl_taxa_synonyms' as table_name, biblio_id from tbl_taxa_synonyms UNION
select 'tbl_taxonomic_order_biblio' as table_name, biblio_id from tbl_taxonomic_order_biblio UNION
select 'tbl_taxonomy_notes' as table_name, biblio_id from tbl_taxonomy_notes UNION
select 'tbl_tephra_refs' as table_name, biblio_id from tbl_tephra_refs UNION
select 'tbl_text_biology' as table_name, biblio_id from tbl_text_biology UNION
select 'tbl_text_distribution' as table_name, biblio_id from tbl_text_distribution UNION
select 'tbl_text_identification_keys' as table_name, biblio_id from tbl_text_identification_keys
) select *
from tbl_biblio b
left join all_biblio_references a using (biblio_id)
where a.biblio_id is null
ORDER BY biblio_id
I'm adding a document to clarify what I mean with my previous comment that when rereading was a bit confusing. But I will check a sample set of biblio records and their connections to some other datasets and sites and get back to you @visead
Biblio test data from Bugsdata 20200302 TCountsheet.CountsheetName = Dataset name in SEAD
SELECT TBiblio.*, TSite.SiteCODE, TSite.SiteName, TSite.Country, TCountsheet.CountsheetCODE, TCountsheet.CountsheetName FROM (TSite INNER JOIN TCountsheet ON TSite.SiteCODE = TCountsheet.SiteCODE) INNER JOIN (TBiblio INNER JOIN TSiteRef ON TBiblio.REFERENCE = TSiteRef.Ref) ON TSite.SiteCODE = TSiteRef.SiteCODE; QSiteDataRefs.xlsx
SELECT TBiblio.*, TBiology.CODE, TBiology.Data FROM TBiblio INNER JOIN TBiology ON TBiblio.REFERENCE = TBiology.Ref; QRefsBiology.xlsx
SELECT TBiblio.*, TDistrib.CODE, TDistrib.Data FROM TBiblio INNER JOIN TDistrib ON TBiblio.REFERENCE = TDistrib.Ref; QRefsDistrib.xlsx
I have checked all of the Bugs bibliography and they seem to check out, however, I have not been able to verify some of the newer additions but I know that you and Roger have been handling those. Unfortunately I did not have time to go through the MAL section of the bibliography or any other additions from the newly ingested datasets - these need to be a subject for QA on a later date.
I sent to Roger an update which concerns mainly minor things in tbl_biblio where the wrong year is given or where DOI or URL is contained in the title column instead of their respective columns.
This update regards ca. 1800 posts.
Loves updates handled in 20221205_DML_QUALITY_CONTROL_BIBLIO_UPDATES.
Update is currently missing in refactor DDL. Need SQL update of new fields authors, title, full_ reference based on old fields.