humlab-sead / sead_change_control

Sane SEAD change control using Sqitch.
1 stars 0 forks source link

Tbl_biblio quality assurance and update #81

Closed Loer0022 closed 1 year ago

Loer0022 commented 4 years ago

Currently the effort has been made to organize and provide a consistent and correct output for the MAL dataset. All MAL bibliography currently contained in SEAD has now (unless specific connections between biblio and datasets were uncertain/unclear in the metadata) been connected to their respective dataset. A significant number of bibliographical records has however not been ingested into the dataset and it does not appear to be any pattern to why these haven’t been included in prior ingestion of data (see separate issue #82 for ingestion of the missing data).

The Ceramic dataset has also been made to reviewed in regard to tbl_biblio to ensure that the bibliography is present and its connections to tbl_dataset is correct. Some duplicate records were found in tbl_biblio where one source had four unique iterations (biblio_id 6555, 6651, 6672 and 6544), the record 6555 should be used in the database because it is the only completely correct record. The other records do not have any connections, use SQL to double check, to any tables and can safely be removed from the database.


--select string_agg(Chr(10), 'select ''' || table_name || ''' as table_name, biblio_id from  ' || table_name || ' UNION ')
--from clearing_house.view_foreign_keys
--where schema_name = 'public'
--  and column_name = 'biblio_id'

with all_biblio_references as (
                          select 'tbl_dataset_masters' as table_name, biblio_id from  tbl_dataset_masters UNION
                          select 'tbl_datasets' as table_name, biblio_id from  tbl_datasets UNION
                          select 'tbl_ecocode_systems' as table_name, biblio_id from  tbl_ecocode_systems UNION
                          select 'tbl_geochron_refs' as table_name, biblio_id from  tbl_geochron_refs UNION
                          select 'tbl_methods' as table_name, biblio_id from  tbl_methods UNION
                          select 'tbl_rdb_systems' as table_name, biblio_id from  tbl_rdb_systems UNION
                          select 'tbl_relative_age_refs' as table_name, biblio_id from  tbl_relative_age_refs UNION
                          select 'tbl_sample_group_references' as table_name, biblio_id from  tbl_sample_group_references UNION
                          select 'tbl_site_other_records' as table_name, biblio_id from  tbl_site_other_records UNION
                          select 'tbl_site_references' as table_name, biblio_id from  tbl_site_references UNION
                          select 'tbl_species_associations' as table_name, biblio_id from  tbl_species_associations UNION
                          select 'tbl_taxa_synonyms' as table_name, biblio_id from  tbl_taxa_synonyms UNION
                          select 'tbl_taxonomic_order_biblio' as table_name, biblio_id from  tbl_taxonomic_order_biblio UNION
                          select 'tbl_taxonomy_notes' as table_name, biblio_id from  tbl_taxonomy_notes UNION
                          select 'tbl_tephra_refs' as table_name, biblio_id from  tbl_tephra_refs UNION
                          select 'tbl_text_biology' as table_name, biblio_id from  tbl_text_biology UNION
                          select 'tbl_text_distribution' as table_name, biblio_id from  tbl_text_distribution UNION
                          select 'tbl_text_identification_keys' as table_name, biblio_id from  tbl_text_identification_keys
) select *
  from tbl_biblio b
  left join all_biblio_references a using (biblio_id)
  where a.biblio_id is null
  ORDER BY biblio_id

All ceramics connections made between tbl_biblio and tbl_datasets are correct, however, there are many records that currently lack connections that are clearly defined in the metadata (readily provided by Mattias). I have added bibliography to a couple datasets but did not have time to go through the entire list. As mentioned, Mattias has the needed metadata to make these corrections and all that is needed to be done is use a query to retrieve data from tbl_datasets where biblio_id IS NULL and belongs to master_set_id 3 to crosscheck and correct it with the metadata.

The Dendro dataset has also been assured where some faults were found relating to an issue reading keys when importing the data. This resulted in that MAL bibliography was shown for VDL dataset, this was detected first when using the query below to see which biblio_id was connected where (this resulted in an inverted relationship where no MAL biblio_id should be connected anywhere whereas the Dendro biblio_id should, a false positive). Some random sites were controlled using SuperSEAD to get a quick overview of the dataset, MAL bibliography was detected here and first believed to be a minor mistake when creating connections. But on further inspection in tbl_datasets controlling both methods and dataset_masters_id it was concluded that a critical error had occurred which had mislabelled a large section of SEAD bibliography. This issue was however easily fixed, see issue #79

Loer0022 commented 4 years ago

Update for some issues found in tbl_biblio tbl_biblio update MAL-Ceramic revised.txt

Update to create and correct some connections made in tbl_datasets tbl_datasets update MAL-Ceramic.txt

roger-mahler commented 1 year ago

Update for some issues found in tbl_biblio:

Update for some issues found in tbl_datasets: