ngds / data-quality-tools

These are system-level scripts that check resource links and database integrity

Data Duplication - Records with different guid but same title #4

Open GaryHudmanAZGS opened 6 years ago

GaryHudmanAZGS commented 6 years ago

The GDR harvest process is not processing history correctly: new records are added, but older records are not marked as archived. Approximately 3k records are in error.

smrgeoinfo commented 6 years ago

See spreadsheet summary.

The spreadsheet has 21202 rows and 9003 unique titles. Investigation reveals several conditions here: multiple identical records for the same resource, different resources with the same title, and multiple different records for the same resource. Details are in this and the following comments.
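A minimal sketch of how the row and unique-title counts could be reproduced from an export of the spreadsheet, and how the first condition could be told apart from the other two. The field names `guid` and `title` and the sample rows are assumptions for illustration, not the actual spreadsheet columns.

```python
from collections import Counter, defaultdict

def summarize_titles(rows):
    """Count unique titles and flag titles with duplicate records.

    `rows` is a list of dicts with hypothetical keys "guid" and "title".
    """
    # For each title, count how many times each GUID appears.
    guid_counts = defaultdict(Counter)
    for row in rows:
        guid_counts[row["title"]][row["guid"]] += 1

    # Titles that appear under more than one GUID.
    different_guid = {
        title: sorted(counts)
        for title, counts in guid_counts.items()
        if len(counts) > 1
    }
    # Titles where the same GUID appears more than once (identical records).
    repeated_guid = {
        title: sorted(g for g, n in counts.items() if n > 1)
        for title, counts in guid_counts.items()
        if any(n > 1 for n in counts.values())
    }
    return len(guid_counts), different_guid, repeated_guid

# Hypothetical sample rows, not taken from the real spreadsheet.
rows = [
    {"guid": "g1", "title": "Title A"},
    {"guid": "g2", "title": "Title A"},
    {"guid": "g3", "title": "Title B"},
    {"guid": "g3", "title": "Title B"},
    {"guid": "g4", "title": "Title C"},
]
unique_titles, different_guid, repeated_guid = summarize_titles(rows)
```

This only separates "same GUID repeated" from "different GUIDs, same title". Deciding whether different GUIDs describe the same resource or different resources still needs a look at the records themselves.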

There are duplicate titles (e.g. 4 records have title 'Shallow Geothermal Energy') from the USGIN Geothermal catalog harvest sources that will need to be cleaned up.

The problematic records were harvested from SMU on 3/08/2018; they do not have abstracts. Keep the records with the same title harvested from USGIN on 04/04/2018; these have abstracts.
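The keep/discard rule above could be sketched roughly like this, assuming each record is a dict with hypothetical `guid`, `source`, and `abstract` keys. This is an illustration of the rule, not the cleanup script that was actually run.

```python
def split_by_abstract(group):
    """Split records that share a title into (keep, discard).

    Records with a non-empty abstract are kept; records with a missing
    or blank abstract are flagged for removal. Field names are assumed.
    """
    keep, discard = [], []
    for rec in group:
        if (rec.get("abstract") or "").strip():
            keep.append(rec)
        else:
            discard.append(rec)
    return keep, discard

# Hypothetical records sharing one title.
group = [
    {"guid": "smu-1", "source": "SMU", "abstract": ""},
    {"guid": "usgin-1", "source": "USGIN", "abstract": "Overview of shallow geothermal systems."},
    {"guid": "smu-2", "source": "SMU", "abstract": None},
]
keep, discard = split_by_abstract(group)
```

If every record in a group lacks an abstract, `keep` comes back empty, so a group like that would need manual review rather than automatic deletion.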

smrgeoinfo commented 6 years ago

428 records have duplicate titles from the 'Energy & Geoscience Institute GINstack node' harvest of 12/28/2015. Titles are repeated 2, 3, or 4 times.

Investigated 'Apparent Resistivity Computed, Camas Washington': there are 4 metadata records, and each is a different modeled profile (lines A and B, with different methods/assumptions).

Looks like these duplicate titles are likely different documents, so they will require manual cleanup.

smrgeoinfo commented 6 years ago

2 records from Alaska look like duplicates, harvested from both the USGIN Catalog and the Alaska Node. Keep the Alaska Node records:

cdf82758-5450-498c-b313-101b9b06a843 | Chena Hot Springs, Alaska
4052911d-1b52-40ec-89b9-6f1c880c5072 | Preliminary Theory Of The Wairakei Geothermal Field

smrgeoinfo commented 6 years ago

9 unique titles from NREL Geothermal Data Repository repeated 3 or 4 times, total 29 records: Active Management of Integrated Geothermal-CO2 Storage Reservoirs in Sedimentary Formations Data Files Active Management of Integrated Geothermal-CO2 Storage Reservoirs in Sedimentary Formations Rerervoir-models-inputs-outputs-index.html Active Management of Integrated Geothermal-CO2 Storage Reservoirs in Sedimentary Formations Section 2.1.6.B Active Management of Integrated Geothermal-CO2 Storage Reservoirs in Sedimentary Formations Sections 2.1.2 and 2.1.4 Active Management of Integrated Geothermal-CO2 Storage Reservoirs in Sedimentary Formations Sections 2.1.3 and 2.1.5.A-C Active Management of Integrated Geothermal-CO2 Storage Reservoirs in Sedimentary Formations Sections 2.1.5.D and 2.1.6.A Low-Temperature Hydrothermal Resource Potential Estimate Mullane_66461_NREL.docx Mountain Home Well - Borehole Geophysics Database Mountain_Home_Ftot.csv

Seems like all the 'Active Management of Integrated Geothermal-CO2 Storage Reservoirs in Sedimentary Formations' records would make better sense as one repository item with multiple files.

A search in CKAN for 'Active Management of Integrated Geothermal-CO2 Storage Reservoirs in Sedimentary Formations Sections 2.1.5.D and 2.1.6.A' gets 12 hits, and the records all seem to be about the same repository item (submission 165). There are variations in the punctuation of the abstract, but the content seems pretty much identical. Keywords vary between records. Looks like multiple versions of the same metadata from NREL? Some of the records have the same GUID and title.
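To check "same content, different punctuation" mechanically, one option is to compare abstracts after stripping punctuation, case, and extra whitespace. A rough sketch follows; the function name and sample strings are made up for illustration.

```python
import re

def normalize_abstract(text):
    """Lowercase the text and collapse every run of non-alphanumeric
    characters (punctuation, whitespace, underscores) into one space."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()

# Hypothetical abstracts that differ only in punctuation and spacing.
a = "Geothermal-CO2 storage; reservoir models, inputs and outputs."
b = "Geothermal CO2 storage - reservoir models  inputs and outputs"
same_content = normalize_abstract(a) == normalize_abstract(b)
```

Records whose normalized abstracts match are candidates for being versions of the same metadata. Keyword differences would not be caught by this and would still need a separate comparison.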

smrgeoinfo commented 6 years ago

From USGS: 86 titles were harvested both from USGS and from SMU.

Investigation of 'Audio-Magnetotelluric Data Log And Station Location Map For Gerlach Northwest Known Geothermal Resource Area, Nevada': a search in CKAN gets three hits with the same title (one with different capitalization; it has the EGI distribution). Each has a different distribution link for the document: from USGS ScienceBase, from OSTI, and from EGI.

Investigating 'Near-surface heat flow in Saline Valley, California' finds the same pattern: USGS publications that have been loaded into the EGI and OSTI repositories.
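A sketch of how this pattern could be surfaced in bulk: group records by title ignoring capitalization, then list the hosts their distribution links point to. The record layout (`title`, `distribution_urls`) and the example.org URLs are assumptions for illustration only.

```python
from collections import defaultdict
from urllib.parse import urlparse

def hosts_by_title(records):
    """Group records by case-insensitive title and collect the hosts
    of their distribution links. Field names are hypothetical."""
    hosts = defaultdict(set)
    for rec in records:
        # Collapse whitespace and ignore case so capitalization
        # variants of the same title land in one group.
        key = " ".join(rec["title"].split()).casefold()
        for url in rec["distribution_urls"]:
            hosts[key].add(urlparse(url).netloc)
    return dict(hosts)

# Hypothetical records; the hosts stand in for the real repositories.
records = [
    {"title": "Heat Flow Map", "distribution_urls": ["https://usgs.example.org/doc1"]},
    {"title": "Heat flow map", "distribution_urls": ["https://egi.example.org/files/doc1.pdf"]},
    {"title": "Heat Flow Map", "distribution_urls": ["https://osti.example.org/biblio/1"]},
]
grouped = hosts_by_title(records)
```

A title that maps to several hosts is likely one publication mirrored in more than one repository, which is the case described above.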