tdwg / bdq

Biodiversity Data Quality (BDQ) Interest Group
https://github.com/tdwg/bdq

TG2 - What about testing for duplicate records in a dataset? #176

ianengelbrecht commented 5 years ago

While the tests and assertions are intended to be applied to individual records, is there room for tests like this on a dataset as a whole? Would they be useful or is it implicit that datasets shouldn't have duplicate records?

ArthurChapman commented 5 years ago

We discussed this in detail previously. Identifying duplicates is problematic for several reasons:

  1. What people mean by a duplicate can differ:
    a) Collector + collector number: some collectors start their numbering anew each year, so this is unreliable.
    b) Collector + date: the collector may have collected from more than one plant of a species on the same day.
    c) Collector + collector number + date: the collector could have collected from the same plant and split the material into two specimens, one with flowers and one with fruits.
    d) Collector + collector number + date + location: but again see c) above.
    e) Different institutions may use different formats for a collector (F.Mueller, F von Mueller, F.M., FvM, F.Muell., etc.) in their databases, adding another layer of complexity.
  2. A duplicate is not an "error". Botanists "duplicate" parts of a plant to many institutions, often before georeferencing, which can mean each of the duplicates is different (and difficult to detect), yet each provides valuable input. Duplicates can also mean different things to different users: to taxonomists they supply valuable additional information; to a modeller they may be just noise and a source of bias.
  3. In modelling, a duplicate may be regarded as more than one record in a grid square (making use of the term confusing, as such records are not duplicates other than by location).
  4. We have retained the tests we originally had for duplicates among the Supplementary (non-Core) tests, along with hundreds of other tests we regarded at the time as non-Core.
  5. Once specimen-level GUIDs become accepted practice, true duplicates should be able to be identified, linked and documented.

Building a reliable test of this nature is difficult (CRIA in Brazil did it, and it works reasonably well), but again, from my view, it is not a core test.

ianengelbrecht commented 5 years ago

Thanks for the feedback @ArthurChapman. My question originates from working with data extracted by various institutions that haven't understood the effect of joins over 1:n relationships in their database queries, so an export can contain several copies of a record with different values in a particular field (like collector). Agreed, though, that this should not be a core test. Are the supplementary tests documented online?
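
[Editor's note: purely as an illustrative aside, not part of the thread, here is a minimal pandas sketch of the join artefact described above: a query joining over a 1:n relationship repeats the parent record, and the repeats can be flagged by counting distinct values per record identifier. All table names, column names and values are hypothetical.]

```python
# Minimal sketch (hypothetical data) of how a join over a 1:n relationship
# multiplies rows in an export, and how the repeated records can be flagged.
import pandas as pd

specimens = pd.DataFrame({
    "catalogNumber": ["PRE-001", "PRE-002"],
    "scientificName": ["Protea repens", "Erica cerinthoides"],
})
# 1:n relationship: a specimen may have more than one associated collector.
collectors = pd.DataFrame({
    "catalogNumber": ["PRE-001", "PRE-001", "PRE-002"],
    "recordedBy": ["F.Mueller", "J.Smith", "A.Jones"],
})

export = specimens.merge(collectors, on="catalogNumber", how="left")
print(export)  # PRE-001 now appears twice, differing only in recordedBy

# Flag records that the join repeated with differing collector values.
repeated = (export.groupby("catalogNumber")["recordedBy"]
                  .nunique()
                  .loc[lambda n: n > 1])
print("Records repeated by the join:", list(repeated.index))
```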

ArthurChapman commented 5 years ago

Not really documented and I am sure that they won't be in this process. Many of the Supplementary tests are "use/user dependent", whereas most of the Core Tests are "use/user independent", i.e. they pick up problems (often true errors) regardless of what the use is likely to be. There is a spreadsheet; I don't have the link with me as I am travelling, but @Tasilee could provide it for you. They are partly documented in that spreadsheet, but haven't been put into the formats that the core tests have evolved to during discussions.

Tasilee commented 5 years ago

@ianengelbrecht, I've sent you a link to the original master spreadsheet. Note that it has multiple worksheets; one covers the Supplementary tests.

timrobertson100 commented 4 years ago

> While the tests and assertions are intended to be applied to individual records, is there room for tests like this on a dataset as a whole? Would they be useful or is it implicit that datasets shouldn't have duplicate records?

They definitely would be useful, @ianengelbrecht, and increasingly necessary. At GBIF scale, for example, DwC Occurrence records for a DNA sequence, a citation in literature, and specimens from institutions can all relate to the same collecting event in nature, but the idiosyncrasies of data management make it difficult to link them. We are currently exploring clustering algorithms to provide these links.

> Not really documented and I am sure that they won't be in this process

Do you envisage a future process accommodating this kind of thing, please? There is a whole family of quality tests that is needed (e.g. outlier detection, likelihood of identification, similar records, inferring grids, etc.).

Tasilee commented 4 years ago

I agree with @timrobertson100: the tests were designed to apply at the record level, and therefore assertions from the tests can be accumulated over any set of records.

I also agree with @ArthurChapman regarding detecting 'duplicate' records. The ALA also has a duplicate record tester, but the permutations of pathways and alterations/amendments require a more comprehensive review and a more standardized use of identifiers. The latter is happening (e.g., the GBIF-ALA alignment project), but slowly.

chicoreus commented 1 year ago

@ianengelbrecht the framework within which we are defining these tests allows for tests that operate on data sets (MultiRecord in terms of the framework), including tests that would specify how duplicates might be detected and how those duplicates might be dealt with. Running such a test for QualityAssurance might result in assertions that duplicate occurrence records should be deduplicated. Similarly, a test could be phrased to add record relationships linking potential duplicate occurrence records. The potential data quality needs are diverse, and the range of what such tests might specify is large, so we haven't tried to define one here.

Taking @ArthurChapman's case "Collector + Collector no. + date", we might frame a (non-CORE) test in the form:

| Field | Value |
| --- | --- |
| Label | AMENDMENT_OCCURRENCE_DUPLICATED |
| Description | Proposes an amendment asserting resource relationships between SingleRecords within the MultiRecord that appear to be duplicates based on dwc:recordedBy, dwc:fieldNumber and dwc:eventDate. |
| Output Type | Amendment |
| Resource Type | MultiRecord |
| Information Elements | dwc:recordedBy, dwc:fieldNumber, dwc:eventDate, dwc:occurrenceID, resourceID, relationshipOfResourceID, relatedResourceID |
| Expected Response | INTERNAL_PREREQUISITES_NOT_MET if any of dwc:recordedBy, dwc:fieldNumber, dwc:eventDate, or dwc:occurrenceID are EMPTY; FILLED_IN for each SingleRecord where dwc:recordedBy, dwc:fieldNumber, and dwc:eventDate are identical to those of another SingleRecord with a different dwc:occurrenceID, asserting resourceID, relationshipOfResourceID="duplicate of", and relatedResourceID relating these records. |
| Data Quality Dimension | Uniqueness |

I am not in any way recommending this test, just indicating how such a test might be phrased in a form consistent with the framework and the CORE tests we have defined. It is quite different, though, in that it pertains to a MultiRecord rather than a SingleRecord, and we haven't explored that space very much.
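
[Editor's note: purely as an illustrative sketch, not part of the framework or of any proposal in this thread, the Expected Response above might be prototyped along the following lines, with plain Python dicts standing in for SingleRecords. The record values, helper names and response labels are assumptions that loosely follow the table above.]

```python
# Hypothetical sketch of the duplicate-flagging amendment sketched above:
# group SingleRecords by the composite key (recordedBy, fieldNumber,
# eventDate) and propose "duplicate of" relationships between records
# sharing a key but having different dwc:occurrenceID values.
from collections import defaultdict
from itertools import combinations

KEY_TERMS = ("recordedBy", "fieldNumber", "eventDate")

def propose_duplicate_relationships(records):
    """records: list of dicts keyed by Darwin Core term names."""
    # INTERNAL_PREREQUISITES_NOT_MET if any required term is empty.
    for rec in records:
        if any(not rec.get(t) for t in KEY_TERMS + ("occurrenceID",)):
            return {"status": "INTERNAL_PREREQUISITES_NOT_MET", "relationships": []}

    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[t] for t in KEY_TERMS)].append(rec["occurrenceID"])

    relationships = []
    for occurrence_ids in groups.values():
        # Only distinct occurrenceIDs sharing the key are candidate duplicates.
        for a, b in combinations(sorted(set(occurrence_ids)), 2):
            relationships.append({"resourceID": a,
                                  "relationshipOfResource": "duplicate of",
                                  "relatedResourceID": b})
    status = "FILLED_IN" if relationships else "NOT_AMENDED"
    return {"status": status, "relationships": relationships}

# Example: two records sharing collector, field number and date but with
# different occurrenceIDs yield one proposed "duplicate of" relationship.
example = propose_duplicate_relationships([
    {"recordedBy": "F.Mueller", "fieldNumber": "1234", "eventDate": "1885-03-02", "occurrenceID": "urn:a"},
    {"recordedBy": "F.Mueller", "fieldNumber": "1234", "eventDate": "1885-03-02", "occurrenceID": "urn:b"},
])
print(example["status"], example["relationships"])
```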

Tasilee commented 1 year ago

I agree with @chicoreus. We have concluded that our tests relate to Darwin Core terms in a single record. De-duplication is also non-trivial: in many cases I have seen in the ALA (where duplicate records are flagged), the 'best' outcome, after considerable research, is an 'amalgamated' record.
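
[Editor's note: purely as an illustration, and not the ALA's procedure, an 'amalgamated' record might be sketched as a field-wise merge that keeps values on which all non-empty sources agree and flags conflicts for human review. The field names, values and merge policy here are assumptions.]

```python
# Hypothetical sketch of amalgamating candidate duplicate records:
# keep a value when all non-empty sources agree, otherwise flag the
# term as a conflict needing research. Not the ALA's algorithm.
def amalgamate(records):
    merged, conflicts = {}, []
    terms = {t for rec in records for t in rec}
    for term in sorted(terms):
        values = {rec[term] for rec in records if rec.get(term)}
        if len(values) == 1:
            merged[term] = values.pop()
        elif len(values) > 1:
            conflicts.append(term)   # disagreement: needs human review
    return merged, conflicts

merged, conflicts = amalgamate([
    {"occurrenceID": "a1", "recordedBy": "F.Mueller", "eventDate": "1885-03-02"},
    {"occurrenceID": "a2", "recordedBy": "F. von Mueller", "eventDate": "1885-03-02"},
])
print(merged)     # {'eventDate': '1885-03-02'}
print(conflicts)  # ['occurrenceID', 'recordedBy']
```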

ArthurChapman commented 1 year ago

As mentioned by @chicoreus and @Tasilee, our tests examine data within a single record within one database. Testing for duplicates is a much larger problem:

  1. Duplicate records of the same record within one database
  2. Duplicates within one database, where some institutions may keep separate, linked records for the specimen, liquid material, DNA, etc.
  3. Duplicates between databases which are actually different specimens and may show different things (syntypes, isotypes, etc.)
  4. Duplicates of a species at a location (a term used in modelling, where multiple records of a species in one grid cell may be deleted)
    • also see my comment above.

These types of tests become very user/use dependent.

chicoreus commented 1 year ago

> These types of tests become very user/use dependent.

Which contributes to our judgement that they don't belong in CORE.