Data validation tools to prevent duplicates/misspellings

david-linssen commented 4 years ago

Submitted by UPF; relevant for the Scholars and Enthusiasts use cases. Awarded 3 dots, assigned to @alastair.

david-linssen commented 4 years ago

also submitted by @ChristiaanScheermeijer

david-linssen commented 4 years ago

Scope for M24 is: import existing relations from IMSLP/Wikipedia/MusicBrainz

CasperCDR commented 4 years ago

Duplicates from external sources can be prevented by adding a unique node property constraint, for example on the source property. The identifier field could also be used, but it is currently filled with a UUID. The identifier (based on Thing) could instead hold the source URI. Are there any implications to doing this? If this is not a desirable solution, clients should check for existing nodes before inserting. Discussion: which field(s) should be checked to decide whether a duplicate entry exists?
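To make the "check for existing nodes before inserting" option concrete, here is a minimal sketch. It assumes the source URL is the field being checked, and `NodeStore` is a hypothetical in-memory stand-in for the database, not part of the CE-API:

```python
import uuid
from typing import Dict, Optional


class NodeStore:
    """Hypothetical in-memory stand-in for the graph database."""

    def __init__(self) -> None:
        self._by_source: Dict[str, dict] = {}

    def find_by_source(self, source: str) -> Optional[dict]:
        return self._by_source.get(source)

    def insert(self, node: dict) -> dict:
        self._by_source[node["source"]] = node
        return node


def get_or_create(store: NodeStore, source: str, name: str) -> dict:
    """Return the node already stored for `source`, or insert a new one."""
    existing = store.find_by_source(source)
    if existing is not None:
        return existing
    # The identifier stays a fresh UUID; only `source` is used for the check.
    node = {"identifier": str(uuid.uuid4()), "source": source, "name": name}
    return store.insert(node)
```

A client-side check like this is not atomic against a real database: two clients can both see "no node" and both insert, which is the case a database-level constraint would cover.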

alastair commented 4 years ago

For duplicates from the same database, we can use the source field. This will be the exact link to the page where the data was collected from, e.g. https://musicbrainz.org/artist/8d610e51-64b4-4654-b8df-064b0fb7a9d9 or https://www.wikidata.org/entity/Q7304

For M24 we will import data from:

MusicBrainz
Wikidata
IMSLP
CPDL
Muziekweb
VIAF (if it exists in one of the above)
Library of Congress (if it exists in one of the above)
WorldCat (if it exists in one of the above)
ISNI (if it exists in one of the above)

If any of these sources has existing metadata links to any other source, we will use skos:exactMatch to state that the linked items are the same.
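As a rough sketch of that linking step: given imported records that carry their outbound links, we can emit one pair for every link whose target was also imported. The record shape (`source`, `links`) is an assumption made for illustration, not the importer's actual data model:

```python
from typing import Dict, Iterable, List, Set, Tuple


def exact_match_pairs(records: Iterable[Dict]) -> List[Tuple[str, str]]:
    """Return sorted (source, source) pairs for links between imported records."""
    records = list(records)
    known: Set[str] = {r["source"] for r in records}
    pairs: Set[Tuple[str, str]] = set()
    for record in records:
        for link in record.get("links", []):
            # Skip links to items we did not import, and self-links.
            if link in known and link != record["source"]:
                # Sort each pair so A->B and B->A collapse into one entry.
                a, b = sorted((record["source"], link))
                pairs.add((a, b))
    return sorted(pairs)
```

Each returned pair would then become an exact-match relation between the two nodes.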

The next part of this task (which for now will probably be out of scope for M24) is to match items when there are no existing relationships (e.g. an artist that appears on both MusicBrainz and Muziekweb but has no links between the two, either directly or through VIAF/WorldCat, etc.). This matching will require some kind of heuristic (e.g. edit distance), or could be a crowd-sourcing task. Once we identify these links, we should contribute them back to the primary data sources.
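For the edit-distance heuristic, something along these lines could produce candidate matches for a human (or a crowd-sourcing task) to confirm. This is only a sketch: the Levenshtein distance, the normalisation and the `0.85` threshold are illustrative assumptions, not a decided approach:

```python
from typing import Iterable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after case-folding."""
    a, b = a.casefold().strip(), b.casefold().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def candidate_matches(names_a: Iterable[str], names_b: Iterable[str],
                      threshold: float = 0.85) -> List[Tuple[str, str, float]]:
    """Return (name_a, name_b, score) triples at or above `threshold`, best first."""
    names_b = list(names_b)
    scored = [(x, y, similarity(x, y)) for x in names_a for y in names_b]
    return sorted((t for t in scored if t[2] >= threshold), key=lambda t: -t[2])
```

Comparing every name against every other is quadratic, so a real import would need some blocking step first (for example, only comparing names that share a first letter or a birth year).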

ChristiaanScheermeijer commented 3 years ago

@alastair, in a recent version of neo4j-graphql-js it is possible to add a @unique directive to properties whose values may occur only once in the Neo4j instance. This should also be added to all identifier properties, as the same identifier can currently exist on multiple nodes.

https://grandstack.io/docs/graphql-schema-directives

ChristiaanScheermeijer commented 3 years ago

I also suggest that we add some custom mutations that make it easier to "tag" nodes as related to each other. Currently we need to perform multiple queries/mutations to create a bi-directional relationship between two nodes.

(p1:Person)-[:EXACT_MATCH]->(p2:Person)
(p2:Person)-[:EXACT_MATCH]->(p1:Person)

input _matchInput {
  identifier: ID!
}

type _matchResult {
  fromIdentifier: ID!
  toIdentifier: ID!
}

type Mutation {
  AddBroadMatch(from: _matchInput!, to: _matchInput!) : _matchResult
  AddCloseMatch(from: _matchInput!, to: _matchInput!) : _matchResult
  AddExactMatch(from: _matchInput!, to: _matchInput!) : _matchResult
  AddNarrowMatch(from: _matchInput!, to: _matchInput!) : _matchResult
  AddRelatedMatch(from: _matchInput!, to: _matchInput!) : _matchResult

  RemoveBroadMatch(from: _matchInput!, to: _matchInput!) : _matchResult
  RemoveCloseMatch(from: _matchInput!, to: _matchInput!) : _matchResult
  RemoveExactMatch(from: _matchInput!, to: _matchInput!) : _matchResult
  RemoveNarrowMatch(from: _matchInput!, to: _matchInput!) : _matchResult
  RemoveRelatedMatch(from: _matchInput!, to: _matchInput!) : _matchResult
}
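What such a mutation would do behind the scenes can be sketched as follows, with a plain set of edges standing in for Neo4j. One point to consider: in SKOS, exactMatch, closeMatch and relatedMatch are symmetric, but broadMatch and narrowMatch are inverses of each other, so this sketch writes the inverse type on the reverse edge. Whether the CE-API should do the same is an assumption, and the function names are hypothetical:

```python
from typing import Dict, Set, Tuple

# Relationship type written on the reverse edge for each forward type.
# exact/close/related are symmetric in SKOS; broad and narrow are inverses.
INVERSE: Dict[str, str] = {
    "EXACT_MATCH": "EXACT_MATCH",
    "CLOSE_MATCH": "CLOSE_MATCH",
    "RELATED_MATCH": "RELATED_MATCH",
    "BROAD_MATCH": "NARROW_MATCH",
    "NARROW_MATCH": "BROAD_MATCH",
}

Edge = Tuple[str, str, str]  # (from_identifier, relationship_type, to_identifier)


def _check(rel: str, from_id: str, to_id: str) -> None:
    if rel not in INVERSE:
        raise ValueError(f"unknown match type: {rel}")
    if from_id == to_id:
        raise ValueError("cannot match a node to itself")


def add_match(edges: Set[Edge], rel: str, from_id: str, to_id: str) -> Dict[str, str]:
    """Create both directions of a match in one call."""
    _check(rel, from_id, to_id)
    edges.add((from_id, rel, to_id))
    edges.add((to_id, INVERSE[rel], from_id))
    return {"fromIdentifier": from_id, "toIdentifier": to_id}


def remove_match(edges: Set[Edge], rel: str, from_id: str, to_id: str) -> Dict[str, str]:
    """Remove both directions of a match; missing edges are ignored."""
    _check(rel, from_id, to_id)
    edges.discard((from_id, rel, to_id))
    edges.discard((to_id, INVERSE[rel], from_id))
    return {"fromIdentifier": from_id, "toIdentifier": to_id}
```

Because the edges live in a set, calling `add_match` twice with the same arguments leaves the graph unchanged, which is the behaviour we would want from the mutation as well.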

ChristiaanScheermeijer commented 3 years ago

@CasperCDR @alastair we are now running a recent version of neo4j-graphql-js, which supports the @unique directive. Is it still relevant to add this to the CE-API?

alastair commented 3 years ago

We have @unique on identifier, but we don't have it on source, so it's still possible to import the same item from MusicBrainz twice. Having said that, I don't think it's a good idea to add @unique to source, because we could have multiple objects that describe different aspects of a single source.

I don't know a good way (other than being careful with our code) to ensure that we don't import the same data multiple times.
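One way of "being careful" in the import code might be to treat the pair (node type, source) as the key, so that several objects of different types can still share one source. This is only a sketch; the `type` field and the idea that type plus source is a sufficient key are assumptions:

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (node type, source URL)


def plan_import(existing: Dict[Key, dict],
                incoming: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split `incoming` into nodes to create and nodes that are already present."""
    seen = dict(existing)  # copy, so the caller's index is not modified
    to_create: List[dict] = []
    skipped: List[dict] = []
    for node in incoming:
        key = (node["type"], node["source"])
        if key in seen:
            skipped.append(node)
        else:
            # Also remember nodes from this batch, to catch in-batch repeats.
            seen[key] = node
            to_create.append(node)
    return to_create, skipped
```

This would not help if two objects of the same type legitimately describe different aspects of one source, in which case the key would need a further field.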

trompamusic / ce-api

Data validation tools to prevent duplicates/misspellings #74