Open david-linssen opened 4 years ago
also submitted by @ChristiaanScheermeijer
Scope for M24 is: import existing relations from IMSLP/Wikipedia/MusicBrainz
Duplicates from external sources can be prevented by adding a unique node property constraint, for example on the source property. The identifier field could also be used, but it is currently filled with a UUID. The identifier (based on Thing) could instead hold the source URI. Are there any implications of doing this? If this is not a desirable solution, clients should check for existing nodes before inserting. Discussion: which field(s) should be checked to decide whether a duplicate entry exists?
For duplicates from the same database, we can use the source field. This will be the exact link to the page from which the data was collected, e.g. https://musicbrainz.org/artist/8d610e51-64b4-4654-b8df-064b0fb7a9d9 or https://www.wikidata.org/entity/Q7304
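As a rough illustration of the "check for existing nodes before inserting" option, here is a minimal Python sketch that uses an in-memory dictionary as a stand-in for the graph. The `NodeStore` class and its method names are hypothetical and not part of the CE-API; a real client would run the lookup against Neo4j instead.

```python
from typing import Dict, Optional


class NodeStore:
    """Toy in-memory stand-in for the graph, indexed by source URI."""

    def __init__(self) -> None:
        self._by_source: Dict[str, dict] = {}

    def find_by_source(self, source: str) -> Optional[dict]:
        """Return the node imported from this source URI, if any."""
        return self._by_source.get(source)

    def insert_if_absent(self, node: dict) -> bool:
        """Insert the node unless one with the same source already exists.

        Returns True when the node was inserted, False when it was skipped.
        """
        source = node["source"]
        if source in self._by_source:
            return False
        self._by_source[source] = node
        return True


store = NodeStore()
url = "https://musicbrainz.org/artist/8d610e51-64b4-4654-b8df-064b0fb7a9d9"
# The name is a placeholder; only the source URI matters for the check.
first = store.insert_if_absent({"source": url, "name": "example artist"})
second = store.insert_if_absent({"source": url, "name": "example artist"})
```

Here the second call is skipped because the source URI is already known. Note that this treats the source as a strict key, which is exactly the assumption under discussion.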
For M24 we will import data from:
If any of these sources has existing metadata links to any other source, we will use skos:exactMatch to say that these items are the same.
The next part of this task (which for now will probably be out of scope for M24) is to match items when there are no existing relationships (e.g. an artist who appears on both MusicBrainz and Muziekweb, but whose entries have no links to each other or through VIAF/WorldCat, etc.). This matching will require some kind of heuristic (e.g. edit distance), or could be a crowd-sourcing task. Once we identify these links, we should contribute them back to the primary data sources.
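To make the edit-distance idea concrete, here is a small Python sketch of one possible heuristic: Levenshtein distance normalised to a 0..1 similarity score. The function names and the 0.85 threshold are assumptions for illustration only; real matching would likely need more signals than the name (dates, works, and so on).

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1], ignoring case and surrounding whitespace."""
    a, b = a.casefold().strip(), b.casefold().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def candidate_matches(name: str, others: List[str], threshold: float = 0.85) -> List[str]:
    """Return the names that are similar enough to be reviewed as matches."""
    return [other for other in others if similarity(name, other) >= threshold]
```

The output of `candidate_matches` is best seen as a shortlist for review (for example by the crowd-sourcing task), not as a final decision.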
@alastair, in a recent version of neo4j-graphql-js it is possible to add a @unique directive to properties whose values may occur only once in the Neo4j instance. This should also be added to all identifier properties, as these values can currently exist multiple times.
I also suggest that we add some custom mutations that make it easier to "tag" nodes as related to each other. Currently we need to perform multiple queries/mutations to create a bi-directional relationship between two nodes.
(p1:Person)-[:EXACT_MATCH]->(p2:Person)
(p2:Person)-[:EXACT_MATCH]->(p1:Person)
input _matchInput {
  identifier: ID!
}
type _matchResult {
  fromIdentifier: ID!
  toIdentifier: ID!
}
type Mutation {
  AddBroadMatch(from: _matchInput!, to: _matchInput!): _matchResult
  AddCloseMatch(from: _matchInput!, to: _matchInput!): _matchResult
  AddExactMatch(from: _matchInput!, to: _matchInput!): _matchResult
  AddNarrowMatch(from: _matchInput!, to: _matchInput!): _matchResult
  AddRelatedMatch(from: _matchInput!, to: _matchInput!): _matchResult
  RemoveBroadMatch(from: _matchInput!, to: _matchInput!): _matchResult
  RemoveCloseMatch(from: _matchInput!, to: _matchInput!): _matchResult
  RemoveExactMatch(from: _matchInput!, to: _matchInput!): _matchResult
  RemoveNarrowMatch(from: _matchInput!, to: _matchInput!): _matchResult
  RemoveRelatedMatch(from: _matchInput!, to: _matchInput!): _matchResult
}
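For discussion, here is a rough Python sketch of what such a mutation could do behind the scenes, using a set of edges as a stand-in for the graph. Only EXACT_MATCH appears in the patterns above; the other relationship names and the `INVERSE` table are assumptions. The table follows SKOS, where exactMatch, closeMatch and relatedMatch are symmetric while broadMatch and narrowMatch are inverses of each other, so the reverse edge of a broad match is assumed to be a narrow match.

```python
from typing import Dict, Set, Tuple

# Reverse relationship type for each match type (assumed names, SKOS semantics).
INVERSE = {
    "EXACT_MATCH": "EXACT_MATCH",
    "CLOSE_MATCH": "CLOSE_MATCH",
    "RELATED_MATCH": "RELATED_MATCH",
    "BROAD_MATCH": "NARROW_MATCH",
    "NARROW_MATCH": "BROAD_MATCH",
}

# (from identifier, relationship type, to identifier)
Edge = Tuple[str, str, str]


def add_match(edges: Set[Edge], rel: str, from_id: str, to_id: str) -> Dict[str, str]:
    """Create the relationship and its reverse in one call."""
    edges.add((from_id, rel, to_id))
    edges.add((to_id, INVERSE[rel], from_id))
    return {"fromIdentifier": from_id, "toIdentifier": to_id}


def remove_match(edges: Set[Edge], rel: str, from_id: str, to_id: str) -> Dict[str, str]:
    """Remove the relationship and its reverse; missing edges are ignored."""
    edges.discard((from_id, rel, to_id))
    edges.discard((to_id, INVERSE[rel], from_id))
    return {"fromIdentifier": from_id, "toIdentifier": to_id}


edges: Set[Edge] = set()
result = add_match(edges, "EXACT_MATCH", "p1", "p2")
```

Because the edges live in a set, calling `add_match` twice with the same arguments does not create duplicates, which is roughly what a Cypher MERGE would give in the real resolver.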
@CasperCDR @alastair we are now running a recent version of neo4j-graphql-js, which supports the @unique directive. Is it still relevant to add this to the CE-API?
We have @unique on identifier, but we don't have it on source, so it's still possible to import the same item from MusicBrainz twice.
Having said that, I don't think it's a good idea to add @unique to source, because we could have multiple objects that describe different aspects of a single source.
I don't know a good way (other than being careful with our code) to ensure that we don't import the same data multiple times.
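One possible middle ground, sketched below in Python purely as a suggestion: check for duplicates on the pair (source, node type) instead of on source alone. This would still let several objects of different types share one source page while catching a straight re-import. The `type` field and the `import_nodes` helper are hypothetical, and whether node type is the right second key is an open question.

```python
from typing import Dict, List, Tuple

# (source URI, node type)
Key = Tuple[str, str]


def import_nodes(existing: Dict[Key, dict], incoming: List[dict]) -> List[dict]:
    """Add incoming nodes, skipping (source, type) pairs that already exist.

    Returns the nodes that were skipped as duplicates.
    """
    skipped = []
    for node in incoming:
        key = (node["source"], node["type"])
        if key in existing:
            skipped.append(node)
        else:
            existing[key] = node
    return skipped


existing: Dict[Key, dict] = {}
source = "https://www.wikidata.org/entity/Q7304"
# Two objects describing different aspects of the same source page.
batch = [
    {"source": source, "type": "Person"},
    {"source": source, "type": "CreativeWork"},
]
first_run = import_nodes(existing, batch)
second_run = import_nodes(existing, batch)
```

On the first run both nodes are stored; on the second run both are reported as skipped, so the importer can log them instead of silently creating duplicates.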
Submitted by UPF; relevant for the Scholars & enthusiasts use cases. Awarded 3 dots, assigned to @alastair