netwerk-digitaal-erfgoed / project-organisations-datasets

Repository for project "Organisations and datasets in the network"

Doubles in foaf:Organization #18

Open liekeverhelst opened 4 years ago

liekeverhelst commented 4 years ago

There are quite a few duplicates in foaf:Organization: the URIs differ, but the labels are the same or almost the same. Where the labels are exactly the same, the code should check whether an organization with that name already exists. Where the labels are almost the same (example: "Heemkring Molenheide" and "Heemkring 'Molenheide'"), some other kind of matching is needed, and the harvest should be corrected so that the information is linked to the proper organization. There are also cases where the labels are very different ("BG (Netherlands Instituut van Beeld en Geluid)" and "Beeld & Geluid"). Manual correction is probably necessary there.
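The three cases above could be told apart roughly as follows. This is only a sketch: the function names, the normalization rules, and the 0.9 threshold are assumptions for illustration, not something the harvester does today.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_label(label: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", label)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = re.sub(r"[^\w\s]", " ", no_accents.lower())
    return " ".join(no_punct.split())


def classify_pair(a: str, b: str, threshold: float = 0.9) -> str:
    """Classify two organization labels (threshold is an assumed cut-off).

    Returns "exact", "normalized", "fuzzy", or "unmatched". An "unmatched"
    pair is either two different organizations or a case for manual review.
    """
    if a == b:
        return "exact"
    na, nb = normalize_label(a), normalize_label(b)
    if na == nb:
        return "normalized"
    if SequenceMatcher(None, na, nb).ratio() >= threshold:
        return "fuzzy"
    return "unmatched"


if __name__ == "__main__":
    print(classify_pair("Heemkring Molenheide", "Heemkring 'Molenheide'"))
    print(classify_pair("BG (Netherlands Instituut van Beeld en Geluid)", "Beeld & Geluid"))
```

With these rules the "Heemkring" pair matches after normalization, while the "Beeld & Geluid" pair stays unmatched and would still need manual correction.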

liekeverhelst commented 4 years ago

Or maybe do a full harvest and clean up the duplicates manually once the harvest is complete. That could potentially be a lot of work, so a friendly user interface would help.

roland-c commented 4 years ago

This is a fundamental issue when collecting information about the same (unknown) individuals from different datasets. It is why we need shared identifiers; these are often missing, so we have to mint them from labels. The ambiguity this creates is inevitable. It can partly be resolved with additional (fuzzy) matching of labels, which is another process step in compiling the register. This additional step should, imho, be based on the data we have already gathered and that is accessible with SPARQL. Part of the work will be manual, but the result of both automatic and manual correction is data additional to the original data and should be preserved as such.
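A minimal sketch of such a post-processing step, assuming the (URI, label) pairs have already been fetched with a SPARQL SELECT query. The example.org URIs are made up, and using owl:sameAs to record the corrections is my assumption. The point is only that the links are written as separate triples and the harvested data is left untouched.

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def find_duplicate_groups(rows):
    """Group organization URIs whose labels match after simple normalization.

    rows: iterable of (uri, label) tuples, e.g. the result of a SPARQL SELECT.
    Returns {normalized label: sorted list of URIs} for labels with 2+ URIs.
    """
    groups = defaultdict(set)
    for uri, label in rows:
        key = " ".join(label.casefold().split())
        groups[key].add(uri)
    return {key: sorted(uris) for key, uris in groups.items() if len(uris) > 1}


def same_as_triples(groups):
    """Emit N-Triples lines linking each duplicate to the first URI of its group."""
    lines = []
    for uris in groups.values():
        canonical, *others = uris
        for other in others:
            lines.append(f"<{other}> <{OWL_SAME_AS}> <{canonical}> .")
    return lines


if __name__ == "__main__":
    # Hypothetical rows; real ones would come from the SPARQL endpoint.
    rows = [
        ("https://example.org/reg-a/org/1", "Beeld & Geluid"),
        ("https://example.org/reg-b/org/9", "beeld &  geluid"),
        ("https://example.org/reg-a/org/2", "Heemkring Molenheide"),
    ]
    for line in same_as_triples(find_duplicate_groups(rows)):
        print(line)
```

Manual corrections could be stored in the same way, so that automatic and manual links end up in one additional dataset next to the original harvest.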

What I can do now is remove the register prefix from the URI when minting URIs for Organizations, so that exactly matching labels end up with the same URI. The same was done for mediaType.
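To illustrate the effect, here is a sketch with a hypothetical base URI and a hypothetical `mint_uri` helper. The actual URI pattern used in the harvest may differ.

```python
from urllib.parse import quote

# Hypothetical base URI, for illustration only.
BASE = "https://example.org/id/organization/"


def mint_uri(label, register_prefix=None):
    """Mint an organization URI from its label.

    With a register prefix, the same label harvested from two registers gets
    two URIs; without it, exactly matching labels share one URI.
    """
    encoded = quote(label, safe="")
    if register_prefix:
        return f"{BASE}{register_prefix}/{encoded}"
    return f"{BASE}{encoded}"


if __name__ == "__main__":
    label = "Heemkring Molenheide"
    # With the prefix: two different URIs for the same label.
    print(mint_uri(label, "reg-a"), mint_uri(label, "reg-b"))
    # Without the prefix: one URI, whichever register it came from.
    print(mint_uri(label))
```

Note that this only merges labels that are identical character for character. "Heemkring 'Molenheide'" still gets its own URI and needs the extra matching step.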