ror-community / ror-roadmap

Central information about what is happening at ROR and how to contribute feedback

[FEATURE] Incorporate automatic Wikidata updating into release process #170

Open amandafrench opened 1 year ago

amandafrench commented 1 year ago

Describe the problem you would like to solve:
Currently @arthurpsmith does periodic syncing of the ROR registry to Wikidata -- see https://www.wikidata.org/wiki/User:APSbot. Arthur also does some manual work to prevent duplicate ID creation in Wikidata.

Describe the solution you'd like:
Ideally, Wikidata would be updated to match updates in ROR automatically upon every new release of the ROR registry, as part of the release process, and duplicate Wikidata IDs would not be created.

Who would benefit from this feature?
Users of Wikidata, and users of both Wikidata and ROR.

arthurpsmith commented 1 year ago

For reference (I shared this with Adam last year) - the things I run for this are in GitHub under https://github.com/arthurpsmith/wikidata-tools - see https://github.com/arthurpsmith/wikidata-tools/blob/master/APSbot/ROR/ROR_UPDATE_README in particular.

To compare with what's already in Wikidata, one important step is fetching the ROR IDs already there. The following SPARQL query is what I use:

    SELECT ?item ?ror ?deprecated WHERE {
      ?item p:P6782 ?stmt .
      ?stmt ps:P6782 ?ror ;
            wikibase:rank ?rank .
      BIND(?rank = wikibase:DeprecatedRank AS ?deprecated)
    }
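For anyone wanting to script this step, here is a minimal Python sketch of running that query and flattening the result. It is not taken from the wikidata-tools repo: the function names `parse_bindings` and `fetch_ror_statements` are made up for illustration, and it assumes the public Wikidata Query Service endpoint and the standard SPARQL JSON results layout.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint (assumed here; not named in the thread).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?item ?ror ?deprecated WHERE {
  ?item p:P6782 ?stmt .
  ?stmt ps:P6782 ?ror ;
        wikibase:rank ?rank .
  BIND(?rank = wikibase:DeprecatedRank AS ?deprecated)
}
"""


def parse_bindings(results):
    """Flatten SPARQL JSON results into a list of plain dicts (hypothetical helper)."""
    rows = []
    for binding in results["results"]["bindings"]:
        # ?item comes back as an entity URI; keep only the trailing Q-id.
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        rows.append({
            "qid": qid,
            "ror": binding["ror"]["value"],
            # xsd:boolean literals are serialized as the strings "true"/"false".
            "deprecated": binding["deprecated"]["value"] == "true",
        })
    return rows


def fetch_ror_statements(user_agent):
    """Run the query over HTTP and return parsed rows (hypothetical helper)."""
    params = urllib.parse.urlencode({"query": QUERY, "format": "json"})
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={
            "User-Agent": user_agent,
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))
```

Calling `fetch_ror_statements("my-sync-script/0.1 (contact address)")` would need network access and a descriptive User-Agent; `parse_bindings` can be exercised offline against a saved response.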

Deprecated Wikidata entries and non-active ROR records should probably be ignored for syncing purposes, though I don't think I've been entirely consistent on that (the active/inactive status thing is relatively new).
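A rough Python sketch of that filtering rule, for illustration only. The input shapes are assumptions: ROR records are taken to be dicts with an `id` URL and a `status` field, and the Wikidata rows are taken to be dicts with `qid`, `ror` and `deprecated` keys. All function names are hypothetical.

```python
def bare_ror_id(ror_url):
    """Reduce 'https://ror.org/xxxx' to the bare 'xxxx' form (assumed to be what P6782 stores)."""
    return ror_url.rstrip("/").rsplit("/", 1)[-1]


def active_ror_ids(records):
    """Bare IDs of ROR records whose status is 'active'."""
    return {bare_ror_id(r["id"]) for r in records if r.get("status") == "active"}


def current_wikidata_rors(rows):
    """Map ROR ID -> Q-id, skipping statements with deprecated rank."""
    return {row["ror"]: row["qid"] for row in rows if not row["deprecated"]}


def missing_from_wikidata(records, rows):
    """Active ROR IDs that have no non-deprecated statement in Wikidata."""
    return sorted(active_ror_ids(records) - set(current_wikidata_rors(rows)))
```

Whether inactive records should stay excluded long term is an open question in this thread; the sketch just mirrors the "ignore anything that's not active" behaviour.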

There are actually many (around 1000?) cases where the Wikidata ID recorded in ROR differs from the Wikidata item that carries that ROR ID. In general the Wikidata record is probably correct - this often happens if the ROR ID was assigned before two Wikidata items were merged together. There are also cases where ROR has assigned a Wikidata ID based on a name match but it is actually the wrong item - for example a disambiguation item for several things with the same name. It would be nice to get those fixed in ROR; I think I sent a list to Adam a while ago... I do try to double-check those discrepancies for new mismatches with each release.
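To make the mismatch check concrete, here is a small Python sketch. It assumes the ROR schema v2 layout, where `external_ids` is a list of entries each having a `type` and an `all` list; that layout and both function names are assumptions for illustration, not a description of the scripts actually run for APSbot.

```python
def wikidata_ids_in_ror(record):
    """Collect the Wikidata Q-ids listed on a ROR record (assumed v2 'external_ids' layout)."""
    for external in record.get("external_ids", []):
        if external.get("type") == "wikidata":
            return set(external.get("all", []))
    return set()


def find_mismatches(records, ror_to_qid):
    """Return (ror_id, qids_in_ror, qid_in_wikidata) where the two sides disagree.

    ror_to_qid maps a bare ROR ID to the Q-id of the Wikidata item that
    carries it. Records missing from either side are skipped, since those
    are "not yet linked" cases rather than mismatches.
    """
    mismatches = []
    for record in records:
        ror_id = record["id"].rstrip("/").rsplit("/", 1)[-1]
        qid_in_wikidata = ror_to_qid.get(ror_id)
        qids_in_ror = wikidata_ids_in_ror(record)
        if qid_in_wikidata and qids_in_ror and qid_in_wikidata not in qids_in_ror:
            mismatches.append((ror_id, sorted(qids_in_ror), qid_in_wikidata))
    return mismatches
```

The output only flags candidates; deciding which side is right (merged item, redirect, disambiguation page) still needs the kind of manual review described above.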

For (active) ROR id's that are NOT yet listed in Wikidata there are then three cases:

The last thing I do is a double-check to see if any of the new items I added actually do seem to be duplicates of existing items; hopefully this doesn't happen and can be avoided by better checking for duplicates ahead of time - the most common case here is that I missed a name match for a non-English name. In the case of accidental duplicates the items can be relatively easily merged.

adambuttrick commented 4 months ago

Related/overlapping issue: https://github.com/ror-community/ror-roadmap/issues/193.

arthurpsmith commented 4 months ago

FYI I just updated my GitHub repo for this - it was slightly out of date with respect to what I'm actually running right now. In particular, the old version hadn't accounted for 'status'. The new one ignores anything that's not 'active', which may not be what you want long term but was really all I needed.

Would you like me to generate a list of ROR-Wikidata ID mismatches (where the Wikidata ID in ROR is not the item that has that ROR ID in Wikidata)? There may be some additional metadata (class of the Wikidata item listed in ROR, or whether it's a redirect) that would be helpful to judge whether these should be changed...

arthurpsmith commented 1 month ago

GitHub repo updated again to use ROR schema 2 data.