osmlab / name-suggestion-index

Canonical common brand names, operators, transit and flags for OpenStreetMap.
https://nsi.guide
BSD 3-Clause "New" or "Revised" License

Wikidata sync script doesn't appear to search for all discrepancies #9972

Open Snowysauce opened 2 months ago

Snowysauce commented 2 months ago

I was working on attaching Wikidata IDs to transit networks when I came across something odd: the Wikidata item (Q55597931) for the French network liO already had the property P8253 (OSM Name Suggestion Index ID) correctly set with liO's current NSI IDs, even though the network:wikidata tag is not present in the tags for liO in data\transit\route\bus.json and Q55597931 is not present in dist\wikidata.json. It turns out that the property was manually added to the Wikidata item back in March. The fact that npm run wikidata 1) does not remove this custom property addition that is essentially isolated from the NSI, nor 2) add the Wikidata QID to the relevant item in the data folder upon discovering that it's missing, made me curious as to what else is slipping through the cracks.

The problems

As a test, I gathered every QID in dist\wikidata.json and compared that list with the first 500 QIDs that link to Wikidata property P8253. Even in just the first 500 QIDs (out of several thousand), there was one item that linked to P8253 without having an entry in dist\wikidata.json: Q125054. I looked at the item in question, and indeed, the property is present on the item with links to long-obsolete NSI IDs dating back to a time when Aldi had a consolidated entry in the NSI. (The property and IDs were added using the normal method in May 2023, but apparently were never removed when the Aldi IDs were split and pointed to other Wikidata items.)
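For illustration, a comparison like this could be sketched in Python as follows. This is only a minimal sketch and not the NSI's own code (the NSI scripts are JavaScript). `collect_qids` and `orphaned_qids` are hypothetical helper names, and rather than assuming a particular layout for dist\wikidata.json, the sketch simply treats every object key that looks like a QID as an entry.

```python
import re

# A QID is the letter Q followed by a positive integer, e.g. Q125054.
QID = re.compile(r"Q[1-9]\d*")


def collect_qids(node):
    """Recursively gather every dictionary key that looks like a QID."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if QID.fullmatch(key):
                found.add(key)
            found |= collect_qids(value)
    elif isinstance(node, list):
        for item in node:
            found |= collect_qids(item)
    return found


def orphaned_qids(linked_titles, nsi_json):
    """Return QIDs that link to P8253 but have no entry in the parsed NSI JSON.

    linked_titles: page titles that link to Property:P8253 (non-QID titles are ignored).
    nsi_json: the parsed contents of dist\\wikidata.json (layout is NOT assumed).
    """
    linked = {title for title in linked_titles if QID.fullmatch(title)}
    missing = linked - collect_qids(nsi_json)
    return sorted(missing, key=lambda qid: int(qid[1:]))


# In-memory example; Q999 is a made-up placeholder entry.
print(orphaned_qids(["Q125054", "Q999"], {"wikidata": {"Q999": {}}}))
# prints ['Q125054']
```

With the real file, `nsi_json` would come from `json.load()` on dist\wikidata.json, and `linked_titles` from whatever list of P8253 links has been collected.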

Another kind of discrepancy is one I encountered myself a few weeks back when I adjusted the entries for the bus networks operated by Rochester-Genesee Regional Transportation Authority (RGRTA). The Wikidata QID for RGRTA was moved from network:wikidata to operator:wikidata as part of the changes. While npm run wikidata did correctly add the bus network NSI IDs and P8253 to the new Wikidata pages for each individual network, it did not remove the NSI IDs and P8253 from RGRTA's Wikidata item. The IDs on the RGRTA item ultimately had to be removed manually once the code changes went live, as OSM's iD editor somehow got confused by having the same NSI ID for the bus networks on two separate Wikidata items: RGRTA and RTS/RTS Genesee/etc. Until I removed the NSI IDs from the RGRTA Wikidata item, the iD editor kept cycling between the old tag presets from the previous release and the new presets from the then-current release.

Similar code changes are awaiting deployment for the Syracuse-based Centro bus operator and its networks, and as of now I expect to have to manually remove the NSI IDs and P8253 from Centro's Wikidata page when the current code is released.
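To make the duplicate case concrete, here is a minimal Python sketch of how the same NSI ID appearing on two Wikidata items could be detected. It assumes the P8253 values for each item have already been collected into a dictionary. `duplicated_nsi_ids` is a hypothetical name, and the QIDs and NSI IDs in the example are made-up placeholders, not real data.

```python
from collections import defaultdict


def duplicated_nsi_ids(claims):
    """Find NSI IDs that are attached to more than one Wikidata item.

    claims: dict mapping a QID to the P8253 values found on that item.
    Returns a dict mapping each duplicated NSI ID to the sorted QIDs carrying it.
    """
    owners = defaultdict(set)
    for qid, nsi_ids in claims.items():
        for nsi_id in nsi_ids:
            owners[nsi_id].add(qid)
    return {nsi_id: sorted(qids) for nsi_id, qids in owners.items() if len(qids) > 1}


# Placeholder data: an operator item (Q100) still carries an ID that
# now belongs to one of its networks (Q200).
claims = {
    "Q100": ["network-aaaaaa"],
    "Q200": ["network-aaaaaa", "network-bbbbbb"],
}
print(duplicated_nsi_ids(claims))
# prints {'network-aaaaaa': ['Q100', 'Q200']}
```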

Possible resolution methods

Although the obsolete and duplicate links to NSI IDs on Wikidata could be manually filtered out by comparing the QIDs in dist\wikidata.json to those listed in the "What links here" for Wikidata property P8253, I feel that this is a task better suited for a script. There are over 20,000 QIDs in dist\wikidata.json, and several thousand links on Wikidata to P8253, making the comparison a time-consuming task if done by hand. (EDIT: there are about 20,800 QIDs in the former and 20,250 links to the latter.) A complete list of links to P8253 can be retrieved by script using the MediaWiki API, although I can't remember the precise code since it's been many years since I've used that API. (I was an administrator on the English Wikipedia for a few years under a different username until I retired and exercised my right to vanish.)

I would suggest adding the above check to npm run wikidata, after it reads the QIDs present in the data folder, unless this would break the iD editor for NSI entries that are edited to point from one Wikidata item to another in between releases of the NSI. If that is the case, then I would suggest making the check part of npm run dist instead.

Snowysauce commented 1 month ago

I filtered out most of the discrepancies by hand a few days ago. I don't remember the exact numbers, but I think there were around 60 orphaned uses of P8253. IMO it'd still be a good idea to eventually work the checks I described into one of the npm scripts.

UKChris-osm commented 1 month ago

> npm run wikidata 1) does not remove this custom property addition that is essentially isolated from the NSI, nor 2) add the Wikidata QID to the relevant item in the data folder upon discovering that it's missing, made me curious as to what else is slipping through the cracks.

But in this case, if the Wikidata reference isn't in the NSI data file, the script wouldn't have a point of reference to find and check the liO entry, so it wouldn't be able to remove the custom property or update the data folder, as it wouldn't be able to find the Wikidata item in the first place.

The only way to find orphaned Wikidata entries that have the NSI property attached to them would be to look for the property within the entire Wikidata database.

It would be good if the script could factor in changes on our side though, such as moving a Wikidata QID to a different category, and update Wikidata accordingly.

Snowysauce commented 1 month ago

> The only way to find orphaned Wikidata entries that have the NSI property attached to them would be to look for the property within the entire Wikidata database.

Indeed, and that's where https://www.wikidata.org/wiki/Special:WhatLinksHere/Property:P8253 comes into play. This link is a "clean" version of the results; a raw list that's more tailored for developers and scripts is available through the MediaWiki API (e.g., https://www.wikidata.org/wiki/Special:ApiSandbox#action=query&format=json&list=backlinks&formatversion=2&bltitle=Property%3AP8253, which then needs to be run in a loop to get all links). The results can then be stored and modified by the NSI script in whatever way is needed for the script to compare QID usage.
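As a rough idea of what that loop could look like, here is a minimal Python sketch built on the same `list=backlinks` query as the sandbox link. It is an illustration only, not tested against the live API as part of this thread. `fetch_json` and `backlink_titles` are hypothetical names, the User-Agent string is a placeholder, and `blnamespace=0` is my own addition to restrict the results to items.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the Wikidata API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent; a real script should identify itself properly.
    request = urllib.request.Request(url, headers={"User-Agent": "nsi-backlink-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def backlink_titles(title, fetch=fetch_json):
    """Collect the titles of all pages linking to `title`, following continuation."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "list": "backlinks",
        "bltitle": title,
        "blnamespace": "0",  # assumption: only main-namespace items are wanted
        "bllimit": "max",
    }
    titles = []
    while True:
        reply = fetch(params)
        titles.extend(page["title"] for page in reply["query"]["backlinks"])
        if "continue" not in reply:
            return titles
        # Merge the continuation values into the next request, as the API expects.
        params = {**params, **reply["continue"]}


# Usage (performs network requests, so it is left commented out here):
# qids = backlink_titles("Property:P8253")
```

The `fetch` parameter exists only so the paging logic can be exercised without touching the network.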

> But in this case, if the Wikidata reference isn't in the NSI data file, the script wouldn't have a point of reference to find and check the liO entry, so it wouldn't be able to remove the custom property or update the data folder, as it wouldn't be able to find the Wikidata item in the first place.

Very good point. After considering this, I think the most a script could do from our end is find orphaned entries on Wikidata (via the above method) and either 1) remove the property from the Wikidata page or 2) warn about their presence in the same way that the script warns about other problems, like deleted Wikidata items.
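For the warning option, something along these lines might work. This is a minimal Python sketch under stated assumptions: both inputs are assumed to have been gathered already, `stale_claim_warnings` is a hypothetical name, the message wording is invented rather than taken from the existing script, and the IDs in the example are placeholders.

```python
def stale_claim_warnings(expected, actual):
    """Build warning lines for P8253 values that the NSI no longer expects.

    expected: dict mapping QID -> NSI IDs that the NSI data says belong on that item.
    actual:   dict mapping QID -> P8253 values currently present on Wikidata.
    """
    warnings = []
    for qid in sorted(actual):
        extra = sorted(set(actual[qid]) - set(expected.get(qid, ())))
        # "orphaned": the NSI does not reference the item at all;
        # "stale": the item is referenced, but not with this NSI ID.
        kind = "orphaned" if qid not in expected else "stale"
        for nsi_id in extra:
            warnings.append(f"{kind}: {qid} still carries P8253 = {nsi_id}")
    return warnings


# Placeholder data, not real NSI IDs.
expected = {"Q1": {"brand-111111"}}
actual = {"Q1": {"brand-111111", "brand-222222"}, "Q2": {"brand-333333"}}
for line in stale_claim_warnings(expected, actual):
    print(line)
# prints:
# stale: Q1 still carries P8253 = brand-222222
# orphaned: Q2 still carries P8253 = brand-333333
```

Removing the property automatically would need write access to Wikidata, so a warning-only pass like this is the more conservative starting point.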