Open jpatokal opened 9 months ago
SPARQL, how do I hate thee? Let me count the ways.
# Name, Alias, IATA, ICAO, Callsign, Country
SELECT ?airline ?airlineLabel ?iata ?icao ?callsign ?countryLabel ?countryIso
(xsd:date(?start) AS ?startDate) (xsd:date(?end) AS ?endDate) # format as ISO dates
WHERE
{
?airline wdt:P31 wd:Q46970 . # instance of airline
?airline wdt:P229 ?iata . # IATA code not optional
OPTIONAL{?airline wdt:P230 ?icao .}
OPTIONAL{?airline wdt:P571 ?start .}
OPTIONAL{?airline wdt:P576 ?end .}
OPTIONAL { ?airline wdt:P17 ?country .
?country wdt:P297 ?countryIso }
OPTIONAL{?airline wdt:P432 ?callsign .}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Here are dry runs against prod, both the full output and a grep of rename operations only -- please find bugs! If there's no major/systemic problems I'll pull the trigger in prod tomorrow, which will update 1296 airlines.
Woo; wikidata. I've pinged some people at Wikimedia Deutschland to let them know about the move to using wikidata :)
Here are dry runs against prod, both the full output and a grep of rename operations only -- please find bugs! If there's no major/systemic problems I'll pull the trigger in prod tomorrow, which will update 1296 airlines.
from the rename log... Not reviewed every line, but this stuck out to me:
>> MATCH WITH RENAME: from Myanma Airways to Myanmar National Airlines
>> MATCH WITH RENAME: from Myanma Airways to Union of Burma Airways
Not sure we can rename it twice? Or is there some hidden/simplified characters in there not being displayed? Or more likely, two different IATA airline codes for the same "display" name, at least, in the OpenFlights DB?
If it is that, we should probably output the IATA and/or the ICAO codes for the airlines in the logs
Then we rename something to the old name; which is fine itself, I think.
>> MATCH WITH RENAME: from Burma Airways to Myanma Airways
The updated log looks more reasonable though... I guess the grep/rename shows the condensed version.
BUT.
> MATCH Myanma Airways (UB/UBA, Wikidata) == Myanma Airways (UB/UBA, 3569)
UPDATE airlines SET source='Wikidata', start_date='1989-04-01', end_date='2014-12-01', active='N' WHERE alid=3569
...
>> MATCH WITH RENAME: from Myanma Airways to Union of Burma Airways
> MATCH Union of Burma Airways (UB/UBA, Wikidata) == Myanma Airways (UB/UBA, 3569)
>> UPDATED: field name from Myanma Airways to Union of Burma Airways
UPDATE airlines SET name='Union of Burma Airways', alias='Myanma Airways', start_date='1948-12-15', end_date='1972-12-01' WHERE alid=3569
>> MATCH WITH RENAME: from Union of Burma Airways to Burma Airways
> MATCH Burma Airways (UB/UBA, Wikidata) == Union of Burma Airways (UB/UBA, 3569)
>> UPDATED: field name from Union of Burma Airways to Burma Airways
UPDATE airlines SET name='Burma Airways', alias='Union of Burma Airways', start_date='1972-12-01', end_date='1989-04-01' WHERE alid=3569
>> MATCH WITH RENAME: from Burma Airways to Myanma Airways
> MATCH Myanma Airways (UB/UBA, Wikidata) == Burma Airways (UB/UBA, 3569)
>> UPDATED: field name from Burma Airways to Myanma Airways
UPDATE airlines SET name='Myanma Airways', alias='Burma Airways', start_date='1989-04-01', end_date='2014-12-01' WHERE alid=3569
>> MATCH WITH RENAME: from Myanma Airways to Myanmar National Airlines
> MATCH Myanmar National Airlines (UB/UBA, Wikidata) == Myanma Airways (UB/UBA, 3569)
>> UPDATED: field name from Myanma Airways to Myanmar National Airlines
UPDATE airlines SET name='Myanmar National Airlines', alias='Myanma Airways', start_date='2014-12-01' WHERE alid=3569
And trim out the comments...
UPDATE airlines SET source='Wikidata', start_date='1989-04-01', end_date='2014-12-01', active='N' WHERE alid=3569
UPDATE airlines SET name='Union of Burma Airways', alias='Myanma Airways', start_date='1948-12-15', end_date='1972-12-01' WHERE alid=3569
UPDATE airlines SET name='Burma Airways', alias='Union of Burma Airways', start_date='1972-12-01', end_date='1989-04-01' WHERE alid=3569
UPDATE airlines SET name='Myanma Airways', alias='Burma Airways', start_date='1989-04-01', end_date='2014-12-01' WHERE alid=3569
UPDATE airlines SET name='Myanmar National Airlines', alias='Myanma Airways', start_date='2014-12-01' WHERE alid=3569
As they all have the same alid
...
We can see it makes a few different updates (at least 3 of which are no-ops by the end; not the end of the world, but not the most useful/performant), and will result in a row that (incorrectly) has an end date, for the current name.
It will also incorrectly be marked as active='N'
from the initial SQL query, which isn't what we want, I imagine
It kinda looks like we need multiple aliases for an airline (which is probably best served via a separate table, for ease of searchability, rather than storing multiple in one column and the fun that might bring) and can store start/end dates there too...
I don't think we need aliases: the ideal end state would be a chain of airlines with start/end dates that smoothly cover the entire date range. Unfortunately Wikidata is quite inconsistent here, there's at least three different relations for inactive/followed by/replaced by and particularly for obscure stuff like Myanmar the data quality is poor anyway.
And the other problem is that the updater script mostly matches and renames the existing entry, which isn't consistent with this. So I think it's back to the drawing board here.
Unfortunately Wikidata is quite inconsistent here, there's at least three different relations for inactive/followed by/replaced by and particularly for obscure stuff like Myanmar the data quality is poor anyway.
The joys of crowd sourcing...
Some of this could be unintentional; different people adding different things for different reasons. But they could be different succession types too, to be really unhelpful.
Depending on how wide spread the issue is, I could bring it up on the wiki, and see if we can get any normalisation etc. Or just fix them myself.
I don't think we need aliases: the ideal end state would be a chain of airlines with start/end dates that smoothly cover the entire date range.
Aha, so if that was the intention, we don't want to be always updating using the same airline id, but would potentially also mean updating airline ids in existing flights depending on the date...
Yup, updating the airlines to match dates would have to be retrofitted later. This seems like it should be quite straightforward, although there's still some nuance to eg. picking the passenger airline instead of the cargo operation...
Airline data on the live site is badly out of date because IATA codes are frequently reused and it is not possible to match airlines to flight dates.
This issue tracks extracting start (creation) and end (dissolution) dates from Wikidata and uploading this to the live DB.