jpatokal / openflights

Website for storing flight information, rendering paths on a zoomable world map and calculating statistics, with plenty of free airline, airport and route data.
http://openflights.org
GNU Affero General Public License v3.0
1.37k stars 387 forks source link

Annotate airline data with start/end dates #1439

Open jpatokal opened 9 months ago

jpatokal commented 9 months ago

Airline data on the live site is badly out of date because IATA codes are frequently reused and it is not possible to match airlines to flight dates.

This issue tracks extracting start (creation) and end (dissolution) dates from Wikidata and uploading this to the live DB.

jpatokal commented 9 months ago

SPARQL, how do I hate thee? Let me count the ways.

# Name, Alias, IATA, ICAO, Callsign, Country
SELECT ?airline ?airlineLabel ?iata ?icao ?callsign ?countryLabel ?countryIso
  (xsd:date(?start) AS ?startDate) (xsd:date(?end) AS ?endDate) # format as ISO dates
WHERE 
{
  ?airline wdt:P31 wd:Q46970 .  # instance of airline
  ?airline wdt:P229 ?iata .  # IATA code not optional
  OPTIONAL{?airline wdt:P230 ?icao .}
  OPTIONAL{?airline wdt:P571 ?start .}
  OPTIONAL{?airline wdt:P576 ?end .}
  OPTIONAL { ?airline wdt:P17 ?country .
             ?country wdt:P297 ?countryIso }
  OPTIONAL{?airline wdt:P432 ?callsign .}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

https://w.wiki/7PhS

jpatokal commented 9 months ago

Here are dry runs against prod, both the full output and a grep of rename operations only -- please find bugs! If there's no major/systemic problems I'll pull the trigger in prod tomorrow, which will update 1296 airlines.

update_airlines.log

renamed_airlines.log

reedy commented 9 months ago

Woo; wikidata. I've pinged some people at Wikimedia Deutschland to let them know about the move to using wikidata :)

reedy commented 9 months ago

Here are dry runs against prod, both the full output and a grep of rename operations only -- please find bugs! If there's no major/systemic problems I'll pull the trigger in prod tomorrow, which will update 1296 airlines.

update_airlines.log

renamed_airlines.log

from the rename log... Not reviewed every line, but this stuck out to me:

>> MATCH WITH RENAME: from Myanma Airways to Myanmar National Airlines
>> MATCH WITH RENAME: from Myanma Airways to Union of Burma Airways

Not sure we can rename it twice? Or is there some hidden/simplified characters in there not being displayed? Or more likely, two different IATA airline codes for the same "display" name, at least, in the OpenFlights DB?

If it is that, we should probably output the IATA and/or the ICAO codes for the airlines in the logs

Then we rename something to the old name; which is fine itself, I think.

>> MATCH WITH RENAME: from Burma Airways to Myanma Airways
reedy commented 9 months ago

Screenshot 2023-09-07 at 17 34 02

reedy commented 9 months ago

The updated log looks more reasonable though... I guess the grep/rename shows the condensed version.

BUT.

> MATCH Myanma Airways (UB/UBA, Wikidata) == Myanma Airways (UB/UBA, 3569)
UPDATE airlines SET source='Wikidata', start_date='1989-04-01', end_date='2014-12-01', active='N' WHERE alid=3569
...
>> MATCH WITH RENAME: from Myanma Airways to Union of Burma Airways
> MATCH Union of Burma Airways (UB/UBA, Wikidata) == Myanma Airways (UB/UBA, 3569)
>> UPDATED: field name from Myanma Airways to Union of Burma Airways
UPDATE airlines SET name='Union of Burma Airways', alias='Myanma Airways', start_date='1948-12-15', end_date='1972-12-01' WHERE alid=3569
>> MATCH WITH RENAME: from Union of Burma Airways to Burma Airways
> MATCH Burma Airways (UB/UBA, Wikidata) == Union of Burma Airways (UB/UBA, 3569)
>> UPDATED: field name from Union of Burma Airways to Burma Airways
UPDATE airlines SET name='Burma Airways', alias='Union of Burma Airways', start_date='1972-12-01', end_date='1989-04-01' WHERE alid=3569
>> MATCH WITH RENAME: from Burma Airways to Myanma Airways
> MATCH Myanma Airways (UB/UBA, Wikidata) == Burma Airways (UB/UBA, 3569)
>> UPDATED: field name from Burma Airways to Myanma Airways
UPDATE airlines SET name='Myanma Airways', alias='Burma Airways', start_date='1989-04-01', end_date='2014-12-01' WHERE alid=3569
>> MATCH WITH RENAME: from Myanma Airways to Myanmar National Airlines
> MATCH Myanmar National Airlines (UB/UBA, Wikidata) == Myanma Airways (UB/UBA, 3569)
>> UPDATED: field name from Myanma Airways to Myanmar National Airlines
UPDATE airlines SET name='Myanmar National Airlines', alias='Myanma Airways', start_date='2014-12-01' WHERE alid=3569

And trim out the comments...

UPDATE airlines SET source='Wikidata', start_date='1989-04-01', end_date='2014-12-01', active='N' WHERE alid=3569
UPDATE airlines SET name='Union of Burma Airways', alias='Myanma Airways', start_date='1948-12-15', end_date='1972-12-01' WHERE alid=3569
UPDATE airlines SET name='Burma Airways', alias='Union of Burma Airways', start_date='1972-12-01', end_date='1989-04-01' WHERE alid=3569
UPDATE airlines SET name='Myanma Airways', alias='Burma Airways', start_date='1989-04-01', end_date='2014-12-01' WHERE alid=3569
UPDATE airlines SET name='Myanmar National Airlines', alias='Myanma Airways', start_date='2014-12-01' WHERE alid=3569

As they all have the same alid...

We can see it makes a few different updates (at least 3 of which are no-ops by the end; not the end of the world, but not the most useful/performant), and will result in a row that (incorrectly) has an end date, for the current name.

It will also incorrectly be marked as active='N' from the initial SQL query, which isn't what we want, I imagine

It kinda looks like we need multiple aliases for an airline (which is probably best served via a separate table, for ease of searchability, rather than storing multiple in one column and the fun that might bring) and can store start/end dates there too...

jpatokal commented 9 months ago

I don't think we need aliases: the ideal end state would be a chain of airlines with start/end dates that smoothly cover the entire date range. Unfortunately Wikidata is quite inconsistent here, there's at least three different relations for inactive/followed by/replaced by and particularly for obscure stuff like Myanmar the data quality is poor anyway.

And the other problem is that the updater script mostly matches and renames the existing entry, which isn't consistent with this. So I think it's back to the drawing board here.

reedy commented 9 months ago

Unfortunately Wikidata is quite inconsistent here, there's at least three different relations for inactive/followed by/replaced by and particularly for obscure stuff like Myanmar the data quality is poor anyway.

The joys of crowd sourcing...

Some of this could be unintentional; different people adding different things for different reasons. But they could be different succession types too, to be really unhelpful.

Depending on how wide spread the issue is, I could bring it up on the wiki, and see if we can get any normalisation etc. Or just fix them myself.

I don't think we need aliases: the ideal end state would be a chain of airlines with start/end dates that smoothly cover the entire date range.

Aha, so if that was the intention, we don't want to be always updating using the same airline id, but would potentially also mean updating airline ids in existing flights depending on the date...

jpatokal commented 9 months ago

Yup, updating the airlines to match dates would have to be retrofitted later. This seems like it should be quite straightforward, although there's still some nuance to eg. picking the passenger airline instead of the cargo operation...