uncefact / spec-jsonld

Exposing the UN/CEFACT vocabulary as web semantics
https://service.unece.org/trade/uncefact/vocabulary/uncefact/

Duplicates in UN/LOCODE list #115

Closed kshychko closed 1 year ago

kshychko commented 2 years ago

It has been discovered that there are duplicates in the output stream files for UN/LOCODE publication: https://unece.org/trade/cefact/UNLOCODE-Download

Attached is the CSV file with the duplicates. Most of them differ only by name (the two name variants swap places):
AX | MHQ | AXMHQ | Maarianhamina (Mariehamn) | Maarianhamina (Mariehamn) | | 1--4---- | AI | 1207 | | 6006N 01957E
AX | MHQ | AXMHQ | Mariehamn (Maarianhamina) | Mariehamn (Maarianhamina) | | 1--4---- | AI | 1207 | | 6006N 01957E

But there are cases where the IATA code is different:
US | EWB | USEWB | Fall River-New Bedford Apt | Fall River-New Bedford Apt | MA | 1--4---- | AI | 201
US | EWB | USEWB | New Bedford-Fall River Apt | New Bedford-Fall River Apt | MA | 1--4---- | AI | 9506

And there are even cases with different subdivisions and function lists:
US | LEB | USLEB | Hanover-Lebanon-White River Apt | Hanover-Lebanon-White River Apt | NH | --34---- | AI | 307 | | 4338N 07215W
US | LEB | USLEB | Lebanon-White River-Hanover Apt | Lebanon-White River-Hanover Apt | VT | ---4---- | AI | 9601 | |
US | LEB | USLEB | White River-Hanover-Lebanon Apt | White River-Hanover-Lebanon Apt | VT | ---4---- | AI | 1 | |

The example above also has one record with a geolocation defined and two without it, which is probably related to the difference.

The first two examples are confusing but can be explained; the third looks like wrong data to me. My question is how to generate a JSON-LD entity for such records: which values to take and which to ignore. In the first two cases, since we don't include the IATA code, we can probably pick the first record and ignore the rest, but the third example is more difficult to handle.

@cmsdroff, could you please comment on this?
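To make the three kinds of duplicates above easier to compare, here is a minimal sketch that groups pipe-delimited rows by their combined code and reports which columns differ within each group. The column names in `COLUMNS` are assumptions based on the published UN/LOCODE column layout and the rows quoted in this issue; they may not match the project's actual output files, and `parse_rows` and `find_duplicates` are hypothetical helper names.

```python
from collections import defaultdict

# Assumed column order (not confirmed by the project's output files).
COLUMNS = [
    "country", "location", "locode", "name", "name_wo_diacritics",
    "subdivision", "function", "status", "date", "iata", "coordinates",
]


def parse_rows(text):
    """Parse pipe-delimited rows, padding short rows with empty strings."""
    rows = []
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        cells += [""] * (len(COLUMNS) - len(cells))
        rows.append(dict(zip(COLUMNS, cells)))
    return rows


def find_duplicates(rows):
    """Map each repeated LOCODE to the list of columns whose values differ."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["locode"]].append(row)
    report = {}
    for code, members in groups.items():
        if len(members) > 1:
            report[code] = [
                col for col in COLUMNS
                if len({member[col] for member in members}) > 1
            ]
    return report


sample = """
US | EWB | USEWB | Fall River-New Bedford Apt | Fall River-New Bedford Apt | MA | 1--4---- | AI | 201
US | EWB | USEWB | New Bedford-Fall River Apt | New Bedford-Fall River Apt | MA | 1--4---- | AI | 9506
"""
report = find_duplicates(parse_rows(sample))
print(report)
```

A report like this separates the harmless cases (only the name columns differ) from the ones that need a decision (subdivision or function differs).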

mikaelgu80 commented 2 years ago

The swapping of the names for some places comes directly from the UN/LOCODE guidance/specifications: https://service.unece.org/trade/locode/Service/LocodeColumn.htm#Name. It is very problematic that no language code is attached to the names and that alternative names are handled this way instead. E.g., in Finland (including the Åland Islands) many places have names in both Finnish and Swedish, but there is currently no way to distinguish which entry is which. I know this doesn't tackle the issue itself, but it may at least provide some background for one of the situations mentioned above.

Mikael Gustafsson

cmsdroff commented 2 years ago

In my opinion, and that of others, this shows why IATA should be a child code of UNLOCODE, like others; it doesn't really fit into the way UNLOCODEs are published and handled. IATA maintains the codes independently and many cities have multiple airports; Paris is another example. I'm not sure what we can do about these except publish them as is, since it's the IATA code that makes each record unique.

But there are cases when IATA code is different:
US | EWB | USEWB | Fall River-New Bedford Apt | Fall River-New Bedford Apt | MA | 1--4---- | AI | 201
US | EWB | USEWB | New Bedford-Fall River Apt | New Bedford-Fall River Apt | MA | 1--4---- | AI | 9506

This one is a good case where we could use labels in the linked data to indicate different names or languages for the same place. The challenge is identifying them.

AX | MHQ | AXMHQ | Maarianhamina (Mariehamn) | Maarianhamina (Mariehamn) | | 1--4---- | AI | 1207 | | 6006N 01957E
AX | MHQ | AXMHQ | Mariehamn (Maarianhamina) | Mariehamn (Maarianhamina) | | 1--4---- | AI | 1207 | | 6006N 01957E

I will raise this example with the UNLOCODE Maintenance team for guidance. Although the records have different IATA codes, I believe the functions should be aligned if the UNLOCODE is the same; otherwise we are differentiating what is at an airport, which isn't in scope for UNLOCODE. The coordinates could differ, but only because they point at the airport; should they really point at the UNLOCODE? Raising elsewhere.

US | LEB | USLEB | Hanover-Lebanon-White River Apt | Hanover-Lebanon-White River Apt | NH | --34---- | AI | 307 | | 4338N 07215W
US | LEB | USLEB | Lebanon-White River-Hanover Apt | Lebanon-White River-Hanover Apt | VT | ---4---- | AI | 9601 | |
US | LEB | USLEB | White River-Hanover-Lebanon Apt | White River-Hanover-Lebanon Apt | VT | ---4---- | AI | 1 | |

Hope this helps. As @mikaelgu80 mentioned above, the language issue could be handled using the language indicator in the LD, but it would require a clean-up of the UNLOCODEs.
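As a rough illustration of the label idea, the sketch below builds one JSON-LD node per LOCODE and attaches every name variant as an `rdfs:label` value, adding `@language` only when a language is known. The `urn:example:unlocode:` identifier scheme and the `to_entity` helper are hypothetical, the choice of `rdfs:label` is an assumption rather than what this project's vocabulary uses, and the language tags would have to come from a manual clean-up since the published list carries none.

```python
import json

RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def to_entity(locode, names):
    """Build one JSON-LD node from (name, language) pairs.

    language may be None when it is unknown, which is the case for the
    rows quoted in this issue.
    """
    labels = []
    seen = set()
    for name, language in names:
        if (name, language) in seen:
            continue  # skip exact repeats of the same label
        seen.add((name, language))
        label = {"@value": name}
        if language:
            label["@language"] = language
        labels.append(label)
    return {
        "@context": {"rdfs": RDFS},
        "@id": f"urn:example:unlocode:{locode}",  # hypothetical identifier
        "rdfs:label": labels,
    }


entity = to_entity("AXMHQ", [
    ("Maarianhamina (Mariehamn)", None),
    ("Mariehamn (Maarianhamina)", None),
])
print(json.dumps(entity, indent=2, ensure_ascii=False))
```

With this shape, the two AXMHQ rows collapse into a single entity with two labels instead of two competing entities.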

nissimsan commented 2 years ago

@kshychko, this is not a problem for this project to solve. Please focus on publishing what we're getting.

nissimsan commented 2 years ago

@kshychko, we have to publish unique LOCODEs, so please go with the simplest rule: pick the first occurrence of each LOCODE in the list and ignore subsequent repetitions.
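The rule above can be sketched as a single pass that keeps the first row seen for each LOCODE. This is a minimal sketch, not the project's actual implementation; `first_unique` and the dictionary keys are hypothetical names, and the rows are abbreviated from the USLEB example.

```python
def first_unique(rows, key="locode"):
    """Keep the first row for each LOCODE, preserving input order."""
    seen = set()
    kept = []
    for row in rows:
        code = row[key]
        if code in seen:
            continue  # a later repetition: ignore it
        seen.add(code)
        kept.append(row)
    return kept


rows = [
    {"locode": "USLEB", "name": "Hanover-Lebanon-White River Apt", "subdivision": "NH"},
    {"locode": "USLEB", "name": "Lebanon-White River-Hanover Apt", "subdivision": "VT"},
    {"locode": "USLEB", "name": "White River-Hanover-Lebanon Apt", "subdivision": "VT"},
    {"locode": "AXMHQ", "name": "Maarianhamina (Mariehamn)", "subdivision": ""},
]
unique_rows = first_unique(rows)
print([row["name"] for row in unique_rows])
```

Note that the result depends on the order of the input list, so for USLEB the NH record wins only because it is listed first.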

kshychko commented 1 year ago

@cmsdroff, below is the list of duplicates I discovered while working on this issue: locodes-duplicates.csv

nissimsan commented 1 year ago

Let's report this upstream and close the issue. (This basically goes for all the semantics issues).

nissimsan commented 1 year ago

Let's submit this upstream and close