hbz / lobid-organisations

Transformation, web frontend, and API for lobid-organisations
http://lobid.org/organisations
Eclipse Public License 2.0

Maximize number of containedIn statements #253

Open acka47 opened 7 years ago

acka47 commented 7 years ago

Currently, ~6700 entries have a containedIn link to GeoNames. We get it by querying GeoNames with the Gemeindeschlüssel (the official municipality key); see the current morph-enriched, lines 331-335.

The link is missing for ~15k entries, see http://beta.lobid.org/organisations/search?q=_missing_:containedIn. As Mapzen also provides GeoNames data, we should probably query it with the address plus the Gemeindeschlüssel (if available).

fsteeg commented 7 years ago

This would be possible with the Mapzen API [1, 2], but it would significantly increase both our Mapzen call count and the transformation runtime. The current approach based on the Gemeindeschlüssel is much more efficient. Since we also want to improve RS/AGS coverage (see #134), I think we should stay with the current approach and try to improve the numbers from that side.

[1] https://search.mapzen.com/v1/search?text=50676+Köln&sources=geonames&layers=coarse
[2] http://www.geonames.org/2886242
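For illustration, a request like [1] could be assembled per record roughly as follows. This is a minimal sketch: the endpoint and parameters are copied from link [1], while the helper name `build_geonames_search_url` is hypothetical and no request is actually sent.

```python
from urllib.parse import urlencode

# Endpoint and parameter names are taken from link [1] in this comment.
SEARCH_ENDPOINT = "https://search.mapzen.com/v1/search"


def build_geonames_search_url(postal_code: str, city: str) -> str:
    """Build a coarse, GeoNames-only search URL for one address (hypothetical helper)."""
    params = {
        "text": f"{postal_code} {city}",  # free-text query, e.g. "50676 Köln"
        "sources": "geonames",            # restrict results to GeoNames data
        "layers": "coarse",               # places only, no street-level results
    }
    # urlencode percent-encodes non-ASCII characters and turns spaces into "+"
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_geonames_search_url("50676", "Köln")
print(url)
```

Each record without a containedIn link would need one such call, which is where the additional call count and runtime come from.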

fsteeg commented 7 years ago

With the increased number of AGS values on staging, we have more containedIn values, too:

http://beta.lobid.org/organisations/search?q=containedIn:
http://test.lobid.org/organisations/search?q=containedIn:

These are still way lower than the AGS numbers. Should there be a geonames value for every AGS? Are these missing in https://raw.githubusercontent.com/hbz/lookup-tables/master/data/geonames-map.csv? @SBRitter, where did you get that data from?

acka47 commented 7 years ago

Here is the current status from the perspective of entries missing a ContainedIn statement:

http://beta.lobid.org/organisations/search?q=_missing_:containedIn
http://test.lobid.org/organisations/search?q=_missing_:containedIn

SBRitter commented 7 years ago

@fsteeg, I'm sorry, I don't know and I can't find hints in the git log. Maybe from a HashMap by @dr0i? Going to think about it...

dr0i commented 7 years ago

The lodmill repo contains the file https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/resources/geonames_DE.csv. It's a simple CSV; the source must be http://download.geonames.org/export/dump/DE.zip. The CSV is transformed in the lodmill repo using Metamorph (https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/resources/morphGeonamesCsv2ld.xml) to create triples, which are linked via the Gemeindeschlüssel object (found in ISIL field 032P.n, see https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/resources/morph_zdb-isil-file-pica2ld.xml#L110).

dr0i commented 7 years ago

Ah, and the lodmill CSV has way more entries (~180k) than the one in the lookup-tables repo (~11k).

fsteeg commented 7 years ago

OK, trying to understand how to proceed here and how it's all connected. From https://github.com/hbz/lobid-organisations/issues/253#issuecomment-245848983 it seems the issue is that too many containedIn fields are missing. They should be created from the ags field, which is missing in fewer entries:

http://test.lobid.org/organisations/search?q=_missing_:containedIn
http://test.lobid.org/organisations/search?q=_missing_:ags

A potential solution would be to use https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/resources/geonames_DE.csv, which contains data occurring in 032P.n in its first column, instead of https://raw.githubusercontent.com/hbz/lookup-tables/master/data/geonames-map.csv.
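A minimal sketch of such a lookup, assuming a delimited file whose first column holds the key from 032P.n and whose second column holds the GeoNames ID. The delimiter, the column positions, the sample rows, and the URI pattern (borrowed from the GeoNames link earlier in this thread) are all assumptions for illustration and are not taken from the real file or from our transformation.

```python
import csv
import io

# Illustrative rows only, NOT copied from geonames_DE.csv.
# The second row uses placeholder values.
SAMPLE_CSV = "05315000;2886242\n99999999;0000000\n"


def load_lookup(csv_text, key_col=0, id_col=1, delimiter=";"):
    """Read a delimited table into a dict mapping key column -> GeoNames ID."""
    table = {}
    for row in csv.reader(io.StringIO(csv_text), delimiter=delimiter):
        if len(row) > max(key_col, id_col):  # skip empty or short rows
            table[row[key_col].strip()] = row[id_col].strip()
    return table


def contained_in(ags, table):
    """Return a GeoNames URI for the given key, or None if the table has no entry."""
    geonames_id = table.get(ags)
    # URI pattern is an assumption based on the GeoNames link in this thread.
    return f"http://www.geonames.org/{geonames_id}" if geonames_id else None


lookup = load_lookup(SAMPLE_CSV)
print(contained_in("05315000", lookup))  # key present in the sample table
print(contained_in("01001000", lookup))  # key absent, so no link is created
```

With a table built this way, swapping in the larger file should only change the input; entries whose key is still absent would remain without a containedIn value.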

fsteeg commented 7 years ago

In an offline discussion, @acka47 mentioned that we might actually not need containedIn at all.

Instead, #268 might be the relevant area to improve the data.

In any case, it seems this should not be blocking the launch.

acka47 commented 7 years ago

> In an offline discussion, @acka47 mentioned that we might actually not need containedIn at all.

Links to GeoNames address some interesting use cases. For example, you can get data on population size and run queries like "get all museums in places with < 10,000 residents". As the missing links will be easy to fix as soon as #268 is finished, we shouldn't close this issue.
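As a rough illustration of that use case, the following sketch joins organisations to population figures via their containedIn value. All names, identifiers, and numbers here are hypothetical stand-ins; it queries neither lobid nor GeoNames.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Organisation:
    name: str
    kind: str                    # e.g. "museum" or "library" (hypothetical field)
    contained_in: Optional[str]  # place identifier, None if the link is missing


# Hypothetical population figures keyed by place identifier.
POPULATION = {"place:small": 8_500, "place:large": 1_000_000}

ORGS = [
    Organisation("Village Museum", "museum", "place:small"),
    Organisation("City Museum", "museum", "place:large"),
    Organisation("Village Library", "library", "place:small"),
    Organisation("Unlinked Museum", "museum", None),
]


def museums_in_small_places(orgs, population, limit=10_000):
    """Names of museums whose linked place has fewer than `limit` residents."""
    return [
        org.name
        for org in orgs
        if org.kind == "museum"
        and org.contained_in in population  # entries without a link drop out
        and population[org.contained_in] < limit
    ]


result = museums_in_small_places(ORGS, POPULATION)
print(result)
```

Note that entries without a containedIn link can never match such a query, which is why the number of missing links matters.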

acka47 commented 7 years ago

With our own Pelias service running, we could do it as @fsteeg suggested in https://github.com/hbz/lobid-organisations/issues/253#issuecomment-244728544.

acka47 commented 7 years ago

See also https://github.com/lobid/lodmill/issues/488.