Open acka47 opened 8 years ago
Would be possible with the Mapzen API [1, 2], but that would significantly increase both our Mapzen call count and the transformation runtime. The current approach based on the Gemeindeschlüssel is much more efficient. Since we also want to improve RS/AGS (see #134), I think we should stay with the current approach and try to improve the numbers from that side.
[1] https://search.mapzen.com/v1/search?text=50676+Köln&sources=geonames&layers=coarse
[2] http://www.geonames.org/2886242
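For reference, a minimal sketch of how a request like [1] could be assembled per organisation. The endpoint and parameter names are copied from the link above; the function name `build_search_url` is only illustrative, and no request is actually sent. One such call per record is what would drive up the call count.

```python
from urllib.parse import urlencode

# Endpoint taken from link [1]; parameter names mirror that URL.
MAPZEN_SEARCH = "https://search.mapzen.com/v1/search"


def build_search_url(postal_code: str, city: str) -> str:
    """Build a coarse, GeoNames-only search URL for a postal code and city."""
    params = {
        "text": f"{postal_code} {city}",
        "sources": "geonames",
        "layers": "coarse",
    }
    # urlencode percent-encodes non-ASCII characters and turns spaces into '+'.
    return f"{MAPZEN_SEARCH}?{urlencode(params)}"


print(build_search_url("50676", "Köln"))
```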
With the increased number of AGS values on staging, we have more containedIn values, too:
http://beta.lobid.org/organisations/search?q=containedIn: http://test.lobid.org/organisations/search?q=containedIn:
These are still way lower than the AGS numbers. Should there be a geonames value for every AGS? Are these missing in https://raw.githubusercontent.com/hbz/lookup-tables/master/data/geonames-map.csv? @SBRitter, where did you get that data from?
Here is the current status from the perspective of entries missing a containedIn statement:
http://beta.lobid.org/organisations/search?q=_missing_:containedIn http://test.lobid.org/organisations/search?q=_missing_:containedIn
@fsteeg, I'm sorry, I don't know and I can't find hints in the git log. Maybe from a HashMap by @dr0i? Going to think about it...
The lodmill repo contains the file https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/resources/geonames_DE.csv. It's a simple CSV; the source must be http://download.geonames.org/export/dump/DE.zip. This CSV is transformed in the lodmill repo using metamorph (https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/resources/morphGeonamesCsv2ld.xml) to create triples, which are linked by the Gemeindeschlüssel object (found in ISIL field 032P.n, see https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/resources/morph_zdb-isil-file-pica2ld.xml#L110).
Ah, and the lodmill CSV has far more entries (~180k) than the one in the lookup-tables repo (~11k).
OK, trying to understand how to proceed here and how it's all connected. From https://github.com/hbz/lobid-organisations/issues/253#issuecomment-245848983 it seems the issue is that we are missing too many containedIn fields, which should be created from the ags field, for which fewer are missing:
http://test.lobid.org/organisations/search?q=_missing_:containedIn http://test.lobid.org/organisations/search?q=_missing_:ags
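To make the comparison concrete offline, here is a rough sketch that counts records lacking a field, roughly what the two `_missing_` queries report. The records are invented sample data, not real lobid entries, and `count_missing` is a hypothetical helper.

```python
def count_missing(records, field):
    """Count records where `field` is absent, None, or empty."""
    return sum(1 for record in records if not record.get(field))


# Invented sample records, not real lobid data.
records = [
    {"id": "org-1", "ags": "05315000", "containedIn": "http://www.geonames.org/2886242"},
    {"id": "org-2", "ags": "05315000"},
    {"id": "org-3"},
]

for field in ("ags", "containedIn"):
    print(field, count_missing(records, field))
```

Records like `org-2` in this sample are the interesting ones: they have an `ags` value but no `containedIn`, so the gap comes from the lookup table rather than from the source data.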
A potential solution would be to use https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/resources/geonames_DE.csv, which contains data occurring in 032P.n in its first column, instead of https://raw.githubusercontent.com/hbz/lookup-tables/master/data/geonames-map.csv.
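A minimal sketch of such a table lookup, under assumptions: the key sits in the first column (as stated above), while the position of the GeoNames ID, the delimiter, and the sample row are guesses for illustration only. The actual layout of geonames_DE.csv would need to be checked. `load_lookup` and `contained_in` are hypothetical names.

```python
import csv
import io


def load_lookup(csv_text, key_col=0, value_col=1, delimiter=","):
    """Read CSV text into a dict mapping the key column to the value column."""
    table = {}
    for row in csv.reader(io.StringIO(csv_text), delimiter=delimiter):
        if len(row) <= max(key_col, value_col):
            continue  # skip rows that are too short to hold both columns
        key = row[key_col].strip()
        if key:
            # Keep the first value seen for a key; later duplicates are ignored.
            table.setdefault(key, row[value_col].strip())
    return table


def contained_in(ags, table):
    """Return a GeoNames URI for a Gemeindeschlüssel, or None if unmapped."""
    geonames_id = table.get(ags)
    return f"http://www.geonames.org/{geonames_id}" if geonames_id else None


# Illustrative row only; the column layout is an assumption.
sample = "05315000,2886242\n"
table = load_lookup(sample)
print(contained_in("05315000", table))
print(contained_in("99999999", table))
```

Swapping the data source then only means changing the CSV text and, if needed, `value_col` and `delimiter`; the lookup itself stays the same.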
In an offline discussion, @acka47 mentioned that we might actually not need containedIn at all.
Instead, #268 might be the relevant area to improve the data.
In any case, it seems this should not be blocking the launch.
In an offline discussion, @acka47 mentioned that we might actually not need containedIn at all.
Links to GeoNames address some interesting use cases, e.g. you can get data on population size and make queries like: get all museums in places with < 10,000 residents. As this will be easy to fix as soon as #268 is finished, we shouldn't close it.
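As a sketch of that use case, assuming each organisation record carries a `type` and a `containedIn` URI, and that population figures have already been fetched from GeoNames into a dict. All field names, URIs, and numbers below are invented for illustration.

```python
def museums_in_small_places(orgs, population_by_place, max_residents=10_000):
    """Return names of museums whose linked place has fewer than max_residents."""
    result = []
    for org in orgs:
        if org.get("type") != "Museum":
            continue
        population = population_by_place.get(org.get("containedIn"))
        # Organisations without a containedIn link cannot be matched at all.
        if population is not None and population < max_residents:
            result.append(org["name"])
    return result


# Invented sample data: URIs and population figures are placeholders.
population_by_place = {"place:a": 4_200, "place:b": 1_000_000}
orgs = [
    {"name": "Village Museum", "type": "Museum", "containedIn": "place:a"},
    {"name": "City Museum", "type": "Museum", "containedIn": "place:b"},
    {"name": "Village Library", "type": "Library", "containedIn": "place:a"},
    {"name": "Unlinked Museum", "type": "Museum"},
]

print(museums_in_small_places(orgs, population_by_place))
```

The last sample record shows why the missing links matter: without `containedIn`, the museum silently drops out of such a query.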
With our own pelias service running, we could do it like @fsteeg suggested in https://github.com/hbz/lobid-organisations/issues/253#issuecomment-244728544.
Currently, we have ~6700 entries with a containedIn link to geonames. We get this by querying geonames with the Gemeindeschlüssel, see the current morp-enriched, lines 331-335. The link is missing for ~15k, see http://beta.lobid.org/organisations/search?q=_missing_:containedIn. As mapzen also provides geonames data, we should probably query it with address plus Gemeindeschlüssel (if available).
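One possible arrangement, sketched under assumptions: try the existing Gemeindeschlüssel table first and only call the geocoder when that fails, which would keep the number of remote calls down. The `geocode` callable stands in for whatever service is used; its signature and the record fields `ags` and `address` are hypothetical.

```python
def resolve_contained_in(org, ags_table, geocode):
    """Resolve a GeoNames link via the AGS table, falling back to a geocoder."""
    ags = org.get("ags")
    if ags and ags in ags_table:
        return ags_table[ags]  # cheap local lookup, no remote call
    address = org.get("address")
    if not address:
        return None  # nothing to query with
    return geocode(address)  # remote call only for the remaining records


# Stub geocoder that records its calls; a real one would query the service.
calls = []


def fake_geocode(address):
    calls.append(address)
    return "http://www.geonames.org/2886242" if "Köln" in address else None


ags_table = {"05315000": "http://www.geonames.org/2886242"}
orgs = [
    {"ags": "05315000", "address": "50676 Köln"},
    {"address": "50676 Köln"},
    {},
]
results = [resolve_contained_in(org, ags_table, fake_geocode) for org in orgs]
print(results, len(calls))
```

In this sample only the second record triggers a geocoder call; the first is served from the table and the third has nothing to look up.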