pelias / schema

elasticsearch schema files and tooling
MIT License

add dedicated concordance field #480

Open missinglink opened 3 years ago

missinglink commented 3 years ago

ping! @pelias/contributors this PR is a discussion with code attached 🚀

this year has seen some work around recording and exposing 'concordances' (the WOF term for foreign key references). these concordances are valuable to organisations who also use the foreign ID system and would like an easy way of joining Pelias GIDs with other datasets.


the existing implementation works great, looking at Germany in WOF you can see it returns a treasure trove of useful concordances in the addendum.

one problem we've identified with using the addendum is that it's (by definition) only semi-structured and comes without many guarantees of correctness or availability.

what would be better is if concordances were more structured and formalised within Pelias so that they could be considered a public API which integrators could rely upon for a 'crosswalk' between datasets.

this PR would potentially open the door for that, it could be combined with a PR to pelias/model to perform the validation. the validation rules would need a little thought, but things like casing, delimiters, abbreviations, collisions, etc would need to be considered.

there is also a secondary concern (beyond simply displaying the information), which is that users may also wish to search on these values, this is certainly never going to be possible with the addendum.

introducing a new parameter would need a bit more discussion but what comes to mind is the /v1/place endpoint could support concordance lookup, either via the existing ?ids= param or a new one.

thoughts?

missinglink commented 3 years ago

a bit more info on the code in this PR, the new field is called concordance and is an object type mapping with string keys (so basically it's the same sort of structure as an Object in javascript).

I think this would be preferable to something like how we do category where it's more analogous to a javascript Array.

The dynamic_templates entry is there because the object keys are generated dynamically and would (by default) create fields with the default mapping; instead we define a specific mapping which sets the type to keyword.
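To make the shape of that mapping concrete, here is a minimal sketch of an object field whose dynamically created sub-fields are forced to `keyword`. It is an illustration written for this discussion, not the code in the PR: the template name and the `concordanceMapping` / `exampleDoc` variable names are assumptions, and `doc_values: false` reflects the setting mentioned later in the thread.

```javascript
// Hypothetical sketch of the mapping described above; not the PR's actual code.
const concordanceMapping = {
  dynamic_templates: [{
    // the template name 'concordance' is an illustrative choice
    concordance: {
      // apply to every dynamically created key under the concordance object
      path_match: 'concordance.*',
      mapping: {
        type: 'keyword',   // no analysis, exact matching only
        doc_values: false  // aggregations/sorting are not needed on these values
      }
    }
  }],
  properties: {
    concordance: {
      type: 'object',
      dynamic: true // new keys are allowed and picked up by the template above
    }
  }
};

// Example document body: string keys mapping to foreign IDs.
const exampleDoc = {
  concordance: { 'gn:id': '2222', 'wk:page': 'Germany' }
};
```

With a mapping along these lines, indexing `exampleDoc` would create `concordance.gn:id` and `concordance.wk:page` as keyword fields rather than default text fields.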

orangejulius commented 3 years ago

Yeah, this makes a lot of sense, and I really like the idea of querying for concordances on the place endpoint. What do you think would be a good query format for that?

My memory is a bit hazy, but I think we should be able to query on those keyword fields easily, right? We don't need to do anything else: aggregations, keywords, or regular full text search.

missinglink commented 3 years ago

Yeah exactly, so it's set to keyword which means there's no analysis (it's just full token exact matching), so no synonyms or anything like that are applied.

It's currently set to doc_values=false because it doesn't make sense to run aggregations on unique values anyway.

So yeah, basically if you write a match query and it matches exactly it's a hit, else not, nothing remotely fancy going on.
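As a rough illustration of that exact-match behaviour, the sketch below builds an Elasticsearch query body for a single concordance. The helper name `buildConcordanceQuery` is hypothetical, and it uses a `term` query rather than `match`; on an unanalysed keyword field the two should behave the same for this purpose.

```javascript
// Hypothetical helper: builds an exact-match query body for one concordance.
// Assumes the values are stored under `concordance.<key>` as keyword fields.
function buildConcordanceQuery(key, value) {
  return {
    query: {
      term: {
        // keyword fields are compared as whole strings, so coerce the value
        [`concordance.${key}`]: String(value)
      }
    }
  };
}

// Example: look up the document whose GeoNames concordance is 2222.
const body = buildConcordanceQuery('gn:id', 2222);
console.log(JSON.stringify(body));
```

Anything other than the full stored value (a prefix, a different case, a synonym) would simply not match.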

> What do you think would be a good query format for that?

Good question, so you could just use /v1/place?ids=gn:id@2222 although I'm not a big fan of mixing and matching our GID values with others, the param is ?ids and not ?gids so 🤷‍♂️

Otherwise we could be more explicit and say something like /v1/place?concordance=gn:id@2222,wk:page@Germany

TBH I haven't given that enough thought, neither of those sounds very nice.

[edit] due to using an object type mapping we have key->value pairs, so it would require a convention (such as the @ in the example above) to delimit the key from the value.
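For illustration only, the `@` convention could be parsed roughly as follows. Nothing here is an agreed format: the function name `parseConcordanceParam` is hypothetical, and the sketch assumes pairs are comma-separated and that the first `@` in each pair separates key from value.

```javascript
// Hypothetical parser for a value such as 'gn:id@2222,wk:page@Germany'.
// Assumption: ',' separates pairs and the first '@' separates key from value.
function parseConcordanceParam(param) {
  return param.split(',').map((pair) => {
    const at = pair.indexOf('@');
    // reject pairs with no delimiter, an empty key, or an empty value
    if (at < 1 || at === pair.length - 1) {
      throw new Error(`invalid concordance pair: ${pair}`);
    }
    return { key: pair.slice(0, at), value: pair.slice(at + 1) };
  });
}

const pairs = parseConcordanceParam('gn:id@2222,wk:page@Germany');
console.log(pairs);
```

One limitation of this convention is that values containing a comma could not be expressed without some additional escaping rule.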

orangejulius commented 3 years ago

I agree that reusing the ids parameter is not ideal.

A concordance= param would work, but like you described we would have to handle both the "field" and "value" side of the concordance query. I also think we'd really want to put some effort into making the concordance names a bit more friendly. gn:id and wk:page (and all the others as they are stored in WOF) are pretty cryptic if you don't know what they stand for.

I guess all this would complicate the /v1/place endpoint a bit, since it would support queries by ids or concordance (but not both?). That might still be worth it.

Joxit commented 3 years ago

:+1: we should not use ids for concordance.

A feature like this would be very interesting, especially with the OSM data :+1: