hbz / lobid-gnd

UI and API to the Integrated Authority File (Gemeinsame Normdatei, GND)
http://lobid.org/gnd
Eclipse Public License 2.0

Using range query parameter via OpenRefine Reconciliation #304

Closed b2m closed 2 years ago

b2m commented 2 years ago

The general API of lobid-gnd supports several Elasticsearch Query Parameters like wildcards via * and ranges like [1920 TO 1950].
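For illustration, here is a minimal sketch of how such queries could be turned into search URLs. It assumes a search endpoint at http://lobid.org/gnd/search that takes `q` and `format` parameters, and it uses the `dateOfBirth` field that appears later in this thread. The helper name `search_url` is made up for this example.

```python
from urllib.parse import urlencode

# Assumed search endpoint of lobid-gnd; adjust if your setup differs.
BASE_URL = "http://lobid.org/gnd/search"


def search_url(query: str) -> str:
    """Return a search URL for an Elasticsearch query string."""
    return BASE_URL + "?" + urlencode({"q": query, "format": "json"})


# Wildcard: any term starting with "Bened"
print(search_url("Bened*"))
# Range on a date field
print(search_url("dateOfBirth:[1920 TO 1950]"))
```

`urlencode` takes care of escaping the brackets, colon, and spaces, so the query string can be written exactly as it would be typed into the web search.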

See the blog post "lobid-gnd: Formulierung komplexer Suchanfragen" for details.

Some of these features, like wildcards or fuzzy search, also seem to work when using lobid-gnd via the OpenRefine Reconciliation endpoint. But trying out date ranges produces unexpected results in OpenRefine.

Example

Example via web search

Here is a contrived example:

I want to reproduce the following search via OpenRefine:

The results from the web search are awesome.

Example via OpenRefine Reconciliation

To reproduce this in OpenRefine I used the following data: lobid-gnd-test1

The range for the date of birth was used as an additional property in the reconciliation request: lobid-gnd-test2

But the results differ considerably from the ones I tried to reproduce: lobid-gnd-test3

Using full date ranges produces even more unexpected results: lobid-gnd-test4

Conclusion

I conclude that either I am using the range query parameter incorrectly via the reconciliation API, or it is not supported (yet). Either way, I would be happy to get some clarification about which Elasticsearch query parameters are supported and whether support for range queries via the OpenRefine reconciliation API is planned.

fsteeg commented 2 years ago

Thanks for getting in touch! Many special characters are removed here. We are doing that because they caused issues when they were part of the data (see #188 / #190). But maybe we could check for the range pattern specifically and support that.

b2m commented 2 years ago

Many special characters are removed here.

Thanks for providing the reference to the source code. In essence this means that the following query parameters are also not usable:

fsteeg commented 2 years ago

Since ranges with TO and groups with AND/OR are quite specific, I think it makes sense to support these, and that should not result in issues with unexpected data as in #188. If a value contains such a range or group, no special characters are replaced in that value.

Deployed to http://test.lobid.org/gnd/reconcile, see e.g.:

{"query":"Benedikt Papst","properties":[{"pid":"dateOfBirth","v":"[1920-01-01 TO 1950-01-01]"}]}

@b2m Would this work for you?

b2m commented 2 years ago

Thanks, this looks promising!

For me today this would be enough: I could just send (needle+ OR needle) and have the support of the query string "mini language". But me in 5 years (months? 🤔) will have forgotten about this and will wonder again why needle+ is not working.

So I would suggest expanding the trigger for "not removing special characters" to the other supported patterns like (\+|\-)\w, (<|>)=?\d, \w\^\d, and (OR|AND).

This would make a clear distinction between:

  1. User is sending "garbage", therefore special characters are removed.
  2. User is sending structured content, let's assume he knows what he is doing.
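A minimal sketch of this distinction, not the lobid-gnd implementation: values that match one of the structured patterns pass through unchanged, everything else has special characters replaced. The set of characters in `SPECIAL` is hypothetical (the real list lives in the lobid-gnd source), and the `+`/`-` pattern is tied to the start of a term here so that hyphenated names do not count as structured, which is an assumption on top of the proposal.

```python
import re

# Patterns that mark a value as structured query syntax.
STRUCTURED = re.compile(
    r"\[\S+ TO \S+\]"       # range, e.g. [1920 TO 1950]
    r"|\b(?:AND|OR)\b"      # boolean group
    r"|(?:^|\s)[+-]\w"      # required / prohibited term at the start of a term
    r"|[<>]=?\d"            # comparison, e.g. >=1920
    r"|\w\^\d"              # boost, e.g. needle^2
)

# Hypothetical set of characters to strip, for illustration only.
SPECIAL = re.compile(r'[\[\]{}()<>^~*?:"\\/!=+]')


def clean(value: str) -> str:
    """Keep structured values untouched, strip special characters otherwise."""
    if STRUCTURED.search(value):
        return value
    # Replace each special character with a space, then collapse whitespace.
    return " ".join(SPECIAL.sub(" ", value).split())
```

With this sketch, `clean("[1920-01-01 TO 1950-01-01]")` and `clean("(needle+ OR needle)")` come back unchanged, while `clean("Benedikt (Papst)")` loses its parentheses.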

Elasticsearch also has a validation endpoint to check queries before executing them. I am not sure about the impact on performance, but maybe this is an alternative approach to consider.
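As a rough sketch of that approach: the Elasticsearch validate API is reachable at `_validate/query` on an index and reports a `valid` flag in its JSON response. The host and index name below are hypothetical, and both helper names are made up for this example.

```python
import json
from urllib.parse import quote

# Hypothetical Elasticsearch host and index name, for illustration only.
ES_INDEX_URL = "http://localhost:9200/gnd"


def validate_url(query: str) -> str:
    """Build a request URL for the Elasticsearch validate API."""
    return f"{ES_INDEX_URL}/_validate/query?q={quote(query, safe='')}"


def is_valid(response_body: str) -> bool:
    """Read the 'valid' flag from a validate API response body."""
    return bool(json.loads(response_body).get("valid", False))
```

A caller could fetch `validate_url(value)` first and fall back to removing special characters only when `is_valid` returns `False`, at the cost of one extra request per value.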

fsteeg commented 2 years ago

Elasticsearch also has a validation endpoint to check queries before executing them. I am not sure about the impact on performance, but maybe this is an alternative approach to consider.

That's a good point, thanks! I did a quick and dirty test and this might work. I'll need a bit of time for a proper implementation though, and I'm currently not sure when we'll have time for that. Depending on what you prefer, @b2m, we can either deploy this as it currently is and open a new issue, or leave this one open.

b2m commented 2 years ago

So a new issue it is =)

I'll close this one as the main part will be solved by #306 and #309 looks like a promising replacement or addition.

fsteeg commented 2 years ago

Great, thanks for opening the new issue.

Deployed to production, see e.g.:

{"query":"Benedikt Papst","properties":[{"pid":"dateOfBirth","v":"[1920-01-01 TO 1950-01-01]"}]}
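For anyone who wants to send this query outside of OpenRefine, here is a minimal sketch. It assumes the production endpoint is http://lobid.org/gnd/reconcile (mirroring the test URL given earlier) and that it accepts the usual reconciliation batch format, a form-encoded `queries` parameter holding a JSON object of keyed queries. The helper name `reconcile_request` is made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed production endpoint, mirroring the test URL given earlier.
ENDPOINT = "http://lobid.org/gnd/reconcile"


def reconcile_request(query: dict) -> Request:
    """Wrap one query in a reconciliation batch and build a POST request."""
    batch = {"q0": query}  # "q0" is an arbitrary key chosen by the client
    body = urlencode({"queries": json.dumps(batch)}).encode("utf-8")
    return Request(ENDPOINT, data=body, method="POST")


query = {
    "query": "Benedikt Papst",
    "properties": [
        {"pid": "dateOfBirth", "v": "[1920-01-01 TO 1950-01-01]"}
    ],
}
request = reconcile_request(query)
# Send with urllib.request.urlopen(request) to get the candidates as JSON.
```

The request is only built here, not sent, so the snippet runs without network access.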