opengeospatial / ogcapi-records

An open standard for the discovery of geospatial resources on the Web.
https://ogcapi.ogc.org/records

q: Search terms delimited by comma?! #295

Closed m-mohr closed 7 months ago

m-mohr commented 1 year ago

Porting over the discussion from https://github.com/opengeospatial/ogcapi-records/pull/273#discussion_r1218620027

@m-mohr wrote:

How are search terms defined and generated from the q parameter? Is it basically: terms = q.split(' ') ? So is space the only separator and everything in-between spaces (or the beginning/end of the string) is a term?

@pvretano wrote:

The definition of the parameter is:

---
name: q
in: query
required: false
schema:
  type: array
  items:
    type: string
explode: false
style: form

so the search terms are comma-separated and so the comma delimits the terms.
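As a rough illustration of what that definition implies for a server, here is a minimal Python sketch that reads q as a single comma-delimited value (style: form, explode: false). The function name parse_q and the example.org URL are hypothetical, not part of the specification.

```python
from urllib.parse import urlsplit, unquote_plus


def parse_q(url):
    """Return the search terms carried by the q parameter of a request URL.

    Assumes the array is serialized as one comma-delimited value
    (style: form, explode: false), as in the parameter definition above.
    """
    terms = []
    for pair in urlsplit(url).query.split("&"):
        name, _, raw = pair.partition("=")
        if name == "q":
            # Split before decoding so an encoded comma (%2C) stays inside a term.
            terms = [unquote_plus(t) for t in raw.split(",") if t]
    return terms


# Hypothetical request URL, for illustration only.
print(parse_q("https://example.org/collections/c/items?q=ocean,climate%20change"))
```

Note that the space inside the second term is kept as part of that term; only the comma splits.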

@m-mohr wrote:

@pvretano Good point, on the other hand that's not what users expect, I think. Users type free text phrases with spaces into Google, I've never seen anyone type commas. This would eventually mean that you are not able to simply pass through the text field value, but instead you have to do something like searchTerm.replace(' ', ',') beforehand. That just doesn't feel right to me, but maybe it's just me?

pvretano commented 1 year ago

@m-mohr, no, I think you are right. I'll create a new PR that changes the q parameter so that its value is a white-space-separated list of search terms.

cportele commented 1 year ago

I disagree. This is an API and not a Google search bar.

If the search is for multiple terms then the schema is an array of strings and the comma is the separator. If you don't want a comma, maybe use explode: true so that we have ...&q=foo&q=bar.

Otherwise we would be introducing our own micro-format that everyone has to parse on their own (even though splitting at spaces is not very difficult).
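To make the two serializations concrete, the following Python sketch (helper names are made up for illustration) shows that explode: false with commas and explode: true with repeated q parameters both yield the same list of terms using only the standard library:

```python
from urllib.parse import parse_qs


def terms_comma(query):
    """explode: false -> one q value, terms separated by commas."""
    values = parse_qs(query).get("q", [])
    return values[0].split(",") if values else []


def terms_exploded(query):
    """explode: true -> one q=... pair per term."""
    return parse_qs(query).get("q", [])


print(terms_comma("q=foo,bar"))        # query string as in explode: false
print(terms_exploded("q=foo&q=bar"))   # query string as in explode: true
```

Either way the client and server agree on the term boundaries without any extra parsing rules beyond what OpenAPI already describes.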

m-mohr commented 1 year ago

I could also live with commas, but it should be specified more clearly in the spec. It's bad that people need to figure this out from the OpenAPI fragment. The docs should clearly state that commas are the delimiter and spaces have no special meaning, which means that they are not splitting into multiple terms.

pvretano commented 1 year ago

@m-mohr @cportele OK ... I'll leave it as it is but add clarifying text in the specification to point out that it is a comma separated list and that spaces have no special meaning.

tomkralidis commented 9 months ago

cc @pvgenuchten @kalxas @mhogeweg

Sorry to bring this up again after missing this discussion.

I agree with @m-mohr in that a comma (,) doesn't feel right as a separator (vs. spaces).

Doing some quick tests against Google, Yahoo, and Bing, their q parameter supports (at least):

  1. space separated
  2. case-insensitive
  3. multiple terms imply exclusive search (AND) by default
  4. multiple terms for an inclusive search can optionally be OR’d
  5. + and - for included and excluded terms

Some search engine implementations also support the above behaviour (for example, Elasticsearch).

There are obviously more complex semantics, but perhaps items 1-3 should be considered for our support of q to capture core mass market semantics into something more "familiar" for a user?

m-mohr commented 9 months ago

In STAC we have two conformance classes now:

  1. Basic: Based on Records, just comma separated words
  2. Advanced: More advanced capabilities, similar to what Tom mentions.

See: https://github.com/cedadev/stac-freetext-search

mhogeweg commented 9 months ago

@cportele makes a good point. Back-end implementation gets mixed in with the API.

The search engines don’t just split terms by space. Search for San Diego, etc.

See

You could define a simple minimal syntax to be supported and allow anything else. In our Geoportal we allow submitting full Elastic/Opensearch queries and ideally that can be supported via OGC Records and STAC as well.

pvgenuchten commented 9 months ago

to me, q= is a special case, unlike {fieldname}=value or cql

to me, q= represents a free-text-search type of field, which allows entering a text string to find close (fuzzy) matches

I like the suggestion of @mhogeweg to adopt a minimal syntax to define FTS queries, somewhere in between the advanced operators of google/microsoft, and the fts queries of elastic or postgres

The suggestions by the STAC team at https://github.com/cedadev/stac-freetext-search#http-get seem a good starting point, although I think I would suggest interpreting /search?q=climate model as climate AND model, to limit results when you add terms.

rob-metalinkage commented 9 months ago

Perhaps it's better to allow conformance metadata to specify the behaviour and leave q as truly free text.

tomkralidis commented 9 months ago

2022-10-05: this was further discussed during an editing meeting. We decided to leave q= as currently specified, for reasons of simplicity. @pvretano will add additional informative text as part of the OAB submission target.

m-mohr commented 9 months ago

I appreciate that as we already based our STAC extension on it. :-)

pvretano commented 9 months ago

All, in order to try and balance the desire for a simple text search capability and also satisfy those that want something more, I have created PR #314 that adds some text around the q parameter and also slightly enhances its capabilities.

The original specification of the q parameter indicated that search terms are comma separated implying a logical OR. That is, if any of the specified search terms appears in one or more of the text fields in a record then that record can be included in the result set.

The slight modification that I made is to say that search terms can contain white spaces and this means that all the space-separated search terms must appear in one or more of the text fields in a record before that record can be included in the result set. So, consider q=ocean,climate%20%09change,desalination. In this example, a record can be included in the result set if one or more of the text fields of that record contain the terms "ocean" OR ("climate" AND "change") OR "desalination".
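The semantics described above could be sketched roughly as follows in Python. This is only an illustration, not the text of the PR: the function name is hypothetical, and case-insensitive substring matching over the concatenated text fields is an assumption that the description does not spell out.

```python
def matches_or_of_and(q, text_fields):
    """Comma-separated terms are OR'd; whitespace-separated words within a term are AND'd.

    q is the already percent-decoded parameter value.
    Assumption: case-insensitive substring matching over all text fields.
    """
    haystack = " ".join(text_fields).lower()
    for term in q.split(","):
        # str.split() with no argument splits on any run of whitespace,
        # so "climate \tchange" (from climate%20%09change) gives two words.
        words = term.lower().split()
        if words and all(word in haystack for word in words):
            return True
    return False


q = "ocean,climate \tchange,desalination"
print(matches_or_of_and(q, ["Change in regional climate"]))
print(matches_or_of_and(q, ["Sea ice extent"]))
```

With this reading, word order within a term does not matter, so a record containing "change in regional climate" satisfies the "climate change" term.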

Please review PR #314 and let me know if this is satisfactory (keeping in mind that we are looking for a simple-to-implement capability) OR if I should fall back to the original, simpler, specification of the q parameter.

On a personal note, I really don't want to add too much syntax to the q parameter because it does not make sense to me to add yet another query language to the mix when we already have CQL for handling advanced cases. The idea with q is to provide something simple that covers a large set of use cases. When I think about the Google search operator, for example, almost no one that I know is aware of the additional syntax that Google supports.

pvretano commented 9 months ago

16-OCT-2023 SWG Meeting: The consensus in the SWG is that allowing white space in a search term is good but the interpretation should not be "'term' AND 'term'" but rather "'term''term'" so that a search term like climate%20%09change would find the term "climate change" as a combined term in the record, rather than what happens now, which would also match "change climate". The SWG believes that these changes represent 80% of how something like a Google search box is used. @pvretano will make the necessary changes to the PR and merge.
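A minimal Python sketch of the phrase interpretation the SWG settled on, under some assumptions that the meeting note does not state: runs of white space are collapsed to a single space, matching is case-insensitive, and a phrase must occur within a single text field. The function name is hypothetical.

```python
def matches_phrase(q, text_fields):
    """Comma-separated terms are OR'd; a term containing white space matches as one phrase.

    q is the already percent-decoded parameter value.
    Assumptions: whitespace runs collapse to one space, matching is
    case-insensitive, and a phrase must occur within a single field.
    """
    fields = [" ".join(field.lower().split()) for field in text_fields]
    for term in q.split(","):
        phrase = " ".join(term.lower().split())
        if phrase and any(phrase in field for field in fields):
            return True
    return False


print(matches_phrase("climate \tchange", ["Impacts of climate change"]))
print(matches_phrase("climate \tchange", ["How change affects climate"]))
```

Unlike the AND reading, the second record no longer matches, because the two words do not appear next to each other in that order.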