Simple keyword search tests flawed

pandzel-zz commented 9 years ago

Several tests belong to a general family that I would call "term search". A test may search for a single term or for multiple terms, and it may apply either to a CSW GetRecords request with the "q" parameter or to an OpenSearch request using the "searchTerm" parameter. Basically, all such tests work the same way: a search term (or terms) is used to search the catalog, and then that term (or terms) is expected to be present in the body of each returned record. There are several reasons why, in my opinion, this expectation is wrong:

Response format: Imagine a metadata document that happens to have a particular term in its body, buried deep in the structure tree but still indexed by a full-text search engine such as Apache Lucene. A search for that term will yield the record, but the final Atom response won't contain that particular element in its body.
Stemmers: Some catalogs might have linguistic stemmers deployed. In that case, a search for "rivers" (plural) will also yield records containing only "river" (singular). The test will fail because it won't find "rivers" in those records.
Tags: Some indexes might include tags or other information that comes from somewhere other than the metadata. Some catalogs allow users to categorize stored metadata; for example, one can declare a record to belong to a "live data" category. A search for "live data" will then return all such records even if they don't have the term "live data" explicitly embedded in the metadata document.
Multi term: For a multi-term search, the tests expect all terms from the input request to be present in the records. I'm not sure where this requirement comes from, but looking at how Apache Lucene or Google Search works (both without additional search modifiers), it seems too far-fetched. In particular, I didn't find any such requirement in the OpenSearch specification. Maybe I am wrong, but the general (de facto) understanding is that a multi-term search returns all records matching at least one term.
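
To make the stemmer and multi-term points concrete, here is a minimal sketch of the two kinds of check. It is not the test suite's actual code; the function names and the sample record text are hypothetical, chosen only for illustration.

```python
def contains_all_terms(record_text, terms):
    """Strict check: every term must appear literally (case-insensitive)."""
    text = record_text.lower()
    return all(term.lower() in text for term in terms)


def contains_any_term(record_text, terms):
    """Lenient check: at least one term must appear (case-insensitive)."""
    text = record_text.lower()
    return any(term.lower() in text for term in terms)


# Hypothetical record that a catalog with a stemmer could return for "rivers".
record = "Hydrology of the Colorado river basin"

print(contains_all_terms(record, ["rivers"]))         # strict check rejects the record
print(contains_any_term(record, ["rivers", "basin"]))  # lenient check accepts it
```

The strict check rejects a record that the catalog matched legitimately through stemming, which is the failure described above.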

rjmartell commented 9 years ago

According to Table 6 in OGC 12-176r6, the "text search" KVP query includes the "q" parameter. Let's call this "simple text search". It is described in the spec by a single sentence:

"Comma separated list of search terms that are used to search all text fields in a catalogue record."

And that's it. It does seem to be somewhat underspecified. There are a couple of things to note here:

There is no explicit mention of how multiple terms are to be interpreted.
OpenSearch appears to use a space character as the separator, not a comma. Although this is not formally specified, all of the examples show a space-separated list of terms (OGC 10-032r8 does the same).
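
The separator difference is easiest to see in the encoded query strings. The sketch below rests on several assumptions: the endpoint URL is hypothetical, the `service`/`version`/`request` keys follow the usual OGC KVP pattern rather than being quoted from the spec, and the parameter name `q` is used for both styles.

```python
from urllib.parse import urlencode

BASE_URL = "https://example.org/csw"  # hypothetical endpoint


def build_query(terms, separator):
    """Build a GetRecords-style KVP query string with the given term separator."""
    params = {
        "service": "CSW",          # assumed standard OGC KVP keys
        "version": "3.0.0",
        "request": "GetRecords",
        "q": separator.join(terms),
    }
    return BASE_URL + "?" + urlencode(params)


terms = ["river", "lake"]
print(build_query(terms, ","))  # comma-separated, as in the Table 6 description
print(build_query(terms, " "))  # space-separated, as in the OpenSearch examples
```

With `urlencode`, the comma is percent-encoded as `%2C` and the space as `+`, so a server has to decide which of the two it splits on.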

Note: OGC 10-032r8 (OpenSearch Geo and Time Extensions) recommends restricting search terms to three record fields: dc:title, dc:description, and dc:subject.

rjmartell commented 9 years ago

Hopefully the next revision of the spec will clarify these matters. Following the principle of least astonishment, the test suite currently assumes that a simple text search is interpreted somewhat like a Google search:

exact match (case-insensitive, though)
implicit AND

Should the CSW3 spec clarify the expected behavior in the next revision, the corresponding tests will be updated accordingly. Please consider submitting a CR to the SWG--several implementers have been kicking around ideas.

rjmartell commented 9 years ago

The q parameter should probably be reserved for specifying a "simple" text search. The AnyText pseudo-property could be used to express more sophisticated full-text queries. However, the spec is silent about this common queryable, so maybe it's intended as a vendor-specific extension point (the general model doesn't have anything to say about it either).

mhogeweg commented 9 years ago

I think we all agree the spec is open to interpretation regarding the q parameter. We suggest not enforcing one particular interpretation, as you describe above. The comparison to Google search does not solve this, as Google has a lot more happening than the exact match/implicit AND you mention.

Case in point: search Google for "hottentotten tenten tentoonstelling" and the first results show "hottentottententententoonstelling" (which happens to be a valid Dutch word that could be made significantly longer if desired). This search would probably fail the test.

In the meantime, I recommend we drop this specific test for the CSW 3.0 specification, since it breaks on catalogs implementing the common search engine behaviors mentioned above. Those catalogs would have to implement specific behaviors that do not benefit users, just to pass this test.

bermud commented 9 years ago

@rjmartell, since the spec is not very clear, can we make the test more general in a way that it makes sense instead of removing it?

rjmartell commented 9 years ago

As written, the basic text search facility in the candidate CSW3 spec offers no guidance whatsoever to implementers and thus is, strictly speaking, untestable. However, if the intent of the spec authors is to permit implementers to do whatever they want, then I would agree it doesn't belong in a conformance test suite.

In the interest of clarifying the expected behavior of "basic" text search, the test suite imposes two requirements (as noted above). But if implementers cannot agree on how such a query should be processed then it should be expunged from the spec altogether.

An alternative is to just check for a non-empty result set and ignore the actual content of the matching records (whether the answer is correct or not).

mhogeweg commented 9 years ago

Since CSW is an interface, the test in my view should focus on whether the source implements the interface correctly. Whether the answer is correct, or even useful, is beyond the scope of that interface (and hopefully the implementers actually make sure their catalogs return results relevant to the question...).

With GetRecords, what is the expected structure of the response? That is something that could be tested fairly unambiguously.

rjmartell commented 9 years ago

I'm hard pressed to imagine that the content of the response is irrelevant when attempting to verify that an interface has been implemented correctly. However, in this case most implementers seem to agree that the spec is too ambiguous to serve as an authoritative test "oracle".

So, pending clarification in the final spec, the tests for simple keyword searches will ignore the content of the records that are purported to satisfy the query. The result set, however, is expected to be non-empty.
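
A relaxed check of this kind could look roughly like the following sketch, which assumes an Atom response and counts its entries without inspecting their content. The function names and the sample feed are hypothetical; this is not the test suite's implementation.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def count_entries(feed_xml):
    """Return the number of atom:entry elements directly under the feed root."""
    root = ET.fromstring(feed_xml)
    return len(root.findall(f"{{{ATOM_NS}}}entry"))


def result_set_is_non_empty(feed_xml):
    """Pass if at least one record came back; record content is ignored."""
    return count_entries(feed_xml) > 0


# Hypothetical response to a simple keyword search.
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Search results</title>
  <entry><title>Colorado river basin</title></entry>
</feed>"""

print(result_set_is_non_empty(sample_feed))
```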

opengeospatial / ets-cat30

Simple keyword search tests flawed #10