ioos / ckanext-ioos-theme

IOOS Catalog as a CKAN extension
GNU Affero General Public License v3.0

SOLR search index results not filtering as expected #253

Closed mwengren closed 1 month ago

mwengren commented 2 months ago

Separating this issue out from #252

We're seeing poor filtering results from the Solr index.

If I search by an individual GCOOS dataset id (see this search for 'Data for ioos-station-wmo-42400'), I get essentially the full list of datasets back (~76,068 total datasets). The results do at least appear to be sorted by relevance (most relevant at the top), but the query does essentially nothing to reduce the number of results returned.

Testing today on the simple search string: 'M01':

Without quotes: M01 yields ~60,590 results: https://data.ioos.us/dataset/?q=M01&sort=score+desc%2C+metadata_modified+desc&ext_timerange_start=&ext_timerange_end=&ext_min_depth=&ext_max_depth=&ext_bbox=

With quotes: "M01" yields 33 results: https://data.ioos.us/dataset/?q=%22M01%22&sort=score+desc%2C+metadata_modified+desc&ext_timerange_start=&ext_timerange_end=&ext_min_depth=&ext_max_depth=&ext_bbox=

Another example: searching for osu592-20230524T1813-delayed with org=Glider DAC returns 6877 datasets without quotes and 2 datasets with quotes.

Results are more reasonable for other simple phrase searches like 'Mote' or 'NERACOOS':

Search for 'NERACOOS' ~397 results: https://data.ioos.us/dataset/?q=NERACOOS&sort=score+desc%2C+metadata_modified+desc&ext_timerange_start=&ext_timerange_end=&ext_min_depth=&ext_max_depth=&ext_bbox=
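For anyone reproducing the comparisons above, the quoted and unquoted searches differ only in the `q` parameter. A minimal sketch for building both URLs, assuming the empty `ext_*` parameters can be omitted (the helper name `search_url` is hypothetical, not part of CKAN or this extension):

```python
from urllib.parse import urlencode

BASE_URL = "https://data.ioos.us/dataset/"


def search_url(term, phrase=False):
    """Build a catalog search URL; wrap the term in double quotes for a phrase search."""
    q = f'"{term}"' if phrase else term
    params = {
        "q": q,
        # Same sort order as the URLs in this issue.
        "sort": "score desc, metadata_modified desc",
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(search_url("M01"))               # unquoted search
    print(search_url("M01", phrase=True))  # quoted (phrase) search
```

`urlencode` percent-encodes the double quotes as `%22`, which matches the quoted URL above.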

mwengren commented 2 months ago

@benjwadams says it might be a query syntax issue or a proximity (like phrases) issue with how Solr is configured.

benjwadams commented 2 months ago

This is very likely due to how the free-text search is configured in the stock CKAN schema.

If you remove the "T" from the ISO 8601 date strings in the search, you will get much more reasonable results. Letters adjacent to numbers appear to be getting tokenized separately. Quoting also works, but that workaround may not be immediately obvious to users.
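To illustrate the suspected behavior, here is a simplified stand-in for an analyzer that splits on punctuation and on letter/digit boundaries. It is not the actual Solr analyzer chain, which is not shown in this thread, and the "match any token" rule is an assumption made only to show why unquoted queries could match so broadly. All function names and sample documents are hypothetical.

```python
import re


def tokenize(text):
    """Split on non-alphanumerics, then on letter/digit boundaries; lowercase."""
    tokens = []
    for part in re.split(r"[^A-Za-z0-9]+", text):
        tokens.extend(re.findall(r"[A-Za-z]+|[0-9]+", part))
    return [t.lower() for t in tokens]


def matches_any(query, doc):
    """Assumed unquoted behavior: match if the doc contains any query token."""
    doc_tokens = set(tokenize(doc))
    return any(t in doc_tokens for t in tokenize(query))


def matches_phrase(query, doc):
    """Assumed quoted behavior: query tokens must appear adjacent and in order."""
    q, d = tokenize(query), tokenize(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


if __name__ == "__main__":
    docs = ["M01 buoy", "station 01 temperature", "glider m 7"]
    print(tokenize("osu592-20230524T1813-delayed"))
    print([doc for doc in docs if matches_any("M01", doc)])
    print([doc for doc in docs if matches_phrase("M01", doc)])
```

In this toy model, `M01` becomes the tokens `m` and `01`, so the unquoted query matches any document containing either one, while the phrase query matches only documents where the two tokens are adjacent.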

benjwadams commented 1 month ago

Addressed in https://github.com/ioos/catalog-docker-base/commit/61ad2bcc5e2b0f92e636528e0787c4b70f25060c. Issues with glider names returning exorbitant numbers of search results have been fixed.