imls-dmt / imls-dmt-api

imls-dmt-api
Apache License 2.0

Search overhaul #10

Closed hbarrett closed 4 years ago

hbarrett commented 4 years ago

The search API should use a POST method instead of a GET method. This will allow complex queries (match, simple, starts-with, etc.) to be expressed to the API via JSON objects.
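
As a rough sketch of what a POST-based search call could look like from a client, the snippet below wraps a search description in a JSON request body using only the Python standard library. The endpoint URL and the keys inside the body are placeholders chosen for illustration; the thread does not fix either at this point.

```python
import json
from urllib.request import Request

# Hypothetical endpoint; the thread does not give the search URL.
SEARCH_URL = "http://localhost:5000/api/resources/"


def build_search_request(body, url=SEARCH_URL):
    """Wrap a search description in a JSON POST request (nothing is sent here)."""
    data = json.dumps(body).encode("utf-8")
    return Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative body only; the key names are placeholders, not a final schema.
request = build_search_request(
    {"search": [{"field": "title", "string": "Data", "type": "starts-with"}]}
)
# Pass `request` to urllib.request.urlopen() to actually submit it.
```

Because the criteria travel in the body rather than the query string, nested structures such as lists of search elements do not need any URL encoding scheme of their own.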

karlbenedict commented 4 years ago

1) An additional search use case occurred to me while driving home last night. So far we have discussed two kinds of search: "simple" (a Google-style search box that runs a "within" search for the provided words against a set of pre-defined fields, e.g. title, abstract, keywords, author, discipline, etc., TBD) and "advanced" (search elements, each made up of a search string + operator [e.g. within, =, starts with] + target field, that may be combined with implicit AND operators; this is the multi-element JSON submission model that we discussed). We should also support "faceted" search, where the user selects facets in the result display interface to further limit the result set. Facets are lists based on controlled vocabularies from which users can select additional filter conditions, e.g. FRAMEWORK = "DataONE". This additional use case is probably an implementation issue for the user interface (maintaining search state and resubmitting a modified search JSON document that reflects the new criteria), but I wanted to get it on your radar as you are updating the API.

In an extra-fancy world, the facet values presented to the user would parenthetically display the number of items that would be in the result set if that specific facet search item was added to the existing query. For example, in the included annotated screenshot from Data.gov, we are seeing the initial result set of 7,741 items that were returned for the "simple" search string of "new mexico RGIS". I could further limit the search results by selecting one of the facets on the left side of the result set - for example "Climate" under the list of "Topics". The parenthetical value following "Climate" in the list of topics is the number of records that would be returned if an additional search criterion were added to the existing simple search - adding (in pseudo-search-code) "Topics == 'Climate'" with an AND to the existing query.
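
The relationship between a facet's parenthetical count and the narrowed result set can be illustrated with a small in-memory sketch. The records and the `topics` field below are made up for the example; a real deployment would ask the search backend for these counts rather than compute them in Python.

```python
from collections import Counter

# Made-up records standing in for the current result set.
RESULTS = [
    {"title": "a", "topics": ["Climate", "Water"]},
    {"title": "b", "topics": ["Climate"]},
    {"title": "c", "topics": ["Energy"]},
    {"title": "d"},  # record with no topics at all
]


def facet_counts(results, field):
    """Count how many records in the result set carry each value of `field`."""
    counts = Counter()
    for record in results:
        for value in record.get(field, []):
            counts[value] += 1
    return counts


def apply_facet(results, field, value):
    """Narrow the result set as if `field == value` were ANDed onto the query."""
    return [r for r in results if value in r.get(field, [])]


counts = facet_counts(RESULTS, "topics")
narrowed = apply_facet(RESULTS, "topics", "Climate")
```

The count shown next to "Climate" is, by construction, the size of the result set the user would see after selecting it.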

[Annotated screenshot: Data.gov dataset search results with facet counts]

2) It would also be good if the search API provided an option for returning sorted and paginated results along the lines of the sorting options provided by Data.gov (in the next attached screenshot). My recollection from yesterday's search demos was that you are already getting paginated results. We just need to superimpose a sorting on top of the pagination so that if we sort by "Last Modified" for example, page 1 of the results includes the most recently modified, page 2 contains the next most recently modified, etc.
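
A minimal sketch of sorting layered on top of pagination, assuming the backend is Apache Solr (which the response format elsewhere in this thread suggests). `sort`, `start`, and `rows` are Solr's standard query parameters; the `modified` field name and the helper itself are hypothetical.

```python
def page_params(page, per_page=10, sort_field="modified", descending=True):
    """Return Solr-style parameters for one page of a sorted result set.

    Pages are 1-based. The sort applies to the whole result set, so page 2
    continues exactly where page 1 stopped.
    """
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    direction = "desc" if descending else "asc"
    return {
        "sort": f"{sort_field} {direction}",
        "start": (page - 1) * per_page,  # number of records to skip
        "rows": per_page,                # number of records to return
    }


first_page = page_params(1)
second_page = page_params(2)
```

Since the ordering is computed before the window is cut, the most recently modified records land on page 1 and the next most recent on page 2.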

[Screenshot: Data.gov sorting options]

hbarrett commented 4 years ago

We could run the provided search with the group option and return a count for every facet as part of the returned JSON. To group the vocabularies this way, however, we would have to store them as a string (exact match) instead of the standard general text type, which breaks the string into individual units (tokens) for search. As an example, I indexed normally, grouped by publisher, and limited the output to 5 results. The result follows. As you can see, it grouped by individual unit rather than by the whole string (see groupValue).

{
    "responseHeader": {
        "status": 0,
        "QTime": 2,
        "params": {
            "q": "*:*",
            "fl": "publisher",
            "rows": "5",
            "group.field": "publisher",
            "_": "1580312544330",
            "group": "true"
        }
    },
    "grouped": {"publisher": {
        "matches": 490,
        "groups": [
            {
                "groupValue": "dataone",
                "doclist": {
                    "numFound": 10,
                    "start": 0,
                    "docs": [{"publisher": "DataONE"}]
                }
            },
            {
                "groupValue": "earth",
                "doclist": {
                    "numFound": 34,
                    "start": 0,
                    "docs": [{"publisher": "Federation of Earth Science Information Partners (ESIP Federation)"}]
                }
            },
            {
                "groupValue": null,
                "doclist": {
                    "numFound": 54,
                    "start": 0,
                    "docs": [{}]
                }
            },
            {
                "groupValue": "laboratory",
                "doclist": {
                    "numFound": 1,
                    "start": 0,
                    "docs": [{"publisher": "Oak Ridge National Laboratory"}]
                }
            },
            {
                "groupValue": "lab",
                "doclist": {
                    "numFound": 2,
                    "start": 0,
                    "docs": [{"publisher": "Mozilla Science Lab"}]
                }
            }
        ]
    }}
}

I re-indexed using "type":"string" in the field definition and ran the search again. The following is the result. This is now grouping correctly and giving us the count of each publisher.

{
    "responseHeader": {
        "status": 0,
        "QTime": 33,
        "params": {
            "q": "*:*",
            "fl": "publisher",
            "rows": "5",
            "group.field": "publisher",
            "_": "1580312544330",
            "group": "true"
        }
    },
    "grouped": {"publisher": {
        "matches": 490,
        "groups": [
            {
                "groupValue": "DataONE",
                "doclist": {
                    "numFound": 10,
                    "start": 0,
                    "docs": [{"publisher": "DataONE"}]
                }
            },
            {
                "groupValue": "Federation of Earth Science Information Partners (ESIP Federation)",
                "doclist": {
                    "numFound": 34,
                    "start": 0,
                    "docs": [{"publisher": "Federation of Earth Science Information Partners (ESIP Federation)"}]
                }
            },
            {
                "groupValue": null,
                "doclist": {
                    "numFound": 54,
                    "start": 0,
                    "docs": [{}]
                }
            },
            {
                "groupValue": "Oak Ridge National Laboratory",
                "doclist": {
                    "numFound": 1,
                    "start": 0,
                    "docs": [{"publisher": "Oak Ridge National Laboratory"}]
                }
            },
            {
                "groupValue": "Mozilla Science Lab",
                "doclist": {
                    "numFound": 2,
                    "start": 0,
                    "docs": [{"publisher": "Mozilla Science Lab"}]
                }
            }
        ]
    }}
}

The problem is that now a search for publisher:Federation yields:

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"publisher:Federation",
      "rows":"500",
      "_":"1580312544330"}},
  "response":{"numFound":0,"start":0,"docs":[]
  }}

This is because when we indexed as a string we lost the ability to search by the individual units.
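
The trade-off can be reproduced with a toy model of the two field types. The tokenizer below is a very rough stand-in for a text analyzer (lowercased word tokens) and is not meant to mirror the backend's actual analysis chain; the publisher names are taken from the responses above.

```python
import re

PUBLISHERS = [
    "DataONE",
    "Federation of Earth Science Information Partners (ESIP Federation)",
    "Oak Ridge National Laboratory",
    "Mozilla Science Lab",
]


def tokenize(text):
    """Rough stand-in for a text analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def search_text(term):
    """Text-style field: match if the term equals any token of the value."""
    return [p for p in PUBLISHERS if term.lower() in tokenize(p)]


def search_string(term):
    """String-style field: match only if the term equals the whole value."""
    return [p for p in PUBLISHERS if p == term]


text_hits = search_text("Federation")      # finds the ESIP entry
string_hits = search_string("Federation")  # finds nothing
```

The string-style lookup only succeeds when the full publisher name is supplied, which is what grouping and faceting need but not what a search box user types.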

Thoughts?

hbarrett commented 4 years ago

I should note that the same problem applies to facet queries. It looks like the solution most people have applied is to add the data twice: once as normal text and once as a string. Example:

[
    {"add-field": {
        "name": "publisher",
        "type": "text",
        "multiValued": false,
        "stored": true,
        "required": false,
        "indexed": true
    }},
    {"add-field": {
        "name": "fpublisher",
        "type": "string",
        "multiValued": false,
        "stored": true,
        "required": false,
        "indexed": true
    }}
]
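
On the indexing side, "adding the data twice" could be as simple as duplicating each facet field before a document is sent to the index. The helper below is a sketch under that assumption; the `f` prefix follows the `fpublisher` name above, while the function name and the list of facet fields are hypothetical.

```python
# Hypothetical list of fields that should get an exact-match twin.
FACET_FIELDS = ("publisher",)


def with_facet_copies(doc, facet_fields=FACET_FIELDS, prefix="f"):
    """Return a copy of `doc` with each facet field duplicated as prefix + name.

    The original field stays tokenized text for searching; the copy is meant
    for the string-typed field used in grouping and facet counts.
    """
    out = dict(doc)
    for name in facet_fields:
        if name in doc:
            out[prefix + name] = doc[name]
    return out


indexed = with_facet_copies({"title": "Primer", "publisher": "Mozilla Science Lab"})
```

Searches would then keep targeting `publisher`, while group and facet requests would target `fpublisher`.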

karlbenedict commented 4 years ago

It seems that we might want to do the paired fields for facets that we know we want to expose - i.e. those that are probably associated with fixed and controlled vocabularies. Perhaps starting with:

@njhoebel - what do you think of this set of fields for which we should enable faceted search with counts? This doesn't necessarily mean that all of them would end up in the interface, just that they are available for a faceted style of searching if we want to use them.

hbarrett commented 4 years ago

I realized that grouping searches might be important. If you want a true OR, you need grouping; otherwise the ANDs will take precedence over the ORs. I extended the search JSON that we talked about by making the search a series of groups. A simple example:

{"search": [{
    "group": "and",
    "and": [{
        "string": "Data archiving",
        "field": "keywords",
        "type": "match"
    }]
}]}

This would result in the query status:true AND (keywords:"Data archiving"). The status:true clause is hard-coded and cannot be modified. You can also do much more complex searches:

{
    "search": [
        {
            "group": "and",
            "and": [
                {
                    "field": "keywords",
                    "string": "Data archiving",
                    "type": "match"
                },
                {
                    "field": "title",
                    "string": "DataONE",
                    "type": "simple"
                },
                {
                    "field": "submitter_name",
                    "string": "Amber E.  Budden",
                    "type": "match"
                }
            ],
            "or": [{
                "field": "author",
                "string": "sophisticus",
                "type": "simple"
            }]
        },
        {
            "group": "or",
            "and": [{
                "field": "author",
                "string": "Nhoebelheinrich",
                "type": "simple"
            }]
        }
    ],
    "limit": 3,
    "offset": 1
}

This would result in the query status:true AND (keywords:"Data archiving" AND title:DataONE AND submitter_name:"Amber E. Budden" OR author:sophisticus) OR (author:Nhoebelheinrich). This way we can have an advanced OR that can be used with AND.
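
One way the grouped JSON could be turned into the query strings shown above is sketched here. It is inferred from the two examples only and is not the API's actual implementation: it assumes "match" means a quoted phrase, "simple" means a bare term, and each group is prefixed with its own "group" operator. All function names are hypothetical.

```python
def render_term(term):
    """Render one search element as field:value."""
    field, text, kind = term["field"], term["string"], term["type"]
    if kind == "match":
        # Quoted phrase; escape backslashes and embedded double quotes.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return f'{field}:"{escaped}"'
    if kind == "simple":
        # Bare term; other special characters are not escaped in this sketch.
        return f"{field}:{text}"
    raise ValueError(f"unsupported term type: {kind!r}")


def render_group(group):
    """Join a group's 'and' terms with AND, then append its 'or' terms with OR."""
    inner = " AND ".join(render_term(t) for t in group.get("and", []))
    for term in group.get("or", []):
        rendered = render_term(term)
        inner = f"{inner} OR {rendered}" if inner else rendered
    return f"({inner})"


def build_query(request):
    """Prefix the hard-coded status clause, then chain the groups."""
    query = "status:true"
    for group in request["search"]:
        operator = group["group"].upper()
        if operator not in ("AND", "OR"):
            raise ValueError(f"unsupported group operator: {group['group']!r}")
        query += f" {operator} {render_group(group)}"
    return query


simple = build_query({"search": [{
    "group": "and",
    "and": [{"string": "Data archiving", "field": "keywords", "type": "match"}],
}]})
```

How AND and OR mix outside the parentheses depends on the query parser, so it may be worth checking whether status:true still constrains a trailing OR group.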

I also added some additional facets. A working example with facets is running now and can be found at /api/resources/documentation.html

hbarrett commented 4 years ago

Any other improvements and bugs should be noted in separate issues.