lugensa / scorched

Sunburnt offspring solr client
MIT License
27 stars 19 forks source link

Highlighting results at top level of SolrResponse isn't sufficient #41

Closed mlissner closed 7 years ago

mlissner commented 7 years ago

If you do a query in the current version of scorched, you'll get back a SolrReponse object that has a bunch of properties, but the important ones are:

response.result.docs is a list of the first N results you requested.

There's a fantastic feature for pagination and iteration that allows you to iterate a SolrResponse object, so you can do:

for r in response.results:
     print r.my_field

But if you make a query that involves highlighting, this utterly fails. What you want to do is something like:

for r in response.results:
    print r.my_field, r.my_highlighted_field

But the highlighted fields are a separate property on the SolrReponse, and aren't part of the iterated object. This makes it basically impossible to return highlighted results without pre-processing the SolrResponse to merge the highlighting attribute with the docs.

In sunburnt there was code that did exactly this, creating a solr_highlights property on every result document that contained the highlighting for that document:

if result.highlighting:
    for d in result.result.docs:
        # if the unique key for a result doc is present in highlighting,
        # add the highlighting for that document into the result dict
        # (but don't override any existing content)
        # If unique key field is not a string field (eg int) then we need to
        # convert it to its solr representation
        unique_key = self.schema.fields[self.schema.unique_key].to_solr(d[self.schema.unique_key])
        if 'solr_highlights' not in d and \
               unique_key in result.highlighting:
            d['solr_highlights'] = result.highlighting[unique_key]

I think we need something like this or else highlighting is very difficult to use and requires that the calling code do some wonky merging.

I think the easiest place to fix this is in the to_json method of the SolrResponse. Maybe this can be fixed with a constructor, but I haven't looked into that yet.

mlissner commented 7 years ago

Looked at this some more today. A constructor won't work because the constructor only has access to the documents, not to the result object.

Would a PR addressing this be welcome or should I proceed to override the code for my own work?

lujh commented 7 years ago

A PR which adds the highligthed_field to every response object would be nice. Will it raise an AttributeError, in case there is no highligthing?

mlissner commented 7 years ago

Will it raise an AttributeError, in case there is no highligthing?

I think it would raise a KeyError actually, since the highlighting is a dict.

But after looking at this some more today, I ran into what I think is a big issue. To add this information to each document, I need to look it up in the highlighting attribute of the response. The highlighting attribute is a dict and the keys are the ID values for each item. So what I need to do is:

  1. Iterate over each document.
  2. Get its ID value from its ID field.
  3. Look up that value in the highlighting attribute on the response.
  4. Copy that value to the document.

I can't do step 2 because I don't know which field is the ID field. I can assume that it's the field named id if it's there, but that doesn't have to be the case. You can configure any field in Solr to be your unique key.

I don't see a way in the code to learn what the unique key field is in a schema. Is there a way to do this? In the example above you can see that sunburnt had something like this in self.schema.unique_key.

(I've already got tests, and honed in on where to make these changes, but I'm stuck beyond that.)

mlissner commented 7 years ago

Feel free to ignore the long comment above. This is resolved in the PR I just sent.

delijati commented 7 years ago

Done