inveniosoftware / invenio

Invenio digital library framework
https://invenio.readthedocs.io
MIT License

WebSearch: search_unit_in_solr to respect quotes #2668

Open kaplun opened 9 years ago

kaplun commented 9 years ago

Currently, as soon as one uses the fulltext index backed by Solr to search for a word containing a single or double quote, Solr receives an invalid query and the search fails with an HTTPError:

  File "invenio/search_engine.py", line 2379, in search_unit
    return search_unit_in_solr(p, f, m)
  File "invenio/search_engine.py", line 2862, in search_unit_in_solr
    return solr_get_bitset(f, p)
  File "invenio/solrutils_bibindex_searcher.py", line 74, in solr_get_bitset
    u = SOLRUTILS_OPENER.open(invenio_query_url)
  File "python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "python2.6/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "python2.6/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)

This happens because:

def search_unit_in_solr(p, f=None, m=None):
    """
    Query a Solr index and return an intbitset corresponding
    to the result.  Parameters (p,f,m) are usual search unit ones.
    """
    if m and (m == 'a' or m == 'r'): # phrase/regexp query
        if p.startswith('%') and p.endswith('%'):
            p = p[1:-1] # fix for partial phrase
        p = '"' + p + '"'
    return solr_get_bitset(f, p)

doesn't properly handle quotes.
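
One possible direction, as a minimal sketch only: escape the pattern before wrapping it in double quotes. This assumes the failure comes from an unescaped double quote (or backslash) ending the phrase early, and relies on the Lucene/Solr query syntax convention of escaping special characters with a backslash. The names `escape_solr_phrase` and `build_solr_pattern` are hypothetical and are not part of Invenio.

```python
def escape_solr_phrase(p):
    """Backslash-escape backslashes and double quotes in a pattern.

    Hypothetical helper, not an existing Invenio function.
    Backslashes are escaped first so that the backslashes added
    for the double quotes are not escaped a second time.
    """
    return p.replace('\\', '\\\\').replace('"', '\\"')


def build_solr_pattern(p, m=None):
    """Return the pattern string that would be handed to Solr.

    Mirrors the phrase/regexp branch of search_unit_in_solr, but
    escapes the pattern before wrapping it in double quotes.
    """
    if m in ('a', 'r'):  # phrase/regexp query
        if p.startswith('%') and p.endswith('%'):
            p = p[1:-1]  # fix for partial phrase
        p = '"' + escape_solr_phrase(p) + '"'
    return p


print(build_solr_pattern('say "hi"', 'a'))   # "say \"hi\""
print(build_solr_pattern("O'Connell", 'a'))  # "O'Connell"
```

In `search_unit_in_solr`, the result would then be passed on as `solr_get_bitset(f, p)`. Note that this sketch only touches the phrase/regexp branch: a word query (`m` unset) is still passed through unchanged, so a stray double quote there would need separate handling.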

tiborsimko commented 9 years ago

Do you have a concrete use case? I'm asking because the following query:

fulltext:O'Connell

works well on CDS, including snippets.

jalavik commented 9 years ago

When I last investigated this, a long time ago, I recall seeing the issue in advanced search, where one of p1..p3 is a query containing a quote (single or double). If that helps.