Can't search for entities like & or <

GoogleCodeExporter commented 8 years ago

The library makes entities safe when constructing the XML document, like so:

$multivalue = htmlspecialchars($multivalue, ENT_NOQUOTES, 'UTF-8');

However, this means that I can't search for one of the entities. I'm
not sure if this patch will break anything else, but it at least fixes my
keyword searches (using the dismax handler).

However, this might break full Lucene syntax queries that use the symbol &&
in place of "AND". Maybe that's a reasonable tradeoff?

Original issue reported on code.google.com by pwola...@gmail.com on 10 Dec 2009 at 2:20

Attachments:

entities-30-0.patch

GoogleCodeExporter commented 8 years ago

The Solr server will take anything that my _documentToXmlFragment method
encodes with htmlspecialchars and return it to its original state. What is
probably happening is that your source content has HTML entities in it before
it even gets to my conversion, but you want to be able to search on the
characters those entities represent. For that, you should normalize your
source content before giving it to Solr (or perhaps there is a filter that can
do this for you on the Solr side?). One possible method would be to use
html_entity_decode
(http://us3.php.net/manual/en/function.html-entity-decode.php) on any fields
you suspect of having pre-existing entities. However, I can't do this in the
client, because it would be a transform on the data that someone else may not
want (there may actually be someone out there who wants to be able to search
on the full token "&copy;" or similar).
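
A minimal sketch of that normalization step, assuming the source values arrive with HTML entities already in them. The helper name normalizeFieldValue is hypothetical and is not part of the client; html_entity_decode is the standard PHP function linked above.

```php
<?php
// Hypothetical helper (not part of solr-php-client): decode any HTML
// entities already present in a field value before it is indexed.
function normalizeFieldValue($value)
{
    // ENT_QUOTES also decodes &quot; and &#039;. 'UTF-8' is the output charset.
    return html_entity_decode($value, ENT_QUOTES, 'UTF-8');
}

// Assumed sample input: content that was entity-encoded upstream.
$raw = 'Fish &amp; Chips &lt; 5';

// The decoded value is what would then be set on the document field.
echo normalizeFieldValue($raw), "\n"; // prints "Fish & Chips < 5"
```

Doing this in the application rather than in the client keeps the decision with whoever owns the data, which is the concern raised above.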

On my Solr 1.4 instance, I was able to verify that I could find a document
indexed with a field "text" with the value "& < >" by the searches "text:&",
"text:<", and "text:>", just as it seems you want to do. Those characters in
the document's field would have been converted to entities by the client and
then decoded by Solr when indexed. For reference, the text field had the
following type definition in my schema.xml:

    <!-- Less flexible matching, but less false matches.  Probably not ideal for product names,
         but may be good for SKUs.  Can insert dashes in the wrong place and still match. -->
    <fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
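
The round trip described above can be sketched without a Solr server, assuming that any standards-compliant XML parser decodes the escaped characters the same way Solr does at index time. SimpleXML stands in for the server here, and the function names escapeFieldValue and roundTrip are hypothetical.

```php
<?php
// Escape a value the same way the client does when building the document XML.
function escapeFieldValue($value)
{
    return htmlspecialchars($value, ENT_NOQUOTES, 'UTF-8');
}

// Stand-in for the server side: parse the XML and read back the field text.
// SimpleXML is an assumption for illustration, not what Solr uses internally.
function roundTrip($value)
{
    $xml = '<field name="text">' . escapeFieldValue($value) . '</field>';
    $field = simplexml_load_string($xml);
    return (string) $field;
}

// Raw characters survive the round trip unchanged.
var_dump(roundTrip('& < >') === '& < >');

// A value that already contains an entity is escaped a second time, so the
// parser hands back the entity text itself rather than the character.
var_dump(roundTrip('Fish &amp; Chips') === 'Fish &amp; Chips');
```

The second case is the situation suspected here: the index ends up holding the literal text "&amp;", so a search for "&" does not match it.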

So, if what I've explained reflects an accurate understanding of your issue, I
think your problem is best solved either by normalizing your source content so
that it contains no HTML entities before you pass it to Solr, or by using an
appropriate field type filter defined in your schema.xml on the Solr server.

Original comment by donovan....@gmail.com on 10 Dec 2009 at 4:53

GoogleCodeExporter commented 8 years ago

Taking a look at our code - thanks for the pointers

Original comment by pwola...@gmail.com on 10 Dec 2009 at 1:42

GoogleCodeExporter commented 8 years ago

Original comment by donovan....@gmail.com on 8 Feb 2010 at 6:26

Changed state: Done

oceanduan / solr-php-client

Can't search for entities like & or < #30