symphonists / search_index

Search Index provides an easy way to implement high performance fulltext searching on your Symphony site
32 stars 21 forks source link

Problems with multi-byte UTF-8 characters #13

Closed michael-e closed 13 years ago

michael-e commented 13 years ago

The extension does not handle multibyte UTF-8 characters properly.

So I guess the extension needs some utf-encoding/-decoding functions in the right places so that PHP can work with the strings correctly. Unfortunately I don't know enough about the inner workings to fix this myself, but I will do testing! (I really think that this extension is a big step forward for Symphony as a whole.)

nickdunn commented 13 years ago

Thanks mate. I must admit I haven't tested with characters outside of English so I'm not surprised it doesn't work with multibyte chars. Do you have any other examples I can test with? I'll take a look into this next week - I know a few places that are troublesome.

michael-e commented 13 years ago

I will try and select another example. But I am rather sure that the döner will reveal nearly everything! :-)

Do you know what a Döner is? Well, it's a doner. I like the word in German, it sounds so funny!

nilshoerrmann commented 13 years ago

Michael, search for kebap and everything will be fine ;)

nickdunn commented 13 years ago

In English it's a boring "donna" but I'm guessing German is more "durna"?

michael-e commented 13 years ago

Ah, no, the "ö" sounds more funny here. :-) Like in the word "schön".

michael-e commented 13 years ago

http://dict.leo.org/ende?lp=ende&lang=de&searchLoc=0&cmpType=relaxed&sectHdr=on&spellToler=&search=döner

Click on the loudspeaker symbol.

nickdunn commented 13 years ago

Pretty sure I've fixed these in: https://github.com/nickdunn/search_index/commit/68d2b394539f93c7be0d011d749cef262e4bc7a6

All three of your bugs are fixed by the above commit.

To re-test you will need to:

michael-e commented 13 years ago

That looks very good now, Nick.

Well, one thing: The excerpt-length setting does not work anymore.

nickdunn commented 13 years ago

I'm pretty sure this didn't work before either :-/

michael-e commented 13 years ago

Well, at least the excerpts were a lot shorter than they are now. I haven't counted characters. I should have done that. :-)

nickdunn commented 13 years ago

Will raise it as another ticket as it doesn't seem related to the changes I made today. It's more down to the massive chunks of code that I lifted from Drupal 5. I'll see how Drupal 7/8 did it ;-)

michael-e commented 13 years ago

Thanks, Nick.

nickdunn commented 13 years ago

(I really think that this extension is a big step forward for Symphony as a whole

Cheers! Any other feedback or ideas you've got, please let me know.

nickdunn commented 13 years ago

This was caused by the mb string changes. There is a simple fix I have committed locally to rectify the excerpt length bug.

michael-e commented 13 years ago

Glad to hear this. You will push it soon, right?

Can you explain why entites in the excerpt are encoded (as posted here)? Is this an issue or a feature?

nickdunn commented 13 years ago

Glad to hear this. You will push it soon, right?

Pushed.

Can you explain why entites in the excerpt are encoded... Is this an issue or a feature?

Both ;-) The excerpt is run through General::sanitize to HTML-encode things like ampersands that break XML, hence everything is encoded. However this happens after the <strong> tags are wrapped around the highlighted keywords, so these tags also get encoded. The net result is that you have to use an xsl:value-of with disable-output-escaping so that the entities are rendered properly. It works, but is a bug ugly.

I have just changed the implementation so that the string is HTML-encoded and then the <strong> tags are added. This means that you now have to use an xsl:copy-of to get the excerpt (a value-of will work but will just get the plain text, not the strong elements too).

<xsl:copy-of select="excerpt/text() | excerpt/*" />

That itself is a bit messy, so I decided to wrap the excerpt in a <p> tag, so the XSLT becomes:

<xsl:copy-of select="excerpt/*" />

Nice and simple, just like a Markdown-formatted textarea.

michael-e commented 13 years ago

Thanks a lot for the fix! At first I was a little bit amazed at the the length difference from one excerpt to another (in the same search). But then I realized that there is some "intelligent logic" behind this, so the length of an excerpt depends on the position of a match in the string and the number of matches alike.

So: Well Dunn! :-)

I like the new implementation as well. It is clear and straightforward. And in case you don't like the excerpt to be wrapped in the <p> Element, you can still use

<xsl:copy-of select="excerpt/*/*|excerpt/*/text()"/>
nickdunn commented 13 years ago

so the length of an excerpt depends on the position of a match in the string and the number of matches alike

Yep. I did very little beyond modifying Drupal's search_excerpt function which does the hard work. It means, however, that the excerpt doesn't match exactly the length set in the config — it's just a rough length. But I don't ever see a need for it to be completely precise.