project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

When sorting by name, ignore case, punctuation, whitespace, and diacritics #363

Open brent-hartwig opened 1 week ago

brent-hartwig commented 1 week ago

Problem Description: In the context of search results, names that begin with lowercase letters are sorted either after or before all names that begin with uppercase letters, depending on the sort direction. Case should not factor into the sort. We would like to take this further by also ignoring punctuation, whitespace, and diacritics when sorting search results by name.

In the following example, the descending sort returned s, Z, Y when Z, Y, s is desired (that is, with case ignored):

image

Expected Behavior/Solution: Sort names such that case, punctuation, whitespace, and diacritics are ignored.

This may be done by configuring the associated indexes with a different collation. For the anySortNameEn field, the collation should be http://marklogic.com/collation/en/S1/AS/T0000. Per https://docs.marklogic.com/guide/search-dev/encodings_collations, en is for English, S1 declares case and diacritic insensitivity, AS configures the collation to go by the variable top value for variable characters, and T0000 is the variable top value that declares whitespace and punctuation insensitivity.
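To make the pieces of that collation URI easier to see, here is a minimal Python sketch that splits the URI into its language and option segments and attaches the meanings given above. The function name `describe_collation` and the `OPTION_NOTES` table are hypothetical helpers for illustration only; they are not part of MarkLogic, and the sketch assumes the URI includes a language segment.

```python
from urllib.parse import urlparse

# Meanings as described in this issue; not an exhaustive list of collation options.
OPTION_NOTES = {
    "S1": "case and diacritic insensitive",
    "AS": "go by the variable top value for variable characters",
    "T0000": "variable top value declaring whitespace and punctuation insensitivity",
}


def describe_collation(uri: str) -> dict:
    """Split a collation URI into its language and option segments (hypothetical helper)."""
    path = urlparse(uri).path  # e.g. "/collation/en/S1/AS/T0000"
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2 or parts[0] != "collation":
        raise ValueError(f"not a collation URI with a language segment: {uri}")
    language, options = parts[1], parts[2:]
    return {
        "language": language,
        "options": {opt: OPTION_NOTES.get(opt, "not covered in this sketch") for opt in options},
    }


info = describe_collation("http://marklogic.com/collation/en/S1/AS/T0000")
print(info["language"])          # en
for option, note in info["options"].items():
    print(f"{option}: {note}")
```

This only documents the URI; the actual change would still be made in the index configuration.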

TBD whether we can specify other languages when we only have the English language licensing option.

TBD how to handle a multi-language index, such as the anySortName field.

Requirements: List of details required for the completion of the issue or requirements for the feature/bug. This can also include requirements that lie outside of the teams such as new design docs or clarification from an outside source.

Needed for promotion: If an item on the list is not needed, it should be crossed off but not removed.

~- [ ] Wireframe/Mockup - Mike~

UAT/LUX Examples:

Dependencies/Blocks:

Related Github Issues:

Related links:

Wireframe/Mockup: N/A

brent-hartwig commented 1 week ago

@prowns and @roamye, please note that while I found this in the context of semantic sort (not yet in production), this issue is present in production. One just needs to sort by Title/Label. Here's the same search with different sort directions:

  1. Brown chair items, descending: First item starts with lowercase "f" and is followed by a result that starts with an uppercase "W".
  2. Brown chair items, ascending: First item starts with punctuation but should be sorted with the Hs.

roamye commented 10 hours ago

@brent-hartwig - I want to clarify the second example.

For all items that begin with punctuation (such as " . or \), they should be sorted by their first letter or digit. So,
image should be treated as Hopeless Cases....... Is that correct?

brent-hartwig commented 8 hours ago

@roamye, as a user, when sorting by title, I would prefer results to be sorted by their first alphanumeric character. As such, I'd expect to find "Hopeless Cases.... in the H's. Likewise, I'd want h'ordourves sorted as hor and thus, in ascending order, after the hop of "Hopeless Cases.... (case insensitive). But, these are just my thoughts.
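The ordering described in this thread can be approximated outside MarkLogic with a small Python sketch. This is only an illustration of the desired behavior, not how a collation is implemented: the hypothetical `sort_key` function decomposes each name, drops everything that is not a letter or digit (punctuation, whitespace, and combining diacritics), and case-folds what remains.

```python
import unicodedata


def sort_key(name: str) -> str:
    """Approximate a sort key that ignores case, punctuation, whitespace, and diacritics."""
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    # Keep only letters and digits; combining marks, punctuation, and spaces are dropped.
    kept = [ch for ch in decomposed if ch.isalnum()]
    return "".join(kept).casefold()


titles = ["h'ordourves", '"Hopeless Cases....', "Zebra"]
print(sorted(titles, key=sort_key))
# ['"Hopeless Cases....', "h'ordourves", 'Zebra']

print(sorted(["s", "Z", "Y"], key=sort_key, reverse=True))
# ['Z', 'Y', 's']
```

A real collation applies more rules than this (language-specific ordering, for example), so the sketch is best read as a quick way to check expectations against sample titles.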