trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0
10.5k stars, 3.02k forks

lower() function does not encode Sigma correctly #24229

Open Jason-Waldrop opened 2 days ago

Jason-Waldrop commented 2 days ago

https://en.wikipedia.org/wiki/Sigma

Sigma: uppercase Σ, lowercase σ, lowercase in word-final position ς.

Trino currently converts every "Σ" to "σ", even in word-final position, where "ς" is expected.

select
    a,
    lower(a),
    lower(a) = 'νεστορας βλσχος i.k.e.',   -- currently false, should be true
    LOWER(regexp_replace(a, 'Σ\b', 'ς')),
    LOWER(regexp_replace(a, 'Σ\b', 'ς')) = 'νεστορας βλσχος i.k.e.'  -- will be true
from (values('ΝΕΣΤΟΡΑΣ ΒΛΣΧΟΣ I.K.E.')) as t(a)

This can be used as a quick fix:

LOWER(regexp_replace(lower_me_col, 'Σ\b', 'ς'))
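
For comparison, here is a rough Python sketch of the same workaround. It is not Trino code: the function name is made up for illustration, and Python's `\b` may not treat word boundaries exactly the way Trino's regex engine does. It also shows one limitation of the regex approach: a standalone "Σ" becomes "ς", whereas Python's built-in context-aware `str.lower()` keeps it as "σ".

```python
import re


def lower_with_final_sigma_workaround(text: str) -> str:
    """Hypothetical helper mirroring LOWER(regexp_replace(col, 'Σ\\b', 'ς'))."""
    # Step 1: turn a capital sigma that ends a word into a final sigma.
    replaced = re.sub(r"Σ\b", "ς", text)
    # Step 2: lowercase each character on its own, without looking at context.
    return "".join(ch.lower() for ch in replaced)


name = "ΝΕΣΤΟΡΑΣ ΒΛΣΧΟΣ I.K.E."
print(lower_with_final_sigma_workaround(name))  # νεστορας βλσχος i.k.e.

# Limitation: a lone capital sigma is also rewritten to the final form.
print(lower_with_final_sigma_workaround("Σ"))   # ς
print("Σ".lower())                              # σ
```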

wendigo commented 2 days ago

@martint can you confirm that this is expected behaviour?

Converts slice to lower case code point by code point. This method does not perform locale-sensitive, context-sensitive, or one-to-many mappings required for some languages. Specifically, this will return incorrect results for Lithuanian, Turkish, and Azeri.
Note: Invalid UTF-8 sequences are copied directly to the output.
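
To illustrate the difference between code-point-by-code-point and context-sensitive lowercasing, here is a small Python sketch. It is only an analogy, not the Trino implementation: `lower_per_code_point` is a made-up helper, and Python's per-character `str.lower()` can still apply some one-to-many mappings that the method described above does not. Python's whole-string `str.lower()` applies the Unicode final-sigma rule, so it produces the result the issue expects.

```python
def lower_per_code_point(text: str) -> str:
    """Hypothetical helper: lowercase every character in isolation."""
    # Each character is converted without seeing its neighbours, so a
    # capital sigma always maps to the non-final form.
    return "".join(ch.lower() for ch in text)


name = "ΝΕΣΤΟΡΑΣ ΒΛΣΧΟΣ I.K.E."

per_code_point = lower_per_code_point(name)
context_aware = name.lower()  # applies the final-sigma rule

print(per_code_point)  # νεστορασ βλσχοσ i.k.e.
print(context_aware)   # νεστορας βλσχος i.k.e.
```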