hpcc-systems / DataPatterns

HPCC Systems ECL bundle that provides some basic data profiling and research tools to an ECL programmer

Profile: Popular and rare patterns have problems with non-ASCII input #80

Status: Closed (dcamper closed this issue 11 months ago)

dcamper commented 11 months ago

The Profile code uses the regex character classes [[:upper:]] and [[:lower:]] to test characters, but not all Unicode alphabetic characters match either class. Those characters are therefore passed through as-is. The problem is that the pattern field itself is a STRING, so HPCC coerces those passed-through characters and you wind up with something different (a valid coerced character, an "I don't know" substitution character, etc.).

One solution is to convert the pattern field to UTF8. Another is to find a way to map those types of Unicode characters better.
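As a sketch of the second option only, and not necessarily what the eventual fix does, the Python snippet below classifies characters by Unicode general category instead of by upper/lower tests. Treating every other letter category (titlecase, modifier, caseless) as `a` is an arbitrary choice made for illustration, as are the `A`/`a`/`9` symbols and the name `to_pattern_unicode`.

```python
import unicodedata


def to_pattern_unicode(text: str) -> str:
    """Map characters to pattern symbols using Unicode general categories."""
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat == "Lu":
            out.append("A")      # uppercase letter
        elif cat == "Ll":
            out.append("a")      # lowercase letter
        elif cat.startswith("L"):
            out.append("a")      # hypothetical choice: Lt, Lm, Lo treated as 'a'
        elif cat == "Nd":
            out.append("9")      # decimal digit
        else:
            out.append(ch)       # punctuation, spaces, etc. pass through
    return "".join(out)


pattern = to_pattern_unicode("Ab\u4e2d1")
print(pattern)                   # Aaa9
print(pattern.encode("ascii"))   # b'Aaa9'
```

With every letter mapped to an ASCII symbol, the pattern for this input fits in a single-byte string without substitution. Non-ASCII characters outside the letter and digit categories would still pass through, so this narrows the problem rather than removing it.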

dcamper commented 11 months ago

Fixed in release v1.9.3.