Open revans2 opened 1 year ago
For reference here is a list of the suspect characters: https://www.compart.com/en/unicode/category/Lt
It seems that cudf::strings::capitalize() should handle these by default (no special option) if possible.
Great to hear that CUDF will do it by default. I am a little concerned because ß is the one that bit us in our testing, but it does not show up in https://www.compart.com/en/unicode/category/Lt
So the ß character looks to be a separate special case. The upper-case of ß is actually SS (two capital S's), which the code already supports:
>>> import cudf
>>> s = 'ßeta'
>>> s.upper()
'SSETA'
>>> gs = cudf.Series([s])
>>> gs.str.upper()
0 SSETA
But it looks like when capitalizing ß the second S is not upper-cased in Python:
>>> s.capitalize()
'Sseta'
>>> gs.str.capitalize()
0 SSeta
I've not been able to find documentation on this behavior, so I would be curious to know what Spark expects when capitalizing ß.
I did a quick test with the capitalize() function from org.apache.commons.lang3.StringUtils and got a different result as well. Also, the upperCase() and String.toUpperCase() functions both return SSETA.
val df = Seq("ßeta", "Sseta").toDF
df.selectExpr("value", "upper(value)", "lower(value)", "initcap(value)", "lower(upper(value))").show()
+-----+------------+------------+--------------+-------------------+
|value|upper(value)|lower(value)|initcap(value)|lower(upper(value))|
+-----+------------+------------+--------------+-------------------+
| ßeta| SSETA| ßeta| ßeta| sseta|
|Sseta| SSETA| sseta| Sseta| sseta|
+-----+------------+------------+--------------+-------------------+
I hope that this helps. Strings in Spark are kind of special as they wrote their own UTF8String implementation: upper is UTF8String.toUpperCase, lower is UTF8String.toLowerCase, and initcap is UTF8String.toLowerCase.toTitleCase.
The initcap() result appears to match the results I see with org.apache.commons.lang3.StringUtils.capitalize(); both just pass the ß character through unchanged.
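To make the pass-through behavior concrete, here is a rough Python sketch of what an initcap built as toLowerCase followed by a per-character title-case step might look like. It is not Spark's code: `simple_titlecase` and `initcap_like` are hypothetical names, the single-character check is only an approximation of a one-to-one mapping such as Java's Character.toTitleCase, and splitting words on a plain space is an assumption.

```python
def simple_titlecase(ch: str) -> str:
    """Approximate a one-to-one titlecase mapping (assumption, not Spark code).

    Python's str.title() applies the full Unicode mapping, which may expand
    one character into several (e.g. 'ß' -> 'Ss'). A per-character mapping
    cannot expand, so in that case the character is returned unchanged.
    """
    titled = ch.title()
    return titled if len(titled) == 1 else ch


def initcap_like(s: str) -> str:
    """Lower-case the string, then title-case the first character of each word.

    Words are assumed to be separated by a single space character.
    """
    out = []
    start_of_word = True
    for ch in s.lower():
        out.append(simple_titlecase(ch) if start_of_word else ch)
        start_of_word = (ch == " ")
    return "".join(out)


if __name__ == "__main__":
    for value in ["ßeta", "Sseta"]:
        print(value, "->", initcap_like(value))
```

Under these assumptions the sketch leaves ßeta as ßeta and Sseta as Sseta, which lines up with the initcap(value) column in the Spark output above.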
I found a few more characters that are not part of the titlecase Unicode definition and behave like ß:
ß (223) -> SS (83,83) : Ss (83,115)
և (1415) -> ԵՒ (1333,1362) : Եւ (1333,1410)
ﬀ (64256) -> FF (70,70) : Ff (70,102)
ﬁ (64257) -> FI (70,73) : Fi (70,105)
ﬂ (64258) -> FL (70,76) : Fl (70,108)
ﬃ (64259) -> FFI (70,70,73) : Ffi (70,102,105)
ﬄ (64260) -> FFL (70,70,76) : Ffl (70,102,108)
ﬅ (64261) -> ST (83,84) : St (83,116)
ﬆ (64262) -> ST (83,84) : St (83,116)
ﬓ (64275) -> ՄՆ (1348,1350) : Մն (1348,1398)
ﬔ (64276) -> ՄԵ (1348,1333) : Մե (1348,1381)
ﬕ (64277) -> ՄԻ (1348,1339) : Մի (1348,1387)
ﬖ (64278) -> ՎՆ (1358,1350) : Վն (1358,1398)
ﬗ (64279) -> ՄԽ (1348,1341) : Մխ (1348,1389)
The Python (and Pandas) output for capitalize() (which also matches title()) is included above after the colon. Generally, the characters after the first one in the multi-character output of upper() are lower-cased for capitalize() (and title()).
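For anyone who wants to reproduce rows of the list above, a minimal sketch in plain Python (no cudf required) follows. The helper name `case_report` is hypothetical, and it assumes Python 3.8 or later, where str.capitalize() title-cases the first character rather than upper-casing it.

```python
def code_points(s: str) -> str:
    """Render a string as comma-separated decimal code points."""
    return ",".join(str(ord(c)) for c in s)


def case_report(code_point: int) -> str:
    """Format one row as: char (cp) -> upper (cps) : capitalize (cps)."""
    ch = chr(code_point)
    upper = ch.upper()       # full uppercase mapping, may be multi-character
    cap = ch.capitalize()    # titlecase of the first character (Python 3.8+)
    return (f"{ch} ({code_point}) -> {upper} ({code_points(upper)})"
            f" : {cap} ({code_points(cap)})")


if __name__ == "__main__":
    # A few of the code points from the list above.
    for cp in (223, 64256, 64257):
        print(case_report(cp))
```

The same loop can be run over any range of code points to look for further characters whose upper() result has more than one character.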
But all of these pass through unchanged with org.apache.commons.lang3.StringUtils.capitalize(), so I suspect the same pass-through result from initcap() for these as well.
Regardless, the libcudf result matches neither and so the inclination is to fix it to match the Python/Pandas result.
I was also able to verify that the C++ Boost Locale library supports these characters and matches the Python results as well. The boost::locale class is implemented using the ICU library, which provides a rich set of globalization functions for software applications.
Sorry I have not been following this as closely as I should.
@davidwendt so the proposal is to make the CUDF code match python/pandas, but not Spark?
@sameerz if that is true then we will need to write a custom kernel for initcap for Spark.
Just FYI: From a Spark perspective I found 265 characters that produce different values between the CPU implementation and the GPU one. Their code points are:
(223, 304, 329, 452, 454, 455, 457, 458, 460, 496, 497, 499, 604, 609, 618, 620, 642, 647, 669, 670, 912, 944, 1011, 1012, 1321, 1323, 1325, 1327, 1415, 4304, 4305, 4306, 4307, 4308, 4309, 4310, 4311, 4312, 4313, 4314, 4315, 4316, 4317, 4318, 4319, 4320, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4330, 4331, 4332, 4333, 4334, 4335, 4336, 4337, 4338, 4339, 4340, 4341, 4342, 4343, 4344, 4345, 4346, 4349, 4350, 4351, 5112, 5113, 5114, 5115, 5116, 5117, 7296, 7297, 7298, 7299, 7300, 7301, 7302, 7303, 7304, 7566, 7830, 7831, 7832, 7833, 7834, 7838, 8016, 8018, 8020, 8022, 8064, 8065, 8066, 8067, 8068, 8069, 8070, 8071, 8080, 8081, 8082, 8083, 8084, 8085, 8086, 8087, 8096, 8097, 8098, 8099, 8100, 8101, 8102, 8103, 8114, 8115, 8116, 8118, 8119, 8130, 8131, 8132, 8134, 8135, 8146, 8147, 8150, 8151, 8162, 8163, 8164, 8166, 8167, 8178, 8179, 8180, 8182, 8183, 8486, 8490, 8491, 42649, 42651, 42900, 42903, 42905, 42907, 42909, 42911, 42933, 42935, 42937, 42939, 42941, 42943, 42947, 43859, 43888, 43889, 43890, 43891, 43892, 43893, 43894, 43895, 43896, 43897, 43898, 43899, 43900, 43901, 43902, 43903, 43904, 43905, 43906, 43907, 43908, 43909, 43910, 43911, 43912, 43913, 43914, 43915, 43916, 43917, 43918, 43919, 43920, 43921, 43922, 43923, 43924, 43925, 43926, 43927, 43928, 43929, 43930, 43931, 43932, 43933, 43934, 43935, 43936, 43937, 43938, 43939, 43940, 43941, 43942, 43943, 43944, 43945, 43946, 43947, 43948, 43949, 43950, 43951, 43952, 43953, 43954, 43955, 43956, 43957, 43958, 43959, 43960, 43961, 43962, 43963, 43964, 43965, 43966, 43967, 64256, 64257, 64258, 64259, 64260, 64261, 64262, 64265, 64266, 64267, 64268, 64269, 64275, 64276, 64277, 64278, 64279)
Is your feature request related to a problem? Please describe. Spark has a method called initcap. We implemented this using strings::capitalize, but recently ran into some problems because the first letter it uses is not an uppercase letter; it is a title case letter.
https://unicode.org/faq/casemap_charprop.html#4
Most of the time they are the same, but there are a few cases where they are not, and ß is one of them. I would love an option for capitalize that uses title case instead of upper case. A separate initcap function that uses title case would also be great.
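As a small illustration of the upper case versus title case distinction, the sketch below uses Python's built-in string methods and the standard unicodedata module on the digraph ǆ (code point 454). The helper name `describe` is hypothetical; the example only demonstrates the Unicode property and says nothing about how cudf or Spark implement it.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the character with its code point and Unicode general category."""
    return f"{ch} ({ord(ch)}, {unicodedata.category(ch)})"


lower_dz = chr(454)            # LATIN SMALL LETTER DZ WITH CARON
upper_dz = lower_dz.upper()    # uppercase form, category Lu
title_dz = lower_dz.title()    # titlecase form, category Lt

if __name__ == "__main__":
    print("lower:", describe(lower_dz))
    print("upper:", describe(upper_dz))
    print("title:", describe(title_dz))
```

Here the uppercase and titlecase forms are different single characters (452 and 453), whereas for ß the two mappings differ in their second character (SS versus Ss).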