rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Optionally support titlecase for capitalize #14144

Open revans2 opened 1 year ago

revans2 commented 1 year ago

Is your feature request related to a problem? Please describe. Spark has a method called initcap. We implemented this using strings::capitalize, but recently ran into some problems because the first letter Spark produces is not an uppercase letter; it is a title case letter.

https://unicode.org/faq/casemap_charprop.html#4

Most of the time they are the same, but there are a few cases where they are not, and ß is one of them. I would love an option for capitalize that uses title case instead of upper case. A separate initcap function that uses title case would also be great.
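To make the upper case versus title case distinction concrete, here is a minimal Python sketch using only the standard library. The character U+01C6 is an illustrative choice and does not come from the report above; it is one of the digraphs that has separate uppercase and titlecase forms.

```python
import unicodedata

# U+01C6 (the dž digraph) has distinct uppercase and titlecase forms.
ch = "\u01c6"
upper = ch.upper()   # U+01C4: both letters of the digraph capitalized
title = ch.title()   # U+01C5: only the first letter capitalized

for label, c in (("lower", ch), ("upper", upper), ("title", title)):
    print(label, f"U+{ord(c):04X}", unicodedata.category(c), unicodedata.name(c))
```

The titlecase form carries the Unicode general category Lt, while the uppercase form is Lu.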

davidwendt commented 1 year ago

For reference, here is a list of the suspect characters: https://www.compart.com/en/unicode/category/Lt It seems that cudf::strings::capitalize() should handle these by default (no special option) if possible.

revans2 commented 1 year ago

Great to hear that CUDF will do it by default. I am a little concerned because ß is the one that bit us in our testing, but it does not show up in https://www.compart.com/en/unicode/category/Lt
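The observation that ß is missing from the Lt list can be checked locally. The sketch below enumerates the Lt category from Python's bundled Unicode database, which may be a different Unicode version than the one behind the linked page, so the count is printed rather than assumed.

```python
import sys
import unicodedata

# Collect every code point whose general category is Lt (titlecase letter).
titlecase_letters = [
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lt"
]

print(len(titlecase_letters), "Lt characters in this Python's Unicode database")
print("ß category:", unicodedata.category("ß"))   # Ll, a lowercase letter
print("ß in Lt list:", "ß" in titlecase_letters)
```

ß is categorized as a lowercase letter, so a fix that only special-cases the Lt category would not cover it.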

davidwendt commented 10 months ago

So the ß character looks to be a separate special case. The upper-case of ß is actually SS (two capital S's) which the code already supports:

>>> import cudf
>>> s = 'ßeta'
>>> s.upper()
'SSETA'
>>> gs = cudf.Series([s])
>>> gs.str.upper()
0    SSETA

But it looks like when capitalizing ß the second S is not upper-cased in Python:

>>> s.capitalize()
'Sseta'
>>> gs.str.capitalize()
0    SSeta

I've not been able to find documentation on this behavior, so I would be curious to know what Spark expects when capitalizing ß. I did a quick test with the capitalize() function from org.apache.commons.lang3.StringUtils and got a different result as well. Also, the upperCase() and String.toUpperCase() functions both return SSETA.
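The Python side of the comparison above can be reproduced without a GPU. As far as documentation goes, the Python reference for str.capitalize notes that since version 3.8 the first character is put into titlecase rather than uppercase, which is consistent with the 'Sseta' result. A minimal sketch:

```python
s = "ßeta"

upper = s.upper()             # full case mapping expands ß to two letters
capitalized = s.capitalize()  # Python 3.8+ title-cases the first character
titled = s.title()

print(upper, capitalized, titled)
# Code points of the expansion for ß alone, upper versus capitalize.
print([ord(c) for c in "ß".upper()], [ord(c) for c in "ß".capitalize()])
```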

revans2 commented 10 months ago

val df = Seq("ßeta", "Sseta").toDF
df.selectExpr("value", "upper(value)", "lower(value)", "initcap(value)", "lower(upper(value))").show()
+-----+------------+------------+--------------+-------------------+
|value|upper(value)|lower(value)|initcap(value)|lower(upper(value))|
+-----+------------+------------+--------------+-------------------+
| ßeta|       SSETA|        ßeta|          ßeta|              sseta|
|Sseta|       SSETA|       sseta|         Sseta|              sseta|
+-----+------------+------------+--------------+-------------------+

I hope that this helps. Strings in Spark are kind of special, as they wrote their own UTF8String implementation: upper is UTF8String.toUpperCase, lower is UTF8String.toLowerCase, and initcap is UTF8String.toLowerCase.toTitleCase.
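For readers without a Spark session, the lower-then-titlecase behavior in the table can be approximated in Python. This is only a sketch under two assumptions that are not confirmed by the thread: that the titlecase step maps one character to one character (so a character whose title mapping would expand is left alone), and that words are separated by single spaces. The function name initcap_like is hypothetical.

```python
def initcap_like(text: str) -> str:
    """Hypothetical approximation: lower-case, then title-case each word's first character."""
    words = text.lower().split(" ")
    out = []
    for word in words:
        if word:
            first = word[0]
            mapped = first.title()
            # Assumption: keep the original character when the mapping would expand it.
            if len(mapped) != 1:
                mapped = first
            word = mapped + word[1:]
        out.append(word)
    return " ".join(out)

for value in ("ßeta", "Sseta"):
    print(value, "->", initcap_like(value))
```

For the two inputs in the table, this reproduces the initcap(value) column; it is not a substitute for testing against Spark itself.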

davidwendt commented 10 months ago

The initcap() output appears to match the results I see with org.apache.commons.lang3.StringUtils.capitalize(); both just pass the ß character through unchanged.

I found a few more characters that are not part of the titlecase Unicode definition and behave like ß:

ß   (223) -> SS (83,83)     : Ss (83,115)
և  (1415) -> ԵՒ (1333,1362) : Եւ (1333,1410)
ﬀ (64256) -> FF (70,70)     : Ff (70,102)
ﬁ (64257) -> FI (70,73)     : Fi (70,105)
ﬂ (64258) -> FL (70,76)     : Fl (70,108)
ﬃ (64259) -> FFI (70,70,73) : Ffi (70,102,105)
ﬄ (64260) -> FFL (70,70,76) : Ffl (70,102,108)
ﬅ (64261) -> ST (83,84)     : St (83,116)
ﬆ (64262) -> ST (83,84)     : St (83,116)
ﬓ (64275) -> ՄՆ (1348,1350) : Մն (1348,1398)
ﬔ (64276) -> ՄԵ (1348,1333) : Մե (1348,1381)
ﬕ (64277) -> ՄԻ (1348,1339) : Մի (1348,1387)
ﬖ (64278) -> ՎՆ (1358,1350) : Վն (1358,1398)
ﬗ (64279) -> ՄԽ (1348,1341) : Մխ (1348,1389)

The Python (and Pandas) output for capitalize() (which also matches title()) is included above after the colon. Generally, where upper() produces multiple characters, capitalize() (and title()) lower-case every character after the first.

But all of these pass through unchanged with org.apache.commons.lang3.StringUtils.capitalize(), so I suspect initcap() passes them through as well.

Regardless, the libcudf result matches neither, so the inclination is to fix it to match the Python/Pandas result. I was also able to verify that the C++ Boost Locale library supports these characters and matches the Python results as well. The boost::locale class is implemented using the ICU library, which provides a rich set of globalization functions for software applications.
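The table of expanding characters earlier in this comment can be regenerated in Python with a short sweep. The sketch below is an assumption about how such a list might be built, not the method actually used: it scans only the Basic Multilingual Plane and keeps every character whose upper() result is longer than one character, which is a broader filter than the table and so returns additional characters.

```python
def expanding_characters(limit=0x10000):
    """Yield (char, upper, capitalized) where upper() yields more than one character."""
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:   # skip surrogate code points
            continue
        ch = chr(cp)
        up = ch.upper()
        if len(up) > 1:
            yield ch, up, ch.capitalize()

def code_points(s):
    """Return the code points of s as a tuple of ints."""
    return tuple(ord(c) for c in s)

rows = {ch: (up, cap) for ch, up, cap in expanding_characters()}
for ch in ("ß", "\u0587", "\ufb01"):
    up, cap = rows[ch]
    print(ord(ch), code_points(up), code_points(cap))
```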

revans2 commented 9 months ago

Sorry I have not been following this as closely as I should.

@davidwendt so the proposal is to make the CUDF code match Python/Pandas, but not Spark?

@sameerz if that is true then we will need to write a custom kernel for initcap for Spark.

revans2 commented 9 months ago

Just FYI: from a Spark perspective, I found 265 characters that produce different values between the CPU implementation and the GPU one. Their code points are:

(223, 304, 329, 452, 454, 455, 457, 458, 460, 496, 497, 499, 604, 609, 618, 620, 642, 647, 669, 670, 912, 944, 1011, 1012, 1321, 1323, 1325, 1327, 1415, 4304, 4305, 4306, 4307, 4308, 4309, 4310, 4311, 4312, 4313, 4314, 4315, 4316, 4317, 4318, 4319, 4320, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4330, 4331, 4332, 4333, 4334, 4335, 4336, 4337, 4338, 4339, 4340, 4341, 4342, 4343, 4344, 4345, 4346, 4349, 4350, 4351, 5112, 5113, 5114, 5115, 5116, 5117, 7296, 7297, 7298, 7299, 7300, 7301, 7302, 7303, 7304, 7566, 7830, 7831, 7832, 7833, 7834, 7838, 8016, 8018, 8020, 8022, 8064, 8065, 8066, 8067, 8068, 8069, 8070, 8071, 8080, 8081, 8082, 8083, 8084, 8085, 8086, 8087, 8096, 8097, 8098, 8099, 8100, 8101, 8102, 8103, 8114, 8115, 8116, 8118, 8119, 8130, 8131, 8132, 8134, 8135, 8146, 8147, 8150, 8151, 8162, 8163, 8164, 8166, 8167, 8178, 8179, 8180, 8182, 8183, 8486, 8490, 8491, 42649, 42651, 42900, 42903, 42905, 42907, 42909, 42911, 42933, 42935, 42937, 42939, 42941, 42943, 42947, 43859, 43888, 43889, 43890, 43891, 43892, 43893, 43894, 43895, 43896, 43897, 43898, 43899, 43900, 43901, 43902, 43903, 43904, 43905, 43906, 43907, 43908, 43909, 43910, 43911, 43912, 43913, 43914, 43915, 43916, 43917, 43918, 43919, 43920, 43921, 43922, 43923, 43924, 43925, 43926, 43927, 43928, 43929, 43930, 43931, 43932, 43933, 43934, 43935, 43936, 43937, 43938, 43939, 43940, 43941, 43942, 43943, 43944, 43945, 43946, 43947, 43948, 43949, 43950, 43951, 43952, 43953, 43954, 43955, 43956, 43957, 43958, 43959, 43960, 43961, 43962, 43963, 43964, 43965, 43966, 43967, 64256, 64257, 64258, 64259, 64260, 64261, 64262, 64265, 64266, 64267, 64268, 64269, 64275, 64276, 64277, 64278, 64279)
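A list like this is typically produced by sweeping code points and comparing two implementations. The sketch below shows the shape of such a harness in Python. Both functions are hypothetical stand-ins, not the Spark CPU path or the libcudf GPU path, so the result is not expected to equal the 265 code points above.

```python
def upper_first(ch: str) -> str:
    # Stand-in A (hypothetical): full uppercase mapping of the character.
    return ch.upper()

def simple_title(ch: str) -> str:
    # Stand-in B (hypothetical): titlecase only when the result is a single character.
    mapped = ch.title()
    return mapped if len(mapped) == 1 else ch

def differing_code_points(f, g, limit=0x10000):
    """Return code points below `limit` (surrogates excluded) where f and g disagree."""
    return [
        cp for cp in range(limit)
        if not 0xD800 <= cp <= 0xDFFF and f(chr(cp)) != g(chr(cp))
    ]

diffs = differing_code_points(upper_first, simple_title)
print(len(diffs), "differences between the two stand-ins")
```

To compare real implementations, the two stand-ins would be replaced by calls into the CPU and GPU code paths under test.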