lmocsi opened 9 months ago
I think it returns the byte offset. I am guessing that the accented characters take two bytes in UTF-8.
I think we should update the docs to make clear that we return byte offsets.
You don't have to guess:
# First 22 characters.
In [11]: '3612030701 árvíztűrő! 3612030701 árvíztűrő!'[:22]
Out[11]: '3612030701 árvíztűrő! '
# First 26 bytes.
In [12]: '3612030701 árvíztűrő! 3612030701 árvíztűrő!'.encode('utf-8')[:26]
Out[12]: b'3612030701 \xc3\xa1rv\xc3\xadzt\xc5\xb1r\xc5\x91! '
# First 26 bytes converted back to utf-8.
In [13]: '3612030701 árvíztűrő! 3612030701 árvíztűrő!'.encode('utf-8')[:26].decode('utf-8')
Out[13]: '3612030701 árvíztűrő! '
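The same thing can be checked on the Polars side. Here is a minimal sketch (it reuses the slice/find pattern from the snippets below, and assumes str.len_bytes and str.len_chars are available in your Polars version) showing that the 26 reported by str.find matches the byte length, not the character length, of the 22-character prefix:

import polars as pl

df = pl.DataFrame({'text': ['3612030701 árvíztűrő! 3612030701 árvíztűrő!']})
df.select(
    # offset of the second occurrence of the 5-character prefix, as reported by str.find
    byte_idx=pl.col('text').str.slice(1).str.find(pl.col('text').str.slice(0, 5), literal=True) + 1,
    # the 22-character prefix is 26 bytes long because of the accented letters
    prefix_bytes=pl.col('text').str.slice(0, 22).str.len_bytes(),
    prefix_chars=pl.col('text').str.slice(0, 22).str.len_chars(),
)
# byte_idx = 26, prefix_bytes = 26, prefix_chars = 22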
So if I'd like to remove repetitions from a string, then something like this should happen:
import polars as pl
a = pl.DataFrame({'text': ['3612030701 árvíztűrő! 3612030701 árvíztűrő!','3612030701 arvizturo! 3612030701 arvizturo!']})
(a.with_columns(idx = (pl.col('text').str.slice(1).str.find(pl.col('text').str.slice(0,5), literal=True)+1).fill_null(0))
.with_columns(text2 = pl.col('text').str.slice(0,pl.col('idx')-1))
).rows()
Results:
[('3612030701 árvíztűrő! 3612030701 árvíztűrő!',
26,
'3612030701 árvíztűrő! 361'),
('3612030701 arvizturo! 3612030701 arvizturo!', 22, '3612030701 arvizturo!')]
For non-accented characters this works as expected. For accented characters, the number of accented characters (b in the example below) would have to be subtracted from idx, something like this (going back to plain Python):
a='3612030701 árvíztűrő!'.lower()
b = a.count('á')+a.count('é')+a.count('í')+a.count('ó')+a.count('ö')+a.count('ő')+a.count('ú')+a.count('ü')+a.count('ű')
print(b)
Results:
4
Is there something like this str.count() in Polars?
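For counting on the Polars side, one option (a sketch, assuming str.len_bytes, str.len_chars and str.count_matches exist in your Polars version) is either the byte/character length difference or a direct character-class count, both of which give 4 here:

import polars as pl

s = pl.DataFrame({'text': ['3612030701 árvíztűrő!']})
s.select(
    # every accented letter here takes 2 bytes in UTF-8, so the extra bytes
    # equal the number of accented characters
    extra_bytes=pl.col('text').str.len_bytes() - pl.col('text').str.len_chars(),
    # or count them directly; extend the class (or lowercase first) if
    # uppercase accented letters can occur
    accented=pl.col('text').str.count_matches(r'[áéíóöőúüű]'),
)
# extra_bytes = 4, accented = 4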
On the other hand, re.search() returns the starting position in characters, not bytes:
import re
a = '3612030701 árvíztűrő! 3612030701 árvíztűrő!'
x = re.search(r'36120', a[1:])
print(x.start()+1)
Results:
22
Is such support planned? Maybe an option on str.find(), something like characters=True?
Ugly workaround solutions can be applied like this:
import polars as pl
a = pl.DataFrame({'text': ['3612030701 árvíztűrő! 3612030701 árvíztűrő!','3612030701 arvizturo! 3612030701 arvizturo!']})
(a.with_columns(idx = (pl.col('text').str.slice(1).str.find(pl.col('text').str.slice(0,5), literal=True)+1).fill_null(0))
.with_columns(text2 = pl.col('text').str.slice(0, pl.col('idx')-1)
    # drop as many trailing characters as there are accented letters,
    # compensating for the byte-based offset
    .map_elements(lambda x: x[:len(x) - sum(x.lower().count(c) for c in 'áéíóöőúüű')]))
).rows()
Results:
[('3612030701 árvíztűrő! 3612030701 árvíztűrő!', 26, '3612030701 árvíztűrő!'),
('3612030701 arvizturo! 3612030701 arvizturo!', 22, '3612030701 arvizturo!')]
Here the idx is incorrect, but the end result (text2 field) is correct.
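A simpler, though still row-by-row, alternative sketch is to get character offsets directly from Python's character-based str.find via map_elements (this is only a workaround assumption, not a built-in characters=True option):

import polars as pl

a = pl.DataFrame({'text': ['3612030701 árvíztűrő! 3612030701 árvíztűrő!',
                           '3612030701 arvizturo! 3612030701 arvizturo!']})
a.with_columns(
    # Python's str.find counts characters, so this yields 22 for both rows
    idx_chars=pl.col('text').map_elements(
        lambda s: s.find(s[:5], 1),
        return_dtype=pl.Int64,
    )
)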
If str.find() returns results in bytes, which function slices by bytes (the way str.slice() does by characters)?
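I am not aware of a byte-based string slice expression; as a sketch (assuming the byte positions you slice at fall on character boundaries, which they do in this example), you can round-trip through UTF-8 in map_elements:

import polars as pl

a = pl.DataFrame({'text': ['3612030701 árvíztűrő! 3612030701 árvíztűrő!',
                           '3612030701 arvizturo! 3612030701 arvizturo!']})
(a.with_columns(idx = (pl.col('text').str.slice(1).str.find(pl.col('text').str.slice(0,5), literal=True)+1).fill_null(0))
  .with_columns(text2 = pl.struct('text', 'idx').map_elements(
      # take the first idx-1 *bytes* of the UTF-8 encoding, then decode back;
      # rows where nothing was found (idx == 0) come back empty in this sketch
      lambda r: r['text'].encode('utf-8')[:max(r['idx'] - 1, 0)].decode('utf-8'),
      return_dtype=pl.Utf8,
  ))
).rows()
# [('3612030701 árvíztűrő! 3612030701 árvíztűrő!', 26, '3612030701 árvíztűrő!'),
#  ('3612030701 arvizturo! 3612030701 arvizturo!', 22, '3612030701 arvizturo!')]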
Is it possible to do this with .str.extract()?
pl.Config(fmt_str_lengths=120)
df.with_columns(dedupe =
pl.col('text').str.extract(
pl.format('({}.*?){}.*',
pl.col('text').str.slice(0, 5),
pl.col('text').str.slice(0, 5),
)
)
)
# shape: (2, 2)
# ┌─────────────────────────────────────────────┬───────────────────────┐
# │ text ┆ dedupe │
# │ --- ┆ --- │
# │ str ┆ str │
# ╞═════════════════════════════════════════════╪═══════════════════════╡
# │ 3612030701 árvíztűrő! 3612030701 árvíztűrő! ┆ 3612030701 árvíztűrő! │
# │ 3612030701 arvizturo! 3612030701 arvizturo! ┆ 3612030701 arvizturo! │
# └─────────────────────────────────────────────┴───────────────────────┘
Although it would need regex_escape exposed (#12154) to prevent text being parsed as a regex.
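Until then, one hedged workaround is to escape the prefix row by row with Python's re.escape before building the pattern (assuming re.escape's output is acceptable to the underlying regex engine for your data):

import re
import polars as pl

df = pl.DataFrame({'text': ['3612030701 árvíztűrő! 3612030701 árvíztűrő!',
                            '3612030701 arvizturo! 3612030701 arvizturo!']})
# escape regex metacharacters in the 5-character prefix, one row at a time
escaped = pl.col('text').str.slice(0, 5).map_elements(re.escape, return_dtype=pl.Utf8)
df.with_columns(
    dedupe=pl.col('text').str.extract(
        pl.format('({}.*?){}.*', escaped, escaped)
    )
)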
Issue description
.str.find() should return 22, but it returns 26, as if accented characters were counted as two characters.
Expected behavior
.str.find() should count accented characters as one character (not two).