pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Str.find() counts accented characters twice #14190

Open lmocsi opened 9 months ago

lmocsi commented 9 months ago

Reproducible example

import polars as pl
a = pl.DataFrame({'text': '3612030701 árvíztűrő! 3612030701 árvíztűrő!'})
(a.with_columns(idx = (pl.col('text').str.slice(1).str.find(pl.col('text').str.slice(0,5), literal=True)+1).fill_null(0))
  .with_columns(text2 = pl.col('text').str.slice(0,pl.col('idx')-1))
 ).rows()

Log output

[('3612030701 árvíztűrő! 3612030701 árvíztűrő!',
  26,
  '3612030701 árvíztűrő! 361')]

Issue description

.str.find() should return 22, but it returns 26, as if each accented character were counted as two characters.

Expected behavior

.str.find() should count accented characters as one character (not twice)

Installed versions

```
--------Version info---------
Polars: 0.20.6
Index type: UInt32
Platform: Linux-4.18.0-372.71.1.el8_6.x86_64-x86_64-with-glibc2.28
Python: 3.9.13 (main, Oct 13 2022, 21:15:33) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle: 2.0.0
connectorx:
deltalake:
fsspec: 2022.02.0
gevent:
hvplot:
matplotlib: 3.8.0
numpy: 1.23.5
openpyxl: 3.0.9
pandas: 2.1.0
pyarrow: 15.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy: 1.4.27
xlsx2csv:
xlsxwriter: 3.1.3
```

ritchie46 commented 9 months ago

I think it returns the byte offset. I am guessing that the accented chars take two bytes in UTF-8.

I think we should update the docs to make clear that we return byte offsets.

ghuls commented 9 months ago

You don't have to guess:

# First 22 characters.
In [11]: '3612030701 árvíztűrő! 3612030701 árvíztűrő!'[:22]
Out[11]: '3612030701 árvíztűrő! '

# First 26 bytes.
In [12]: '3612030701 árvíztűrő! 3612030701 árvíztűrő!'.encode('utf-8')[:26]
Out[12]: b'3612030701 \xc3\xa1rv\xc3\xadzt\xc5\xb1r\xc5\x91! '

# First 26 bytes converted back to utf-8.
In [13]: '3612030701 árvíztűrő! 3612030701 árvíztűrő!'.encode('utf-8')[:26].decode('utf-8')
Out[13]: '3612030701 árvíztűrő! '
lmocsi commented 9 months ago

So if I'd like to remove repetitions from a string, then something like this should happen:

import polars as pl
a = pl.DataFrame({'text': ['3612030701 árvíztűrő! 3612030701 árvíztűrő!','3612030701 arvizturo! 3612030701 arvizturo!']})
(a.with_columns(idx = (pl.col('text').str.slice(1).str.find(pl.col('text').str.slice(0,5), literal=True)+1).fill_null(0))
  .with_columns(text2 = pl.col('text').str.slice(0,pl.col('idx')-1))
 ).rows()

Results:
[('3612030701 árvíztűrő! 3612030701 árvíztűrő!',
  26,
  '3612030701 árvíztűrő! 361'),
 ('3612030701 arvizturo! 3612030701 arvizturo!', 22, '3612030701 arvizturo!')]

With non-accented characters, this works as intended. With accented characters, the number of accented characters (b in the example below) should be subtracted from idx, something like this (going back to native Python):

a='3612030701 árvíztűrő!'.lower()
b = a.count('á')+a.count('é')+a.count('í')+a.count('ó')+a.count('ö')+a.count('ő')+a.count('ú')+a.count('ü')+a.count('ű')
print(b)

Results:
4

Is there something like this str.count() in polars?

On the other hand re.search() returns starting position in characters, not bytes:

import re
a = '3612030701 árvíztűrő! 3612030701 árvíztűrő!'
x = re.search(r'36120', a[1:])
print(x.start()+1)

Results:
22

Is such support planned? Maybe an option in str.find(), something like characters=True?
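
Independent of any such option, a byte offset can be turned into a character offset by decoding the UTF-8 prefix that it covers. The following is a minimal plain-Python sketch, not a Polars API: byte_to_char_offset is a hypothetical helper name, and it assumes the offset falls on a character boundary (which should hold for the start of a successful match).

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a character (code point) offset."""
    # Decode only the bytes that precede the offset; the number of decoded
    # characters is the character offset. A strict decode raises
    # UnicodeDecodeError if the offset splits a multi-byte character.
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


text = "3612030701 árvíztűrő! 3612030701 árvíztűrő!"
char_idx = byte_to_char_offset(text, 26)
print(char_idx)              # 22
print(text[: char_idx - 1])  # 3612030701 árvíztűrő!
```

This avoids listing accented letters by hand, since any multi-byte character is accounted for by the decode step.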

lmocsi commented 9 months ago

An ugly workaround can be applied like this:

import polars as pl
a = pl.DataFrame({'text': ['3612030701 árvíztűrő! 3612030701 árvíztűrő!','3612030701 arvizturo! 3612030701 arvizturo!']})
(a.with_columns(idx = (pl.col('text').str.slice(1).str.find(pl.col('text').str.slice(0,5), literal=True)+1).fill_null(0))
  .with_columns(text2 = pl.col('text').str.slice(0,pl.col('idx')-1)
                          .map_elements(lambda x: x[:len(x)-(x.lower().count('á')+x.lower().count('é')+x.lower().count('í')+x.lower().count('ó')+x.lower().count('ö')+x.lower().count('ő')+x.lower().count('ú')+x.lower().count('ü')+x.lower().count('ű'))]))
 ).rows()

Results:
[('3612030701 árvíztűrő! 3612030701 árvíztűrő!', 26, '3612030701 árvíztűrő!'),
 ('3612030701 arvizturo! 3612030701 arvizturo!', 22, '3612030701 arvizturo!')]

Here the idx is incorrect, but the end result (text2 field) is correct.

lmocsi commented 9 months ago

If str.find() returns byte offsets, which function slices by bytes (the way str.slice() slices by characters)?
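
For reference, this is what slicing by bytes would mean, sketched in plain Python. slice_bytes is a hypothetical helper, not an existing Polars function, and dropping a character that the cut splits in half is an assumption about the desired behavior.

```python
def slice_bytes(text: str, offset: int, length: int) -> str:
    """Slice a string by UTF-8 byte positions instead of characters."""
    raw = text.encode("utf-8")[offset : offset + length]
    # errors="ignore" silently drops the partial bytes of a multi-byte
    # character that the cut splits at either end of the slice.
    return raw.decode("utf-8", errors="ignore")


text = "3612030701 árvíztűrő! 3612030701 árvíztűrő!"
# Same arguments as str.slice(0, idx - 1) above, with idx = 26 in bytes.
print(slice_bytes(text, 0, 26 - 1))  # 3612030701 árvíztűrő!
```

A helper like this could be applied per row with map_elements, as in the workaround above, at the cost of running Python code for every value.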

cmdlineluser commented 9 months ago

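Is it possible to do this with .extract?

import polars as pl

pl.Config(fmt_str_lengths=120)

# Same two-row frame as in the earlier examples.
df = pl.DataFrame({'text': ['3612030701 árvíztűrő! 3612030701 árvíztűrő!','3612030701 arvizturo! 3612030701 arvizturo!']})

df.with_columns(dedupe = 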
   pl.col('text').str.extract(
      pl.format('({}.*?){}.*',
         pl.col('text').str.slice(0, 5),
         pl.col('text').str.slice(0, 5),
      )
   )
)

# shape: (2, 2)
# ┌─────────────────────────────────────────────┬───────────────────────┐
# │ text                                        ┆ dedupe                │
# │ ---                                         ┆ ---                   │
# │ str                                         ┆ str                   │
# ╞═════════════════════════════════════════════╪═══════════════════════╡
# │ 3612030701 árvíztűrő! 3612030701 árvíztűrő! ┆ 3612030701 árvíztűrő! │
# │ 3612030701 arvizturo! 3612030701 arvizturo! ┆ 3612030701 arvizturo! │
# └─────────────────────────────────────────────┴───────────────────────┘

Although it would need regex_escape to be exposed (#12154) to prevent the text from being parsed as a regex.
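
To illustrate why escaping matters, here is the same pattern sketched with Python's standard re module, where re.escape is available. This is plain Python rather than Polars, dedupe_by_prefix is a hypothetical helper name, and Python's regex syntax is not guaranteed to match the engine Polars uses.

```python
import re


def dedupe_by_prefix(text: str, prefix_len: int = 5) -> str:
    """Keep the text up to the second occurrence of its own prefix."""
    # Escape the prefix so characters like '.' or '+' are matched literally.
    prefix = re.escape(text[:prefix_len])
    match = re.match(f"({prefix}.*?){prefix}.*", text)
    # With no repetition the pattern does not match; return the text as-is.
    return match.group(1) if match else text


print(repr(dedupe_by_prefix("3612030701 árvíztűrő! 3612030701 árvíztűrő!")))
# '3612030701 árvíztűrő! '
print(repr(dedupe_by_prefix("a.b+c x a.b+c x")))
# 'a.b+c x '
```

Without the escape, the second input would not match at all, because b+ would consume the literal b and the following c could not match the + character. Note that the captured group keeps the trailing space before the repetition.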