pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Some string methods treat "." as regex, others don't #37963

Open giangiacomosanna opened 4 years ago

giangiacomosanna commented 4 years ago

Code Sample, a copy-pastable example

import re
import pandas as pd

texts = ['aa.aa', 'bbb']

# str.replace vs re.sub: '.' is not treated as regex by pandas
print(*pd.Series(texts).str.replace('.', '*'), sep=', ')
print(*pd.Series(texts).str.replace('..', '**'), sep=', ')
print(*[re.sub('.', '*', t) for t in texts], sep=', ')
print(*[re.sub('..', '**', t) for t in texts], sep=', ')
print()

# str.split vs re.split: '.' is not treated as regex by pandas
print(*pd.Series(texts).str.split('.'), sep=', ')
print(*pd.Series(texts).str.split('..'), sep=', ')
print(*[re.split('.', t) for t in texts], sep=', ')
print(*[re.split('..', t) for t in texts], sep=', ')
print()

# str.contains vs re.search: '.' is treated as regex by both
print(*pd.Series(texts).str.contains('.'), sep=', ')
print(*pd.Series(texts).str.contains('..'), sep=', ')
print(*[re.search('.', t).start() for t in texts], sep=', ')
print(*[re.search('..', t).start() for t in texts], sep=', ')

Problem description

Currently some pandas string functions (e.g. str.replace, str.split) treat "." like "[.]", the regex matching the literal dot character, while others (e.g. str.contains) treat it as a regex matching any character. The ".." string, by contrast, is always handled correctly as the regex matching any two characters.
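To make the two readings of "." concrete, here is a minimal sketch using only the standard-library re module (no pandas involved). The variable names are illustrative and not part of the original report.

```python
import re

text = "aa.aa"

# Unescaped "." is a regex wildcard: every character is replaced.
as_wildcard = re.sub(".", "*", text)

# "[.]" and re.escape(".") both match only the literal dot character.
as_literal_class = re.sub("[.]", "*", text)
as_literal_escaped = re.sub(re.escape("."), "*", text)

print(as_wildcard)         # *****
print(as_literal_class)    # aa*aa
print(as_literal_escaped)  # aa*aa
```

Writing "[.]" or re.escape(".") states the literal intent explicitly, so the result does not depend on how a given function interprets a bare ".".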

Expected Output

pandas string functions (when in regex mode) should consistently treat "." as the regex matching any single character. If there is a reason to treat it as an exception, or to do so only for some string functions, that should be documented.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : db08276bc116c438d3fdee492026f8223584c477
python : 3.8.5.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Thu Jun 18 20:49:00 PDT 2020; root:xnu-6153.141.1~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.1.3
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.1.post20201107
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

jreback commented 4 years ago

There are a lot of very similar issues, e.g. https://github.com/pandas-dev/pandas/issues/24804

A single-character pattern is not treated as a regex. I suppose this could be documented (I think it is in places).

PRs to fix are welcome.
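
The single-character rule described above can be sketched in plain Python. This is only an illustration of that rule under the assumption that the decision depends on pattern length alone. The helper replace_single_char_literal is hypothetical and is not pandas' actual implementation.

```python
import re


def replace_single_char_literal(text, pat, repl):
    """Hypothetical helper (not pandas code) mimicking the described rule."""
    if len(pat) == 1:
        # A one-character pattern is taken literally.
        return text.replace(pat, repl)
    # Longer patterns are handed to the regex engine.
    return re.sub(pat, repl, text)


texts = ["aa.aa", "bbb"]

one_dot = [replace_single_char_literal(t, ".", "*") for t in texts]
two_dots = [replace_single_char_literal(t, "..", "**") for t in texts]

print(one_dot)   # ['aa*aa', 'bbb']
print(two_dots)  # ['****a', '**b']
```

Under this assumed rule, "." only replaces the literal dot while ".." replaces any two characters, which matches the asymmetry reported for str.replace.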