sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License

clean_phone module doesn't recognize e.164 extension format #850

Open yukewang1 opened 2 years ago

yukewang1 commented 2 years ago

Describe the bug

The E.164 standards state that phone numbers can be written in the format +<CountryCode><City/AreaCode><LocalNumber>;ext=<ext>, for example +19052223333;ext=555. The current clean_phone() function doesn't recognize such numbers because the regex at line 16 of clean_phone.py does not cover this rule.
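For illustration, here is a minimal sketch of a pattern that accepts the optional ;ext= suffix. It uses only the standard re module; the pattern and the split_extension helper are hypothetical and are not the regex that currently lives in clean_phone.py.

```python
import re

# Hypothetical pattern (not taken from clean_phone.py): a leading "+",
# up to 15 digits with a non-zero first digit, then an optional
# ";ext=<digits>" suffix.
E164_WITH_EXT = re.compile(r"\+(?P<number>[1-9]\d{1,14})(?:;ext=(?P<ext>\d+))?")


def split_extension(raw):
    """Return (number, extension) or None when the text does not match."""
    match = E164_WITH_EXT.fullmatch(raw.strip())
    if match is None:
        return None
    # The extension group is None when no ";ext=" suffix is present.
    return "+" + match.group("number"), match.group("ext")


print(split_extension("+19052223333;ext=555"))
print(split_extension("+19052223333"))
print(split_extension("9052223333;ext=555"))
```

Using fullmatch rather than search means trailing junk after the extension is rejected instead of silently ignored.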

To Reproduce

Steps to reproduce the behavior:

from dataprep.clean import clean_phone
import pandas as pd

df = pd.DataFrame({
    "phone": ["+19052223333;ext=555"]
})

clean_phone(df, "phone", output_format="e164")

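Expected behavior

The output should keep both the number and its extension, in a form such as +12345678901 ext. 1234 (for the input above, +19052223333 ext. 555). Instead, clean_phone() doesn't recognize this format and outputs np.NaN.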
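As a rough sketch of the requested behavior, the helper below turns an input with an ;ext= suffix into the "<number> ext. <ext>" form. The name format_with_extension is hypothetical and not part of dataprep, and it returns None where clean_phone() would produce np.NaN.

```python
def format_with_extension(raw):
    """Format '+<digits>;ext=<digits>' as '+<digits> ext. <digits>'.

    Hypothetical helper for illustration only; returns None for input
    it cannot interpret.
    """
    # partition() yields (number, ";ext=", ext), or (raw, "", "") when
    # the suffix is absent.
    number, sep, ext = raw.strip().partition(";ext=")
    digits = number[1:]
    valid_number = (
        number.startswith("+")
        and digits.isascii()
        and digits.isdigit()
        and len(digits) <= 15
    )
    # When the suffix is present, the extension must be ASCII digits only.
    valid_ext = not sep or (ext.isascii() and ext.isdigit())
    if not (valid_number and valid_ext):
        return None
    return f"{number} ext. {ext}" if sep else number


print(format_with_extension("+19052223333;ext=555"))  # +19052223333 ext. 555
print(format_with_extension("not a phone"))           # None
```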

Additional context

Here's a blog post explaining the E.164 standards, specifically how to specify an extension: Link

yixuy commented 2 years ago

Good catch! Thanks for the context. We will fix it soon!