jfilter / clean-text

🧹 Python package for text cleaning

phone numbers with two digit area code not recognized #10

Open cod3licious opened 4 years ago

cod3licious commented 4 years ago

+1 123 1548690 is correctly identified as a phone number, but +49 123 1548690 is not.

cod3licious commented 4 years ago

At the top here are some nice regexes, including this one for phone numbers:

    r"""
    (?:
      (?:            # (international)
        \+?[01]
        [ *\-.\)]*
      )?
      (?:            # (area code)
        [\(]?
        \d{3}
        [ *\-.\)]*
      )?
      \d{3}          # exchange
      [ *\-.\)]*
      \d{4}          # base
    )"""

maybe this fixes it?
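
A quick way to check is to compile the quoted pattern in verbose mode and require a full match. This is only a sketch: the names `QUOTED_PHONE` and `is_full_match` are illustrative and not part of clean-text. Because the international part only accepts a leading 0 or 1, the `+49` number does not match as a whole.

```python
import re

# The pattern quoted above, compiled with re.VERBOSE so that the layout
# whitespace and the "# ..." comments inside it are ignored.
QUOTED_PHONE = re.compile(
    r"""
    (?:
      (?:            # (international)
        \+?[01]
        [ *\-.\)]*
      )?
      (?:            # (area code)
        [\(]?
        \d{3}
        [ *\-.\)]*
      )?
      \d{3}          # exchange
      [ *\-.\)]*
      \d{4}          # base
    )""",
    re.VERBOSE,
)


def is_full_match(text: str) -> bool:
    """Return True only if the whole string is consumed by the pattern."""
    return QUOTED_PHONE.fullmatch(text) is not None


print(is_full_match("+1 123 1548690"))   # True
print(is_full_match("+49 123 1548690"))  # False
```

So on its own this pattern would likely not fix the two-digit country code case.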

cod3licious commented 4 years ago

ok, I think this might work: r"(?:^|(?<=[^\w)]))(((\+?[01])|(\+\d{2}))[ .-]?)?(\(?\d{3}\)?[ .-]?)?(\d{3}[ .-]?\d{4})(\s?(?:ext\.?|[#x-])\s?\d{2,6})?(?:$|(?=\W))"
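
A minimal check of the proposed pattern against the two numbers from the original report, using only the standard `re` module (the variable name `PROPOSED_PHONE` is illustrative; this does not go through clean-text itself):

```python
import re

# The pattern proposed in the comment above, copied unchanged.
PROPOSED_PHONE = re.compile(
    r"(?:^|(?<=[^\w)]))(((\+?[01])|(\+\d{2}))[ .-]?)?(\(?\d{3}\)?[ .-]?)?"
    r"(\d{3}[ .-]?\d{4})(\s?(?:ext\.?|[#x-])\s?\d{2,6})?(?:$|(?=\W))"
)

for number in ("+1 123 1548690", "+49 123 1548690"):
    match = PROPOSED_PHONE.search(number)
    # Print the matched span, or None if nothing matched.
    print(repr(number), "->", repr(match.group(0)) if match else None)
```

Both numbers are matched in full here, including the `+49` prefix. Other formats are not covered by this check.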

AssassinTee commented 4 years ago

    # text_cleaner is defined elsewhere (its setup is not shown in this comment)
    phone_numbers = [
        "2404 9099130",
        "024049099130",
        "02404 9099130",
        "02404/9099130",
        "+492404 9099130",
        "+4924049099130",
        "+492404/9099130",
        "0160 123456789",
        "0160/123456789",
        "+32160 123456789",
        "Tel.: 0160 123456789"
    ]

    for i, number in enumerate(phone_numbers):
        print(f"{i}: {text_cleaner.transform(number)}")

Output:

    0: 2404 <phone>
    1: 024049099130
    2: 02404 <phone>
    3: 02404/<phone>
    4: +492404 <phone>
    5: +4924049099130
    6: +492404/<phone>
    7: 0160 123456789
    8: 0160/123456789
    9: +32160 123456789
    10: tel.: 0160 123456789

:(

jfilter commented 4 years ago

Thanks @cod3licious for providing the regex and thanks @AssassinTee for the test cases. I adapted the regex to make it work with all the provided phone numbers.

rhnfzl commented 1 year ago

The regex doesn't work with phone numbers like

001-504-724-7835x2050
001-687-915-1144
001-507-783-9793x4107
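
One possible direction, as a sketch only: take the pattern proposed earlier in this thread as a starting point and add a hypothetical extra alternative, `00` followed by one to three digits, to the international prefix. The regex actually shipped in clean-text may differ from that starting point, and `EXTENDED_PHONE` is an illustrative name, not a library identifier.

```python
import re

# Hypothetical variant of the pattern proposed earlier in this thread.
# The only change is the added (00\d{1,3}) alternative in the prefix group.
EXTENDED_PHONE = re.compile(
    r"(?:^|(?<=[^\w)]))"
    r"(((\+?[01])|(\+\d{2})|(00\d{1,3}))[ .-]?)?"   # international prefix
    r"(\(?\d{3}\)?[ .-]?)?"                         # area code
    r"(\d{3}[ .-]?\d{4})"                           # exchange + base
    r"(\s?(?:ext\.?|[#x-])\s?\d{2,6})?"             # optional extension
    r"(?:$|(?=\W))"
)

samples = [
    "001-504-724-7835x2050",
    "001-687-915-1144",
    "001-507-783-9793x4107",
]

for number in samples:
    match = EXTENDED_PHONE.search(number)
    # Show the matched span, or None if nothing matched.
    print(repr(number), "->", repr(match.group(0)) if match else None)
```

With this change all three samples are matched in full. It has not been run against the library's own test cases, so the added alternative may cause false positives on other inputs.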