haskell-hvr / regex-tdfa

Pure Haskell Tagged DFA Backend for "Text.Regex" (regex-base)
http://hackage.haskell.org/package/regex-tdfa

Perl-style shorthands (like `\d`) not recognized, only POSIX ones (like `[[:digit:]]`) #36

Open asarkar opened 2 years ago

asarkar commented 2 years ago

Pattern \\d+|\\b[a-zA-Z']+\\b fails to find the digits in input "testing, 1, 2 testing". The regex is correct as can be tested here https://regex101.com/r/griuTm/1.

Changing the pattern to \\b[0-9a-zA-Z']+\\b works, but it changes the intent, because input such as "123abc" would then be valid. \\b[0-9]+\\b|\\b[a-zA-Z']+\\b works too.

andreasabel commented 2 years ago

Could you submit a small Haskell program demonstrating the problem? Then it would be easy to compare the behavior of regex-tdfa to the other implementations, like regex-pcre, regex-posix etc.

asarkar commented 2 years ago

Perhaps this will help, taken from my StackOverflow question.

module WordCount (wordCount) where

import qualified Data.Char as C
import qualified Data.List as L
import Text.Regex.TDFA as R

wordCount :: String -> [(String, Int)]
wordCount xs =
  do
    let zs = R.getAllTextMatches (xs =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map C.toLower w | w <- zs]
    return (head g, length g)

andreasabel commented 2 years ago

What the others do:

Concerning regex-tdfa, if you look up the documentation at https://hackage.haskell.org/package/regex-tdfa under section Special characters, \d is not included. Thus, you should not be surprised it is not supported.

It even says explicitly:

> regex-tdfa only supports a small set of special characters and is much less featureful than some other regex engines you might be used to, such as PCRE.

So, the easiest solution for you might be to use regex-pcre.
(Not sure what your intention with filing this report was, maybe you want to PR.)

asarkar commented 2 years ago

I found this library while looking for a regex package, and saw it mentioned in the Haskell wiki, and in a blog that’s now part of the README. I compared various libraries based on their maintainability (last commit date) and popularity (GitHub stars, issues addressed promptly), and this one came out at the top. Because of that, I’m indeed surprised that something as common as `\d` isn’t supported. I’m a Haskell freshman and don’t have the skills yet to start making PRs on a general-purpose library.

andreasabel commented 2 years ago

Predefined character classes we could support are listed here: https://en.wikipedia.org/w/index.php?title=Regular_expression&section=13#Character_classes

One could recognize them either directly in the parser: https://github.com/haskell-hvr/regex-tdfa/blob/95d47cb982d2cf636b2cb6260a866f9907341c45/lib/Text/Regex/TDFA/ReadRegex.hs#L94

Maybe it is better to handle them in the translation: https://github.com/haskell-hvr/regex-tdfa/blob/95d47cb982d2cf636b2cb6260a866f9907341c45/lib/Text/Regex/TDFA/CorePattern.hs#L537-L538
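Until something like that exists in the library, a caller can approximate it by rewriting the pattern string before compiling it. The following is only a sketch of that user-side workaround, not how regex-tdfa would implement the feature: `expandShorthands` and `shorthandTable` are hypothetical names, and the mapping follows the usual Perl-to-POSIX equivalences (`\w` is taken to mean alphanumerics plus underscore).

```haskell
-- Hypothetical helper, not part of regex-tdfa: rewrite a few Perl-style
-- shorthands into POSIX bracket expressions before compiling the pattern.
expandShorthands :: String -> String
expandShorthands ('\\' : c : rest) =
  case lookup c shorthandTable of
    Just posix -> posix ++ expandShorthands rest
    -- Any other escape (including an escaped backslash) is kept unchanged.
    Nothing    -> '\\' : c : expandShorthands rest
expandShorthands (c : rest) = c : expandShorthands rest
expandShorthands []         = []

-- Assumed mapping from shorthand letter to POSIX bracket expression.
shorthandTable :: [(Char, String)]
shorthandTable =
  [ ('d', "[[:digit:]]")
  , ('D', "[^[:digit:]]")
  , ('s', "[[:space:]]")
  , ('S', "[^[:space:]]")
  , ('w', "[[:alnum:]_]")
  , ('W', "[^[:alnum:]_]")
  ]
```

Applied to the pattern from this issue, `expandShorthands "\\d+|\\b[a-zA-Z']+\\b"` gives `"[[:digit:]]+|\\b[a-zA-Z']+\\b"`. The sketch does not track bracket expressions, so a shorthand written inside `[...]` would be rewritten incorrectly.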

andreasabel commented 2 years ago

There seems to be already code for POSIX character classes: https://github.com/haskell-hvr/regex-tdfa/blob/95d47cb982d2cf636b2cb6260a866f9907341c45/lib/Text/Regex/TDFA/TNFA.hs#L798-L805

These can be given to Patterns PAny and PAnyNot: https://github.com/haskell-hvr/regex-tdfa/blob/95d47cb982d2cf636b2cb6260a866f9907341c45/lib/Text/Regex/TDFA/Pattern.hs#L45-L46

@asarkar: The syntax accepted by regex-tdfa is [[:digit:]] instead of \d. See https://regex101.com/r/griuTm/1 for your whole regex.
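For reference, a minimal sketch of the reported `wordCount` with `\d` spelled as `[[:digit:]]`. It assumes the `\b` anchors behave as described earlier in the thread; the list-comprehension form is an illustrative rewrite, not the reporter's code.

```haskell
import qualified Data.Char as C
import qualified Data.List as L
import Text.Regex.TDFA (getAllTextMatches, (=~))

-- Count case-insensitive occurrences of each number or word in the input.
wordCount :: String -> [(String, Int)]
wordCount xs =
  [ (w, length g)
  | g@(w : _) <- L.group (L.sort (map (map C.toLower) tokens))
  ]
  where
    -- Same alternation as in the report, with \d replaced by [[:digit:]].
    tokens :: [String]
    tokens = getAllTextMatches (xs =~ "[[:digit:]]+|\\b[a-zA-Z']+\\b")
```

Matching on `g@(w : _)` avoids the partial `head` from the original snippet while keeping the same grouping logic.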

asarkar commented 2 years ago

https://regex101.com/r/griuTm/1 shows \d, is that the correct link?

andreasabel commented 2 years ago

> regex101.com/r/griuTm/1 shows \d, is that the correct link?

No, \d should be replaced by [[:digit:]]. I updated the regex, but the link didn't update.

Supporting Perl-style regexes like \d would not be hard to implement, but it would be a backward-incompatible change, because currently \d means simply d. So, I am not sure whether it is worth it. While \d is quicker to type, [[:digit:]] is easier to comprehend if you look at a regex. What is your application of regex-tdfa?
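The compatibility concern can be seen in a small sketch. It relies on the behavior stated in the comment above, namely that `\d` is currently read as a literal `d`; the two function names are made up for illustration.

```haskell
import Text.Regex.TDFA ((=~))

-- Per the comment above, regex-tdfa currently reads \d as a literal 'd',
-- so this tests whether the input contains the letter d.
matchesBackslashD :: String -> Bool
matchesBackslashD s = s =~ "\\d"

-- POSIX bracket-expression spelling of "contains a decimal digit".
matchesPosixDigit :: String -> Bool
matchesPosixDigit s = s =~ "[[:digit:]]"
```

Under that behavior, `matchesBackslashD "d"` is `True` and `matchesBackslashD "7"` is `False`, while `matchesPosixDigit` gives the opposite answers. Any existing pattern that relies on the first reading would change meaning if `\d` became a digit class.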

asarkar commented 2 years ago

I intend to use regex-tdfa to solve some exercises from https://exercism.org/tracks/haskell. An alternative is using a parser combinator library like Megaparsec, but that is significantly harder.

For example, given below is a question that I previously solved in Rust using a regex library. The pattern I used was (?:[2-9][0-9]{2}){2}(?:[0-9]{4}).

They have a predefined list of packages they allow; regex-tdfa is not currently in that list, but I've submitted a PR to get it included.

If you're reluctant to make this change (and I'm not talking about \d only), I'll be happy to use any other regex package, but as I said before, there don't seem to be many great options.

Clean up user-entered phone numbers so that they can be sent SMS messages.

The North American Numbering Plan (NANP) is a telephone numbering system used by many countries in North America like the United States, Canada or Bermuda. All NANP-countries share the same international country code: 1.

NANP numbers are ten-digit numbers consisting of a three-digit Numbering Plan Area code, commonly known as area code, followed by a seven-digit local number. The first three digits of the local number represent the exchange code, followed by the unique four-digit number which is the subscriber number.

The format is usually represented as

(NXX)-NXX-XXXX where N is any digit from 2 through 9 and X is any digit from 0 through 9.

Your task is to clean up differently formatted telephone numbers by removing punctuation and the country code (1) if present.

For example, the inputs

- +1 (613)-995-0253
- 613-995-0253
- 1 613 995 0253
- 613.995.0253

should all produce the output

6139950253

Note: As this exercise only deals with telephone numbers used in NANP-countries, only 1 is considered a valid country code.

andreasabel commented 2 years ago

This exercise would be https://exercism.org/tracks/haskell/exercises/phone-number .

Please bear with me, I still have trouble understanding the importance of supporting \d etc.

> For example, given below is a question that I previously solved in Rust using a regex library. The pattern I used was (?:[2-9][0-9]{2}){2}(?:[0-9]{4}).

Ok, but this should be fine, as \d is here spelled out as [0-9].
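A possible shape for that exercise with regex-tdfa, as a sketch only. The function name `number`, the digit-filtering step, and the handling of the country code are assumptions for illustration. The pattern uses plain groups instead of `(?:...)`, on the assumption that non-capturing groups are not available in regex-tdfa's POSIX-style syntax.

```haskell
import Data.Char (isDigit)
import Text.Regex.TDFA ((=~))

-- Hypothetical solution sketch: strip punctuation, drop a leading country
-- code 1, then validate the remaining ten digits against the NANP shape.
number :: String -> Maybe String
number input
  | national =~ nanp = Just national
  | otherwise        = Nothing
  where
    digits = filter isDigit input
    -- Drop the country code only when eleven digits were entered.
    national = case digits of
      ('1' : rest) | length digits == 11 -> rest
      _                                  -> digits
    -- NXX NXX XXXX, where N is 2-9 and X is 0-9; anchored to the whole input.
    nanp = "^([2-9][0-9]{2}){2}[0-9]{4}$" :: String
```

With the inputs quoted in the exercise text, each of the four formats reduces to `Just "6139950253"`, and an input such as `123-456-7890` is rejected because the area code starts with 1.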

> They have a predefined list of packages they allow; regex-tdfa is not currently in that list, but I've submitted a PR to get it included.

Please share the link to the PR if that's fine with you.

Would supporting \d etc. be a requirement to have regex-tdfa included?

asarkar commented 2 years ago

> Would supporting \d etc. be a requirement to have regex-tdfa included?

No, the PR's been merged. https://github.com/exercism/haskell-test-runner/pull/52

> the importance of supporting \d etc.

The importance, at least to me, is brevity and conciseness. If, in your opinion, what I said so far doesn't justify the change, I've nothing further to add to this discussion. Please make a decision, and either proceed to implement this ticket, or don't, I'm going to get my coat.

andreasabel commented 2 years ago

Ok, thanks for your input, @asarkar ! I need to balance between convenience and stability. I'll leave this open and see if other users chime in.