Open asarkar opened 2 years ago
Could you submit a small Haskell program demonstrating the problem?
Then it would be easy to compare the behavior of regex-tdfa to the other implementations, like regex-pcre, regex-posix, etc.
Perhaps this will help, taken from my StackOverflow question.
module WordCount (wordCount) where

import qualified Data.Char as C
import qualified Data.List as L
import Text.Regex.TDFA as R

wordCount :: String -> [(String, Int)]
wordCount xs =
  do
    let zs = R.getAllTextMatches (xs =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map C.toLower w | w <- zs]
    return (head g, length g)
What the others do:
regex-pcre does what you want (finds the digits).
regex-posix finds no matches.
Concerning regex-tdfa, if you look up the documentation at https://hackage.haskell.org/package/regex-tdfa under section Special characters, \d is not included. Thus, you should not be surprised it is not supported.
It even says explicitly:
regex-tdfa only supports a small set of special characters and is much less featureful than some other regex engines you might be used to, such as PCRE.
So, the easiest solution for you might be to use regex-pcre.
(Not sure what your intention with filing this report was, maybe you want to PR.)
I found this library looking for a regex package, and saw it mentioned in the Haskell wiki, and in a blog that’s now part of the README. I compared various libraries based on their maintainability (last commit date) and popularity (GitHub stars, issues addressed promptly), and this one came out at the top. Because of that, I’m indeed surprised that something as common as `\d` isn’t supported. I’m a Haskell freshman and don’t have the skills yet to start making PRs on a general-purpose library.
Predefined character classes we could support are listed here: https://en.wikipedia.org/w/index.php?title=Regular_expression&section=13#Character_classes
One could recognize them directly in the parser: https://github.com/haskell-hvr/regex-tdfa/blob/95d47cb982d2cf636b2cb6260a866f9907341c45/lib/Text/Regex/TDFA/ReadRegex.hs#L94
Maybe it is better to handle them in the translation: https://github.com/haskell-hvr/regex-tdfa/blob/95d47cb982d2cf636b2cb6260a866f9907341c45/lib/Text/Regex/TDFA/CorePattern.hs#L537-L538
There seems to be already code for POSIX character classes:
https://github.com/haskell-hvr/regex-tdfa/blob/95d47cb982d2cf636b2cb6260a866f9907341c45/lib/Text/Regex/TDFA/TNFA.hs#L798-L805
These can be given to Pattern's PAny and PAnyNot:
https://github.com/haskell-hvr/regex-tdfa/blob/95d47cb982d2cf636b2cb6260a866f9907341c45/lib/Text/Regex/TDFA/Pattern.hs#L45-L46
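Until something like this exists in the library, the same mapping can be approximated on the caller's side. A minimal sketch, assuming a hypothetical helper expandPerlClasses (not part of regex-tdfa) that rewrites a few Perl-style escapes into POSIX bracket expressions before the pattern is compiled. It does not handle escapes that already sit inside a [...] set, and the choice of \d, \s and \w is only illustrative.

```haskell
-- | Hypothetical user-side helper (NOT part of regex-tdfa): rewrite a few
-- Perl-style escapes into POSIX bracket expressions.
-- Limitation: an escape inside an existing [...] set is rewritten too,
-- which would produce an invalid pattern.
expandPerlClasses :: String -> String
expandPerlClasses ('\\' : c : rest) =
  case lookup c table of
    -- Known escape: substitute the POSIX spelling.
    Just posix -> posix ++ expandPerlClasses rest
    -- Anything else (e.g. \b or an escaped backslash) is kept as-is.
    Nothing    -> '\\' : c : expandPerlClasses rest
  where
    table =
      [ ('d', "[[:digit:]]")
      , ('s', "[[:space:]]")
      , ('w', "[[:alnum:]_]")
      ]
expandPerlClasses (c : rest) = c : expandPerlClasses rest
expandPerlClasses [] = []
```

The result would then be passed to =~ in place of the original pattern string.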
@asarkar: The syntax accepted by regex-tdfa is [[:digit:]] instead of \d. See https://regex101.com/r/griuTm/1 for your whole regex.
https://regex101.com/r/griuTm/1 shows \d, is that the correct link?
> regex101.com/r/griuTm/1 shows \d, is that the correct link?
No, \d should be replaced by [[:digit:]]. I updated the regex, but the link didn't update.
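With that substitution, the word-count example from the report could look like the sketch below. It assumes the rest of the pattern (including \b) is kept unchanged. The list comprehension replaces the original do-block but is meant to behave the same.

```haskell
import qualified Data.Char as C
import qualified Data.List as L
import Text.Regex.TDFA

-- | Count case-insensitive occurrences of each number or word.
-- The only change to the pattern is [[:digit:]] in place of \d.
wordCount :: String -> [(String, Int)]
wordCount xs =
  [ (w, length g)
  | g@(w : _) <- L.group (L.sort (map (map C.toLower) tokens))
  ]
  where
    -- All non-overlapping matches of the pattern, in input order.
    tokens = getAllTextMatches (xs =~ "[[:digit:]]+|\\b[a-zA-Z']+\\b") :: [String]
```

In GHCi, wordCount "testing, 1, 2 testing" can be used to check that the digits are now found.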
Supporting Perl-style regexes like \d would not be hard to implement, but it would be a backward-incompatible change, because currently \d means simply d. So, I am not sure whether it is worth it. While \d is quicker to type, [[:digit:]] is easier to comprehend if you look at a regex. What is your application of regex-tdfa?
I intend to use regex-tdfa to solve some exercises from https://exercism.org/tracks/haskell. An alternative is using a parser combinator library like Megaparsec, which is significantly harder.
For example, given below is a question that I previously solved in Rust using a regex library. The pattern I used was (?:[2-9][0-9]{2}){2}(?:[0-9]{4}).
They have a predefined list of packages they allow; regex-tdfa is not currently in that list, but I've submitted a PR to get it included.
If you're reluctant to make this change, and I'm not talking about \d only, I'll be happy to use any other regex package, but like I said before, it doesn't seem like there are a lot of great options.
Clean up user-entered phone numbers so that they can be sent SMS messages.
The North American Numbering Plan (NANP) is a telephone numbering system used by many countries in North America like the United States, Canada or Bermuda. All NANP-countries share the same international country code: 1.
NANP numbers are ten-digit numbers consisting of a three-digit Numbering Plan Area code, commonly known as area code, followed by a seven-digit local number. The first three digits of the local number represent the exchange code, followed by the unique four-digit number which is the subscriber number.
The format is usually represented as (NXX)-NXX-XXXX, where N is any digit from 2 through 9 and X is any digit from 0 through 9.
Your task is to clean up differently formatted telephone numbers by removing punctuation and the country code (1) if present.
For example, the inputs
+1 (613)-995-0253
613-995-0253
1 613 995 0253
613.995.0253
should all produce the output
6139950253
Note: As this exercise only deals with telephone numbers used in NANP-countries, only 1 is considered a valid country code.
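For reference, the clean-up described above can also be sketched without any regex package. This is only an illustration, not the exercise's reference solution: cleanNumber is a hypothetical name, and the sketch simply drops every non-digit character instead of rejecting inputs that contain letters.

```haskell
import Data.Char (isDigit)

-- | Hypothetical helper: keep only the digits, drop an optional leading
-- country code 1, then check the NXX-NXX-XXXX shape.
cleanNumber :: String -> Maybe String
cleanNumber input =
  case stripCountryCode (filter isDigit input) of
    -- Exactly ten digits; area code and exchange code must start with 2-9.
    ds@[a, _, _, d, _, _, _, _, _, _]
      | isN a && isN d -> Just ds
    _ -> Nothing
  where
    -- Only strip the 1 when it prefixes a full ten-digit number.
    stripCountryCode ('1' : rest) | length rest == 10 = rest
    stripCountryCode ds = ds
    isN c = c >= '2' && c <= '9'
```

Returning Maybe String keeps the invalid case explicit, so callers have to decide what to do with numbers that do not fit the format.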
This exercise would be https://exercism.org/tracks/haskell/exercises/phone-number .
Please bear with me, I still have trouble understanding the importance of supporting \d etc.
> For example, given below is a question that I previously solved in Rust using a regex library. The pattern I used was (?:[2-9][0-9]{2}){2}(?:[0-9]{4}).
Ok, but this should be fine, as \d is here spelled out as [0-9].
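A rough sketch of how that pattern might be used with regex-tdfa. It assumes the Rust-style non-capturing groups (?:...) are written as plain groups, since regex-tdfa follows POSIX extended syntax, and it adds ^ and $ anchors so the whole string has to match. isNanpDigits is a hypothetical name.

```haskell
import Text.Regex.TDFA ((=~))

-- | Hypothetical check: does a string of digits have the NXXNXXXXXX shape?
-- Plain groups stand in for the (?:...) groups of the Rust pattern (assumption).
isNanpDigits :: String -> Bool
isNanpDigits s = s =~ "^([2-9][0-9]{2}){2}[0-9]{4}$"
```

This expects punctuation and the country code to have been removed beforehand.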
> They have a predefined list of packages they allow; regex-tdfa is not currently in that list, but I've submitted a PR to get it included.
Please share the link to the PR if that's fine with you.
Would supporting \d etc. be a requirement to have regex-tdfa included?
> Would supporting \d etc. be a requirement to have regex-tdfa included?
No, the PR's been merged. https://github.com/exercism/haskell-test-runner/pull/52
> the importance of supporting \d etc.
The importance, at least to me, is brevity and conciseness. If, in your opinion, what I said so far doesn't justify the change, I've nothing further to add to this discussion. Please make a decision and either implement this ticket or don't; I'm going to get my coat.
Ok, thanks for your input, @asarkar ! I need to balance between convenience and stability. I'll leave this open and see if other users chime in.
Pattern \\d+|\\b[a-zA-Z']+\\b fails to find the digits in input "testing, 1, 2 testing". The regex is correct, as can be tested here: https://regex101.com/r/griuTm/1.
Changing the pattern to \\b[0-9a-zA-Z']+\\b works, but it changes the intent because it makes input "123abc" valid.
\\b[0-9]+\\b|\\b[a-zA-Z']+\\b works too.