Open hvr opened 4 years ago
Knowing nothing about what causes this issue, something to note is that regex-tdfa
supplies SPECIALIZE
pragmas for String
, Seq Char
, and both strict and lazy ByteString
. Notably, it does not provide SPECIALIZE
pragmas for strict or lazy Text
.
With that it mind, do you still see this issue after using the code in PR #8?
Knowing nothing about what causes this issue, something to note is that
regex-tdfa
suppliesSPECIALIZE
pragmas forString
, both strict and lazyByteString
, andSeq Char
. Notably, it does not provideSPECIALIZE
pragmas for strict or lazyText
.With that it mind, do you still see this issue after using the code in PR #8?
I'm afraid so.
I didn't have time yet to investigate more yet, but maybe you can?
The code below clearly shows the slowdown; its output for me is
== ByteString
92
email len : 0.575371484s
()
email rnf : 0.589730101s
== String
92
email len : 0.527972566s
()
email rnf : 0.538322764s
== Text
92
email len : 0.542759858s
()
email rnf : 29.904703049s
Counting the matches is equally fast ("email len"); the trouble starts when actually generating the match strings ("email rnf"), showing a huge 60-fold regression for the email regexp.
I commented out the other test-regexps as the email-regexp is enough to demonstrate the issue very clearly:
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# OPTIONS_GHC -Wall #-}
{- cabal:
build-depends: time, text, deepseq, bytestring, regex-tdfa, regex-base
-}
module Main where
import Control.DeepSeq
import Control.Exception
import Control.Monad ()
import qualified Data.ByteString as BS
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as T
import qualified Data.Text.IO as T
import Data.Time.Clock.POSIX
import Text.Regex.TDFA
main :: IO ()
main = do
!txt <- T.readFile "input-text.txt"
putStrLn "== ByteString"
!bs <- evaluate (T.encodeUtf8 txt)
doBench (bs :: BS.ByteString)
putStrLn "== String"
str <- evaluate (force (T.unpack txt))
doBench (str :: [Char])
putStrLn "== Text"
doBench (txt :: Text)
{-# NOINLINE doBench #-}
doBench :: forall text . (Show text, NFData text, RegexLike Regex text) => text -> IO ()
doBench txt = do
timeit "email len" $ print (length (getAllTextMatches (txt =~ reEmail) :: [text]))
-- timeit "URI len" $ print (length (getAllTextMatches (txt =~ reURI) :: [text]))
-- timeit "IPv4 len" $ print (length (getAllTextMatches (txt =~ reIPv4) :: [text]))
timeit "email rnf" $ print (rnf (getAllTextMatches (txt =~ reEmail) :: [text]))
-- timeit "URI rnf" $ print (rnf (getAllTextMatches (txt =~ reURI) :: [text]))
-- timeit "IPv4 rnf" $ print (rnf (getAllTextMatches (txt =~ reIPv4) :: [text]))
where
reEmail = "[_a-zA-Z0-9\\.+-]+@[_a-zA-Z0-9\\.-]+\\.[_a-zA-Z0-9\\.-]+"
-- reURI = "[_a-zA-Z0-9]+://[^/\\s?#]+[^\\s?#]+(\\?[^\\s#]*)?(#[^\\s]*)?"
-- reIPv4 = "((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])"
timeit :: String -> IO a -> IO a
timeit msg act = do
!t0 <- getPOSIXTime
!res <- act
!t1 <- getPOSIXTime
putStrLn (msg ++ " : " ++ show (t1-t0))
pure res
Hmm, that looks sticky. I have no particular knowledge that makes it likely that I can solve this problem, but if I have time I might have a poke around.
Is there any update on this? Or easy way I could help?
Background: I originally used text-tdfa on gwern.net for regexp tests & rewrites back in 2019 or so with the Text support, but it was so slow I moved to the basic Posix regex library; unfortunately, over the past 2 years I've been running into ever more segfaults and 'strange closures' (GHC would even segfault while compiling the code!), which generally seemed to be related to the regular regexes, which have a long trail of weird issues being reported over the years. Moving back to tdfa (with the regex-compat-tdfa
compat layer for the rewrites) mostly seems to resolve the constant segfaulting, but also slows down site compile time by like 2x (from ~1.5h to >3h). Before I spend a lot of time optimizing by hand, it'd be nice if text-tdfa fixed its lowhanging fruit.
@gwern : I am not planning on working on this, but a high-quality PR is definitely welcome.
My main concern is to keep regex-tdfa
working stably in an evolving Haskell ecosystem. But improvements are welcome...
One thing to note is that there are some pretty big changes in the text package which will be landing soon: https://github.com/haskell/text/pull/365, https://github.com/haskell/text/pull/348. The second in particular might have a large effect on any work here. Maybe disabling those rules will solve the problem? Maybe it will not, but will defeat any solution which is derived in the meantime. I think looking at the effect of those would be a good first thing to do.
Have those already been released? I see on Hackage the last text
upload date is 2021-12-24 but the UTF8 change was merged 2021-09-08 and the implicit-fusion disabling 2021-06-21, if I'm reading these bug reports correctly, so shouldn't both of them already be in live text
, not just dev HEAD?
It's complicated slightly by the fact that text
is a boot package, and so is bundled with ghc. You can access it now on hackage, but it won't be the default version you get when you compile (without some extra contortions) until ghc-9.4.
When I tried benchmarking, I noticed a strange and significant performance-penalty effect that is exposed by the
Text
support recently merged intoregex-tdfa
. (TheString
andByteString
instances don't exhibit this performance effect)It turns out that the
regex-tdfa-text
package already suffered from this but nobody seems to have reported it yet and it's probably not too apparent in simple regex-matching applications. Since theText
support is fundamentally more or less the same code as theByteString
implementation, I'm suspecting a weird interaction with the fusion-rules intext
.TODO: Provide repro-case