haskell-hvr / regex-tdfa

Pure Haskell Tagged DFA Backend for "Text.Regex" (regex-base)
http://hackage.haskell.org/package/regex-tdfa
Other
36 stars 9 forks source link

Performance issue w/ new `Text` instances #9

Open hvr opened 4 years ago

hvr commented 4 years ago

When I tried benchmarking, I noticed a strange and significant performance-penalty effect that is exposed by the Text support recently merged into regex-tdfa. (The String and ByteString instances don't exhibit this performance effect)

It turns out that the regex-tdfa-text package already suffered from this but nobody seems to have reported it yet and it's probably not too apparent in simple regex-matching applications. Since the Text support is fundamentally more or less the same code as the ByteString implementation, I'm suspecting a weird interaction with the fusion-rules in text.

TODO: Provide repro-case

Xitian9 commented 4 years ago

Knowing nothing about what causes this issue, something to note is that regex-tdfa supplies SPECIALIZE pragmas for String, Seq Char, and both strict and lazy ByteString. Notably, it does not provide SPECIALIZE pragmas for strict or lazy Text.

With that it mind, do you still see this issue after using the code in PR #8?

hvr commented 4 years ago

Knowing nothing about what causes this issue, something to note is that regex-tdfa supplies SPECIALIZE pragmas for String, both strict and lazy ByteString, and Seq Char. Notably, it does not provide SPECIALIZE pragmas for strict or lazy Text.

With that it mind, do you still see this issue after using the code in PR #8?

I'm afraid so.

I didn't have time yet to investigate more yet, but maybe you can?

The code below clearly shows the slowdown; its output for me is

== ByteString
92
email len : 0.575371484s
()
email rnf : 0.589730101s
== String
92
email len : 0.527972566s
()
email rnf : 0.538322764s
== Text
92
email len : 0.542759858s
()
email rnf : 29.904703049s

Counting the matches is equally fast ("email len"); the trouble starts when actually generating the match strings ("email rnf"), showing a huge 60-fold regression for the email regexp.

I commented out the other test-regexps as the email-regexp is enough to demonstrate the issue very clearly:

{-# LANGUAGE BangPatterns        #-}
{-# LANGUAGE FlexibleContexts    #-}
{-# LANGUAGE ScopedTypeVariables #-}

{-# OPTIONS_GHC -Wall #-}

{- cabal:

build-depends: time, text, deepseq, bytestring, regex-tdfa, regex-base

-}

module Main where

import           Control.DeepSeq
import           Control.Exception
import           Control.Monad         ()
import qualified Data.ByteString       as BS
import           Data.Text             (Text)
import qualified Data.Text             as T
import qualified Data.Text.Encoding    as T
import qualified Data.Text.IO          as T
import           Data.Time.Clock.POSIX
import           Text.Regex.TDFA

main :: IO ()
main = do
    !txt <- T.readFile "input-text.txt"

    putStrLn "== ByteString"
    !bs <- evaluate (T.encodeUtf8 txt)
    doBench (bs :: BS.ByteString)

    putStrLn "== String"
    str <- evaluate (force (T.unpack txt))
    doBench (str :: [Char])

    putStrLn "== Text"
    doBench (txt :: Text)

{-# NOINLINE doBench #-}
doBench :: forall text . (Show text, NFData text, RegexLike Regex text) => text -> IO ()
doBench txt = do
    timeit "email len" $ print (length (getAllTextMatches (txt =~ reEmail) :: [text]))
--  timeit "URI   len" $ print (length (getAllTextMatches (txt =~ reURI)   :: [text]))
--  timeit "IPv4  len" $ print (length (getAllTextMatches (txt =~ reIPv4)  :: [text]))
    timeit "email rnf" $ print (rnf (getAllTextMatches (txt =~ reEmail) :: [text]))
--  timeit "URI   rnf" $ print (rnf (getAllTextMatches (txt =~ reURI)   :: [text]))
--  timeit "IPv4  rnf" $ print (rnf (getAllTextMatches (txt =~ reIPv4)  :: [text]))
  where
    reEmail = "[_a-zA-Z0-9\\.+-]+@[_a-zA-Z0-9\\.-]+\\.[_a-zA-Z0-9\\.-]+"
--  reURI   = "[_a-zA-Z0-9]+://[^/\\s?#]+[^\\s?#]+(\\?[^\\s#]*)?(#[^\\s]*)?"
--  reIPv4  = "((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])"

timeit :: String -> IO a -> IO a
timeit msg act = do
  !t0 <- getPOSIXTime
  !res <- act
  !t1 <- getPOSIXTime
  putStrLn (msg ++ " : " ++ show (t1-t0))
  pure res
Xitian9 commented 4 years ago

Hmm, that looks sticky. I have no particular knowledge that makes it likely that I can solve this problem, but if I have time I might have a poke around.

gwern commented 2 years ago

Is there any update on this? Or easy way I could help?

Background: I originally used text-tdfa on gwern.net for regexp tests & rewrites back in 2019 or so with the Text support, but it was so slow I moved to the basic Posix regex library; unfortunately, over the past 2 years I've been running into ever more segfaults and 'strange closures' (GHC would even segfault while compiling the code!), which generally seemed to be related to the regular regexes, which have a long trail of weird issues being reported over the years. Moving back to tdfa (with the regex-compat-tdfa compat layer for the rewrites) mostly seems to resolve the constant segfaulting, but also slows down site compile time by like 2x (from ~1.5h to >3h). Before I spend a lot of time optimizing by hand, it'd be nice if text-tdfa fixed its lowhanging fruit.

andreasabel commented 2 years ago

@gwern : I am not planning on working on this, but a high-quality PR is definitely welcome.
My main concern is to keep regex-tdfa working stably in an evolving Haskell ecosystem. But improvements are welcome...

Xitian9 commented 2 years ago

One thing to note is that there are some pretty big changes in the text package which will be landing soon: https://github.com/haskell/text/pull/365, https://github.com/haskell/text/pull/348. The second in particular might have a large effect on any work here. Maybe disabling those rules will solve the problem? Maybe it will not, but will defeat any solution which is derived in the meantime. I think looking at the effect of those would be a good first thing to do.

gwern commented 2 years ago

Have those already been released? I see on Hackage the last text upload date is 2021-12-24 but the UTF8 change was merged 2021-09-08 and the implicit-fusion disabling 2021-06-21, if I'm reading these bug reports correctly, so shouldn't both of them already be in live text, not just dev HEAD?

Xitian9 commented 2 years ago

It's complicated slightly by the fact that text is a boot package, and so is bundled with ghc. You can access it now on hackage, but it won't be the default version you get when you compile (without some extra contortions) until ghc-9.4.