haskell-github-trust / replace-attoparsec

Stream editing with Haskell Attoparsec parsers
BSD 2-Clause "Simplified" License
14 stars 2 forks source link
attoparsec find-and-replace haskell haskell-library stream-editor

replace-attoparsec

Hackage Stackage Nightly Stackage LTS

replace-attoparsec is for finding text patterns, and also replacing or splitting on the found patterns. This activity is traditionally done with regular expressions, but replace-attoparsec uses attoparsec parsers instead for the pattern matching.

replace-attoparsec can be used in the same sort of “pattern capture” or “find all” situations in which one would use Python re.findall or Perl m//, or Unix grep.

replace-attoparsec can be used in the same sort of “stream editing” or “search-and-replace” situations in which one would use Python re.sub, or Perl s///, or Unix sed, or awk.

replace-attoparsec can be used in the same sort of “string splitting” situations in which one would use Python re.split or Perl split.

See replace-megaparsec for the megaparsec version.

Why would we want to do pattern matching and substitution with parsers instead of regular expressions?

Usage Examples

Try the examples in ghci by running cabal v2-repl in the replace-attoparsec/ root directory.

The examples depend on these imports and LANGUAGE OverloadedStrings.

:set -XOverloadedStrings
import Replace.Attoparsec.Text
import Data.Attoparsec.Text as AT
import qualified Data.Text as T
import Control.Applicative
import Data.Either
import Data.Char

Split strings with splitCap

Find all pattern matches, capture the matched text and the parsed result

Separate the input string into sections which can be parsed as a hexadecimal number with a prefix "0x", and sections which can't. Parse the numbers.

let hexparser = string "0x" *> hexadecimal :: Parser Integer
splitCap (match hexparser) "0xA 000 0xFFFF"
[Right ("0xA",10), Left " 000 ", Right ("0xFFFF",65535)]

Pattern match balanced parentheses

Find groups of balanced nested parentheses. This is an example of a “context-free” grammar, a pattern that can't be expressed by a regular expression. We can express the pattern with a recursive parser.

import Data.Functor (void)
import Data.Bifunctor (second)
let parens :: Parser ()
    parens = do
        char '('
        manyTill
            (void parens <|> void anyChar)
            (char ')')
        pure ()

second fst <$> splitCap (match parens) "(()) (()())"
[Right "(())",Left " ",Right "(()())"]

Edit text strings with streamEdit

The following examples show how to search for a pattern in a string of text and then edit the string of text to substitute in some replacement text for the matched patterns.

Pattern match and replace with a constant

Replace all carriage-return-newline occurances with newline.

streamEdit (string "\r\n") (const "\n") "1\r\n2\r\n"
"1\n2\n"

Pattern match and edit the matches

Replace alphabetic characters with the next character in the alphabet.

streamEdit (AT.takeWhile isLetter) (T.map succ) "HAL 9000"
"IBM 9000"

Pattern match and maybe edit the matches, or maybe leave them alone

Find all of the string sections s which can be parsed as a hexadecimal number r, and if r≤16, then replace s with a decimal number. Uses the match combinator.

let hexparser = string "0x" *> hexadecimal :: Parser Integer
streamEdit (match hexparser) (\(s,r) -> if r <= 16 then T.pack (show r) else s) "0xA 000 0xFFFF"
"10 000 0xFFFF"

Pattern match and edit the matches with IO with streamEditT

Find an environment variable in curly braces and replace it with its value from the environment.

import System.Environment (getEnv)
streamEditT (char '{' *> manyTill anyChar (char '}')) (fmap T.pack . getEnv) "- {HOME} -"
"- /home/jbrock -"

Pattern match, edit the matches, and count the edits with streamEditT

Find and capitalize no more than three letters in a string, and return the edited string along with the number of letters capitalized. To enable the editor function to remember how many letters it has capitalized, we'll run streamEditT in the State monad from the mtl package. Use this technique to get the same functionality as Python re.subn.

import qualified Control.Monad.State.Strict as MTL
import Control.Monad.State.Strict (get, put, runState)
import Data.Char (toUpper)

let editThree :: Char -> MTL.State Int T.Text
    editThree x = do
        i <- get
        if i<3
            then do
                put $ i+1
                pure $ T.singleton $ toUpper x
            else pure $ T.singleton x

flip runState 0 $ streamEditT (satisfy isLetter) editThree "a a a a a"
("A A A a a",3)

In the Shell

If we're going to have a viable sed replacement then we want to be able to use it easily from the command line. This Stack script interpreter script will find decimal numbers in a stream and replace them with their double.

#!/usr/bin/env stack
{- stack
  script
  --resolver lts-16
  --package attoparsec
  --package text
  --package text-show
  --package replace-attoparsec
-}
-- https://docs.haskellstack.org/en/stable/GUIDE/#script-interpreter

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import qualified Data.Text.IO as T
import TextShow
import Data.Attoparsec.Text
import Replace.Attoparsec.Text

main = T.interact $ streamEdit decimal (showt . (* (2::Integer)))

If you have The Haskell Tool Stack installed then you can just copy-paste this into a file named doubler.hs and run it. (On the first run Stack may need to download the dependencies.)

$ chmod u+x doubler.hs
$ echo "1 6 21 107" | ./doubler.hs
2 12 42 214

Alternatives

Some libraries that one might consider instead of this one.

http://hackage.haskell.org/package/regex-applicative

http://hackage.haskell.org/package/pcre-heavy

http://hackage.haskell.org/package/lens-regex-pcre

http://hackage.haskell.org/package/regex

http://hackage.haskell.org/package/pipes-parse

http://hackage.haskell.org/package/stringsearch

http://hackage.haskell.org/package/substring-parser

http://hackage.haskell.org/package/pcre-utils

http://hackage.haskell.org/package/template

https://github.com/RaminHAL9001/parser-sed-thing

http://hackage.haskell.org/package/attosplit

Benchmarks

These benchmarks are intended to measure the wall-clock speed of everything except the actual pattern-matching. Speed of the pattern-matching is the responsibility of the megaparsec and attoparsec libraries.

The benchmark task is to find all of the one-character patterns x in a text stream and replace them by a function which returns the constant string oo. So, like the regex s/x/oo/g.

We have two benchmark input cases, which we call dense and sparse.

The dense case is ten megabytes of alternating spaces and xs like

x x x x x x x x x x x x x x x x x x x x x x x x x x x x

The sparse case is ten megabytes of spaces with a single x in the middle like

                         x

Each benchmark program reads the input from stdin, replaces x with oo, and writes the result to stdout. The time elapsed is measured by perf stat, and the best observed time is recorded.

See replace-benchmark for details.

Program dense ms sparse ms
Python 3.10.9 re.sub repl function 557.22 35.47
Perl v5.36.0 s///ge function 1208.66 12.61
Replace.Megaparsec.streamEdit String 2921.25 2911.81
Replace.Megaparsec.streamEdit ByteString 3743.25 757.21
Replace.Megaparsec.streamEdit Text 3818.47 881.69
Replace.Attoparsec.ByteString.streamEdit 3006.38 179.66
Replace.Attoparsec.Text.streamEdit 3062.43 300.13
Replace.Attoparsec.Text.Lazy.streamEdit 3102.15 241.58
Text.Regex.Applicative.replace String 13875.25 4330.52
Text.Regex.PCRE.Heavy.gsub Text 113.27
Control.Lens.Regex.ByteString.match 117.05
Control.Lens.Regex.Text.match 35.97

Hypothetically Asked Questions

  1. Could we write this library for parsec?

    No, because the match combinator doesn't exist for parsec. (I can't find it anywhere. Can it be written?)

  2. Is this a good idea?

    You may have heard it suggested that monadic parsers are better for pattern-matching when the input stream is mostly signal, and regular expressions are better when the input stream is mostly noise.

    The premise of this library is that monadic parsers are great for finding small signal patterns in a stream of otherwise noisy text.

    Our reluctance to forego the speedup opportunities afforded by restricting ourselves to regular grammars is an old superstition about opportunities which remain mostly unexploited anyway. The performance compromise of allowing stack memory allocation (a.k.a. pushdown automata, a.k.a. context-free grammar) was once considered controversial for general-purpose programming languages. I think we can now resolve that controversy the same way for pattern matching languages.