haskell / attoparsec

A fast Haskell library for parsing ByteStrings
http://hackage.haskell.org/package/attoparsec

Parser makes Ubuntu crash on 1.2G file #130

Open apraga opened 7 years ago

apraga commented 7 years ago

Hi,

I've implemented a parser using Attoparsec, which works very well. Unfortunately, running it on a large file (1.2 GB) crashes my machine, which runs Ubuntu 16.04.2 LTS. I'm using attoparsec 0.13.1.0 with stack.

Below is the complete code for the parser. As an example, a small file is attached to give an idea of the file format: small_test.txt

If anyone is interested, I can provide the large file that makes the parser crash. Thanks.

{-# LANGUAGE DeriveDataTypeable, OverloadedStrings #-}
import Control.Applicative
import Control.Monad (void)
import Data.List
import Data.Scientific as S hiding (scientific)
import Data.Text.Lazy as T hiding (map, count)
import Data.Text.Lazy.IO as TIO
import Prelude hiding (exponent, id)
import Data.Attoparsec.Text.Lazy
import System.Console.CmdArgs
import System.Environment

-- Reading ploc using Attoparsec : fast but not helpful error messages.
-- For debug, use parseTest and ghci for each component.
--
-- The file format is 
-- TIME
-- HEADER
-- [PARTICLE]
--
-- with 
--
-- TIME = realtime = FLOAT [gamt = FLOAT]
-- HEADER = PART # XX YY ANGZ | ZZ  ALPHA BETA GAMMA ADX ADY ADZ
-- PARTICLE = INT FLOAT*9
data Particle = Particle {
  id :: Integer,
  pos :: [Scientific],
  ad :: [Integer]
}

data Iteration = Iteration {
  realtime :: Scientific,
  particles :: [Particle]
}

toText :: Show a => a -> T.Text
toText = T.pack . show

addComma x = T.intercalate "," $ map toText x

printPart :: Particle -> T.Text
printPart (Particle i p a) =  T.intercalate "," l
    where l = [toText i, addComma p, addComma a]

printIter :: Iteration -> T.Text
printIter (Iteration t p) = T.intercalate "\n" $ map format p
      where format x = T.concat [toText t, ",", printPart x]

signedInt :: Parser Integer
signedInt = signed decimal

mySep1 = some $ char ' '

mySep = many space 

gamt = mySep >> asciiCI "gamt =" >> mySep >> scientific

-- time :: Parser Scientific
time = do
  mySep >> asciiCI "REALTIME =" >> mySep 
  t <- scientific 
  t' <- option 0 gamt
  return t

-- Helper
stringify x = mySep >> asciiCI x

-- Two headers are possible : the 5th column can be "zz" or "ANGZ"
header = do 
  mapM_ stringify header0 
  mySep *> (asciiCI "zz" <|> asciiCI "ANGZ")
  mapM_ stringify header1 
  where 
    header0 = [ "PART", "#" , "XX", "YY"]
    header1 = [ "ALPHA", "BETA", "GAMMA" , "ADX", "ADY", "ADZ"]

-- Read a particle coordinates
part :: Parser Particle
part = do
  id <- mySep >> decimal <* mySep1
  coord <- count 6 (scientific <* mySep1)
  asd <-  sepBy signedInt mySep1 
  return $ Particle id coord asd

emptyLine = mySep >> endOfLine

 -- Read an iteration
iter :: Parser Iteration
iter = do 
  t <- time <* endOfLine
  header  >> endOfLine
  allPart <- sepBy part endOfLine
  return $ Iteration t allPart

parseExpr = space >> sepBy iter space 

readExpr input = case eitherResult . parse parseExpr $ input of
  Left err -> error "failed to read"
  Right val -> val
-- 
data ParserArgs = ParserArgs { input :: String
                             , output :: FilePath } 
                   deriving (Show, Data, Typeable)

parserArgs = ParserArgs { 
                input = def &= argPos 0 &= typ "INPUT"
                , output = def &= argPos 1 &= typ "OUTPUT"
                }

main = do
  args <- cmdArgs parserArgs
  txt <- TIO.readFile $ input args
  let d = readExpr txt
  let result = T.intercalate "\n" $ map printIter d
  TIO.writeFile (output args) result
  print "done"

bgamari commented 7 years ago

What precisely do you mean by crash? Keep in mind that heap representations (especially your particular Particle representation) are generally larger than their on-disk representation. Are you certain you aren't simply running out of memory?
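
To illustrate the point about heap size: in the record above, `pos :: [Scientific]` and `ad :: [Integer]` keep every value as a separate heap object reached through a list cons cell, so a particle that fits on one short line on disk occupies many heap words. A minimal sketch of a more compact layout follows. It assumes (this is not stated in the thread) that `Double` precision is enough for the six coordinates and that the id and the three AD values fit in a machine `Int`; `CompactParticle` and `fromColumns` are hypothetical names.

```haskell
-- Hypothetical compact counterpart of the Particle record from the issue.
-- Assumption: Double is precise enough and all integers fit in an Int.
-- Strict, unpacked fields are stored inline in the constructor instead of
-- as pointers to separately allocated boxes and list cells.
data CompactParticle = CompactParticle
  { cpId    :: {-# UNPACK #-} !Int
  , cpXX    :: {-# UNPACK #-} !Double
  , cpYY    :: {-# UNPACK #-} !Double
  , cpZZ    :: {-# UNPACK #-} !Double
  , cpAlpha :: {-# UNPACK #-} !Double
  , cpBeta  :: {-# UNPACK #-} !Double
  , cpGamma :: {-# UNPACK #-} !Double
  , cpAdx   :: {-# UNPACK #-} !Int
  , cpAdy   :: {-# UNPACK #-} !Int
  , cpAdz   :: {-# UNPACK #-} !Int
  } deriving (Show, Eq)

-- Build a compact particle from already-parsed columns.
-- Returns Nothing unless there are exactly six coordinates and three AD values.
fromColumns :: Int -> [Double] -> [Int] -> Maybe CompactParticle
fromColumns i [xx, yy, zz, al, be, ga] [ax, ay, az] =
  Just (CompactParticle i xx yy zz al be ga ax ay az)
fromColumns _ _ _ = Nothing
```

Whether this layout is acceptable depends on the precision the data needs; if exact decimal values matter, `Scientific` would have to stay.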

apraga commented 7 years ago

Thanks for the quick answer. By "crash", I mean the computer freezes and becomes unresponsive.

I've monitored memory usage and you are right: I'm running out of memory. Is there a way to decrease the memory usage of my program?

bgamari commented 7 years ago

Looking at your program, a few things stand out:

bgamari commented 7 years ago

@alexDarcy, see https://github.com/bgamari/memory-reduction for a few examples. Come find me in #haskell on irc.freenode.net if you want to chat about your problem.
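
One general way to bound memory, sketched here independently of the examples linked above: `parseExpr` collects every iteration with `sepBy` before anything is written, so the whole result must be in memory at once. The lazy `parse` from `Data.Attoparsec.Text.Lazy` returns the unconsumed input in its `Done` result, which allows running the parser once per record and handing each result to an output action immediately. In the sketch below, `streamParse` is a hypothetical helper and `record` is a stand-in for the `iter` parser from the issue.

```haskell
import           Data.Attoparsec.Text.Lazy (Parser, Result (..), decimal,
                                            parse, skipSpace)
import           Data.Char                 (isSpace)
import qualified Data.Text.Lazy            as TL

-- Stand-in for the issue's `iter` parser: one integer, leading spaces skipped.
record :: Parser Int
record = skipSpace *> decimal

-- Run the parser repeatedly on the remaining input, passing each result to
-- `emit` as soon as it is available instead of building one large list.
-- Assumption: the parser consumes some input whenever it succeeds,
-- otherwise this loop would not terminate.
streamParse :: Parser a -> (a -> IO ()) -> TL.Text -> IO (Either String ())
streamParse p emit = go
  where
    go input
      | TL.all isSpace input = pure (Right ())  -- nothing but whitespace left
      | otherwise =
          case parse p input of
            Fail _ _ err -> pure (Left err)
            Done rest x  -> emit x >> go rest
```

In the program from the issue, `emit` could write one formatted iteration to an open output handle, so that each `Iteration` can be garbage collected once it has been written. This only helps if nothing else holds on to the full input or to earlier results.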