Parser makes Ubuntu crash on 1.2G file #130

apraga commented 7 years ago


I've implemented a parser using Attoparsec, which works very well. Unfortunately, running it on a large file (1.2 GB) crashes my Ubuntu 16.04.2 LTS machine. I'm using attoparsec with stack.

Below is the complete code for the parser. As an example, a small file is also attached to give an idea of the file format: small_test.txt

If someone is interested, I can provide the large file that makes the parser crash. Thanks.

{-# LANGUAGE DeriveDataTypeable, OverloadedStrings #-}
import Control.Applicative
import Control.Monad (void)
import Data.List
import Data.Scientific as S hiding (scientific)
import Data.Text.Lazy as T hiding (map, count)
import Data.Text.Lazy.IO as TIO
import Prelude hiding (exponent, id)
import Data.Attoparsec.Text.Lazy
import System.Console.CmdArgs
import System.Environment

-- Reading ploc using Attoparsec : fast but not helpful error messages.
-- For debug, use parseTest and ghci for each component.
-- The file format is 
-- with 
-- TIME = realtime = FLOAT [gamt = FLOAT]
data Particle = Particle {
  id :: Integer,
  pos :: [Scientific],
  ad :: [Integer]
}

data Iteration = Iteration {
  realtime :: Scientific,
  particles :: [Particle]
}

toText :: Show a => a -> T.Text
toText = T.pack . show

addComma x = T.intercalate "," $ map toText x

printPart :: Particle -> T.Text
printPart (Particle i p a) =  T.intercalate "," l
    where l = [toText i, addComma p, addComma a]

printIter :: Iteration -> T.Text
printIter (Iteration t p) = T.intercalate "\n" $ map format p
      where format x = T.concat [toText t, ",", printPart x]

signedInt :: Parser Integer
signedInt = signed decimal

mySep1 = some $ char ' '

mySep = many space 

gamt = mySep >> asciiCI "gamt =" >> mySep >> scientific

-- time :: Parser Scientific
time = do
  mySep >> asciiCI "REALTIME =" >> mySep 
  t <- scientific 
  t' <- option 0 gamt
  return t

-- Helper
stringify x = mySep >> asciiCI x

-- Two headers are possible : the 5th column can be "zz" or "ANGZ"
header = do 
  mapM_ stringify header0 
  mySep *> (asciiCI "zz" <|> asciiCI "ANGZ")
  mapM_ stringify header1 
  where
    header0 = [ "PART", "#" , "XX", "YY"]
    header1 = [ "ALPHA", "BETA", "GAMMA" , "ADX", "ADY", "ADZ"]

-- Read a particle coordinates
part :: Parser Particle
part = do
  id <- mySep >> decimal <* mySep1
  coord <- count 6 (scientific <* mySep1)
  asd <-  sepBy signedInt mySep1 
  return $ Particle id coord asd

emptyLine = mySep >> endOfLine

 -- Read an iteration
iter :: Parser Iteration
iter = do 
  t <- time <* endOfLine
  header  >> endOfLine
  allPart <- sepBy part endOfLine
  return $ Iteration t allPart

parseExpr = space >> sepBy iter space 

readExpr input = case eitherResult . parse parseExpr $ input of
  -- Include attoparsec's message so a failure is easier to diagnose
  Left err -> error ("failed to read: " ++ err)
  Right val -> val

data ParserArgs = ParserArgs { input :: String
                             , output :: FilePath } 
                   deriving (Show, Data, Typeable)

parserArgs = ParserArgs { 
                input = def &= argPos 0 &= typ "INPUT"
                , output = def &= argPos 1 &= typ "OUTPUT"
                }

main = do
  args <- cmdArgs parserArgs
  txt <- TIO.readFile $ input args
  let d = readExpr txt
  let result = T.intercalate "\n" $ map printIter d
  TIO.writeFile (output args) result
  print "done"

bgamari commented 7 years ago

What precisely do you mean by crash? Keep in mind that heap representations (especially your particular Particle representation) are generally larger than their on-disk representation. Are you certain you aren't simply running out of memory?
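
To illustrate the point about heap representation: in the posted `Particle`, every coordinate lives in a list of boxed `Scientific` values, so each one costs a cons cell plus a separate heap object, several machine words in total, where the file stores a dozen or so characters. Below is a minimal sketch of a flatter record. The names `CompactParticle` and `fromLists` are hypothetical, and the sketch assumes (the thread does not confirm this) that `Double` precision is sufficient, that ids and AD values fit in an `Int`, and that every row has exactly three AD values, as the header columns suggest.

```haskell
-- Hypothetical compact alternative to the Particle record in the post.
-- Assumptions (not confirmed in the thread): Double precision is enough for
-- the six coordinates, ids and AD values fit in Int, and every row has
-- exactly three AD values (ADX, ADY, ADZ).
data CompactParticle = CompactParticle
  { cpId :: {-# UNPACK #-} !Int
  , cpX, cpY, cpZ, cpAlpha, cpBeta, cpGamma :: {-# UNPACK #-} !Double
  , cpAdX, cpAdY, cpAdZ :: {-# UNPACK #-} !Int
  } deriving (Eq, Show)

-- | Build a compact record from list-shaped fields, rejecting rows whose
-- lists do not hold exactly six coordinates and three AD values.
fromLists :: Int -> [Double] -> [Int] -> Maybe CompactParticle
fromLists i [x, y, z, a, b, g] [ax, ay, az] =
  Just (CompactParticle i x y z a b g ax ay az)
fromLists _ _ _ = Nothing
```

With strict, unpacked fields the ten numbers are stored inline in a single heap object instead of being spread over list cells and boxes. In the parser itself, attoparsec's `double` could stand in for `scientific` to produce `Double` values directly, if the loss of arbitrary precision is acceptable.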

apraga commented 7 years ago

Thanks for the quick answer. By "crash", I mean the computer freezes and becomes unresponsive.

I've monitored memory usage and you are right, I'm running out of memory. Is there a way to decrease the memory usage of my program?

bgamari commented 7 years ago

Looking at your program, a few things stand out:

bgamari commented 7 years ago

@alexDarcy, see https://github.com/bgamari/memory-reduction for a few examples. Come find me in #haskell on irc.freenode.net if you want to chat about your problem.
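
One common way to reduce memory in a program shaped like the one posted, offered here as a sketch and not taken from the linked repository, is to avoid building the full list of iterations. `sepBy iter space` only returns once every iteration has been parsed, so the whole result is held in memory at once. The lazy `parse` from `Data.Attoparsec.Text.Lazy` returns the unconsumed input in its `Done` constructor, which makes it possible to parse one item, process it, and continue with the rest. The helper name `streamItems` is hypothetical.

```haskell
import qualified Data.Attoparsec.Text.Lazy as A
import qualified Data.Text.Lazy as TL

-- | Run an item parser repeatedly over lazy input, handing each parsed item
-- to an action before the next one is parsed. Leading whitespace before each
-- item is skipped. Returns Left with attoparsec's message on a parse failure.
-- (streamItems is a hypothetical helper, not part of attoparsec. The item
-- parser is assumed to consume input whenever it succeeds.)
streamItems :: A.Parser a -> (a -> IO ()) -> TL.Text -> IO (Either String ())
streamItems item act = go
  where
    go input
      -- Only whitespace (or nothing) left: we are done.
      | TL.null (TL.stripStart input) = return (Right ())
      | otherwise =
          case A.parse (A.skipSpace *> item) input of
            A.Done rest x  -> act x >> go rest
            A.Fail _ _ msg -> return (Left msg)
```

With the definitions from the original post, the call might look like `streamItems iter (TIO.hPutStrLn h . printIter) txt`, where `h` is an output handle opened with `withFile`. Peak memory would then be bounded by the largest single iteration, provided nothing else keeps a reference to the start of `txt`.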