flamedmg closed 9 years ago
scala-csv is implemented with Scala's parser combinator library (scala.util.parsing.combinator), but it is too slow for large files. Performance was not my concern when I chose it; I just wanted a CSV parser implemented purely in Scala. Now I need to take the performance problem seriously, and I plan to rewrite the parser in a different way for the next version of scala-csv.
Hi @tototoshi ,
Is there any progress on this point?
I have a 50 GB CSV file (Youhou!), and I am looking for a Scala solution so I can enjoy monad stuff if possible. Right now I am splitting lines on the delimiter and using Akka to do it on several threads; I would prefer not to implement a clean parser myself. If you are no longer working on the optimization, can you recommend another library? I am thinking of using OpenCSV with a wrapper (like http://www.encodedknowledge.com/2012/04/reading-csv-files-in-scala-the-traversable-way/), but I haven't tried it yet.
Regards
PS: an idea for an implementation based on Iteratees: https://jazzy.id.au/default/2012/11/06/iteratees_for_imperative_programmers.html Could Scalaz-Stream be useful too?
scala-csv v0.8.0 may be much faster than v1.0.0. Could you try it? It is a thin OpenCSV wrapper. The difference between the v0.8.0 and v1.0.0 APIs is the style of specifying the CSV format.
Thank you, I will try it this weekend and keep you informed of the results.
Are you working on an optimization?
Regards
Just an additional datapoint: I've been trying scala-csv 1.0.0 on a CSV file of around 300MB, and it's not usable (as indicated here). I constantly get this:
Exception in thread "main" java.lang.StackOverflowError
at scala.collection.immutable.Page.latest(PagedSeq.scala:239)
at scala.collection.immutable.Page.latest(PagedSeq.scala:239)
whether I use reader.all() or reader.iterator().
For my use-case, speed is not a major concern, but I need to be able to handle CSV files of up to several gigabytes.
I found a problem in the CSVReader class.
It loads all input into memory at once and consumes a huge amount of memory on large input.
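For readers following along, here is a minimal sketch of the difference between loading everything and streaming. It is not the scala-csv code; `LazyLines`, `lazyLines` and `countFields` are hypothetical names, and the comma split ignores quoting to keep the example short. The point is only that an Iterator over a BufferedReader keeps one line in memory at a time.

```scala
import java.io.{BufferedReader, Reader}

// Hypothetical helper, not part of scala-csv.
object LazyLines {
  // Exposes a Reader as a lazy Iterator of lines: each line is read
  // only when the consumer asks for it, so memory use stays flat.
  def lazyLines(in: Reader): Iterator[String] = {
    val buffered = new BufferedReader(in)
    Iterator.continually(buffered.readLine()).takeWhile(_ != null)
  }

  // Counts fields across all lines without materializing the input.
  // The split is naive (no quoting support); it is only for illustration.
  def countFields(in: Reader): Long =
    lazyLines(in).map(_.split(",", -1).length.toLong).sum
}
```

By contrast, calling something like `toList` on the iterator before processing would bring back the memory problem described above.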
@tototoshi any estimate when you might be releasing the above fix? Thanks!
@cblage Published snapshot version for now. Please try it.
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
libraryDependencies += "com.github.tototoshi" %% "scala-csv" % "1.1.0-SNAPSHOT"
@tototoshi sorry I kind-of lost track of this. I'll try to give it a shot but it turns out that the size of the CSV files I'll be processing shouldn't warrant this optimization :)
@tototoshi you might want to have a look at Parboiled2.
I was getting the java.lang.StackOverflowError using 1.0.0, which made the library useless for reading large CSV files. 1.1.0-SNAPSHOT does seem to fix this, so that part of the issue should be resolved. However, I can attest that it is painfully slow to read the file. I've tried both the 'iterator' and 'stream' methods. Using scala.io.Source.fromFile(..) is several times faster.
https://github.com/tototoshi/scala-csv/commit/d02621232db4830d1698a3f1ccfcc6e7788a6bb6 I stopped using parser combinators and added a new implementation. I'm still not satisfied, but it seems to be much faster than before.
Released 1.1.0.
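To illustrate what a parser without combinators can look like, here is a rough sketch of a hand-written state machine for a single CSV line. It is an assumption-laden example, not the code from the commit above: `LineParser` and `parseLine` are hypothetical names, and it does not handle fields that span multiple lines.

```scala
// Hypothetical sketch, not the scala-csv implementation.
object LineParser {
  // Splits one line into fields, honoring quoted fields and treating a
  // doubled quote inside a quoted field as a literal quote character.
  def parseLine(line: String, delimiter: Char = ',', quote: Char = '"'): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val field = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == quote) {
          if (i + 1 < line.length && line.charAt(i + 1) == quote) {
            field.append(quote) // escaped quote: consume both characters
            i += 1
          } else {
            inQuotes = false // closing quote
          }
        } else {
          field.append(c)
        }
      } else if (c == quote) {
        inQuotes = true // opening quote
      } else if (c == delimiter) {
        fields += field.toString // end of the current field
        field.clear()
      } else {
        field.append(c)
      }
      i += 1
    }
    fields += field.toString // last field has no trailing delimiter
    fields.result()
  }
}
```

A single pass with a mutable buffer like this avoids the deep recursion and backtracking that a combinator-based grammar can incur on large input.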
I tried to use this library with a 42 MB file, and it takes forever just to run an empty reader (i.e. with no processing in it). By comparison, it takes less than a second to process that file in Node.js.
I profiled the project a little and found that most of the time is consumed by PagedSeq.
Please advise.