tototoshi / scala-csv

CSV Reader/Writer for Scala

Slow on large csv files #11

Closed flamedmg closed 9 years ago

flamedmg commented 10 years ago

I tried to use this library with a 42 MB file. It takes forever just to run an empty reader (i.e. one that does no processing). Node.js, for example, processes the same file in less than a second.

I profiled the project a little and found that most of the time is spent in PagedSeq.

Please advise.

tototoshi commented 10 years ago

scala-csv is implemented with Scala's parser combinator library (scala.util.parsing.combinator), which is too slow for large files. Performance was not a priority when I chose it; I just wanted a CSV parser implemented only in Scala. Now I need to take the performance problem seriously and rewrite the parser in a different way for the next version of scala-csv.

pommedeterresautee commented 10 years ago

Hi @tototoshi ,

Is there some progress on this point?

I have a 50 GB CSV file (Youhou!), and I am looking for a Scala solution so I can enjoy the monad stuff if possible. Right now I split each line on the delimiter and use Akka to do it on several threads; I would prefer not to implement a clean parser myself. If you are no longer working on optimization, can you recommend another library? I am thinking of OpenCSV with a wrapper (like http://www.encodedknowledge.com/2012/04/reading-csv-files-in-scala-the-traversable-way/) but haven't tried it yet.

Regards

PS: an implementation idea based on Iteratees: https://jazzy.id.au/default/2012/11/06/iteratees_for_imperative_programmers.html Could Scalaz-Stream be useful too?

tototoshi commented 10 years ago

scala-csv v0.8.0 may be much faster than v1.0.0. Could you try it? It is a thin OpenCSV wrapper. The API difference between v0.8.0 and v1.0.0 is in how the CSV format is specified.

https://github.com/tototoshi/scala-csv/tree/0.8.0

pommedeterresautee commented 10 years ago

Thank you, I will try it this weekend and keep you informed of the results.

Are you working on an optimization?

Regards

haakonn commented 10 years ago

Just an additional datapoint: I've been trying scala-csv 1.0.0 on a CSV file of around 300MB, and it's not usable (as indicated here). I constantly get this:

Exception in thread "main" java.lang.StackOverflowError
at scala.collection.immutable.Page.latest(PagedSeq.scala:239)
at scala.collection.immutable.Page.latest(PagedSeq.scala:239)

whether I use reader.all() or reader.iterator().

For my use-case, speed is not a major concern, but I need to be able to handle CSV files of up to several gigabytes.

tototoshi commented 10 years ago

I found a problem in the CSVReader class.

https://github.com/tototoshi/scala-csv/blob/1.0.0/src/main/scala/com/github/tototoshi/csv/CSVReader.scala#L32

This loads the entire input into memory at once, so it consumes a huge amount of memory on large input.
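
For contrast, a minimal sketch of reading rows one line at a time, so that only the current line is held in memory. This is not scala-csv's code: `StreamingRows` and `foreachRow` are hypothetical names, and the sketch assumes a UTF-8 file with no quoted fields (so no embedded delimiters or newlines).

```scala
import java.nio.file.Path
import java.util.regex.Pattern
import scala.io.Source

object StreamingRows {
  // Hypothetical helper: calls `f` once per row, reading the file lazily.
  // Assumption: fields are never quoted, so a plain split is enough.
  def foreachRow(path: Path, delimiter: Char = ',')(f: Array[String] => Unit): Unit = {
    // Quote the delimiter so characters like '|' are not treated as regex syntax.
    val sep = Pattern.quote(delimiter.toString)
    val source = Source.fromFile(path.toFile, "UTF-8")
    try {
      // getLines() is an iterator; each line is consumed once and then discarded.
      // The -1 limit keeps trailing empty fields such as the last one in "1,2,".
      source.getLines().foreach(line => f(line.split(sep, -1)))
    } finally {
      source.close()
    }
  }
}
```

Memory use stays roughly constant with file size because nothing accumulates unless the callback itself collects rows.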

cblage commented 10 years ago

@tototoshi any estimate when you might be releasing the above fix? Thanks!

tototoshi commented 10 years ago

@cblage I have published a snapshot version for now. Please try it.

resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

libraryDependencies += "com.github.tototoshi" %% "scala-csv" % "1.1.0-SNAPSHOT"

xuwei-k commented 10 years ago

related?

cblage commented 10 years ago

@tototoshi sorry I kind-of lost track of this. I'll try to give it a shot but it turns out that the size of the CSV files I'll be processing shouldn't warrant this optimization :)

andyczerwonka commented 10 years ago

@tototoshi you might want to have a look at Parboiled2.

sthomp commented 9 years ago

I was getting the java.lang.StackOverflowError with 1.0.0, which made it impossible to read large CSV files. 1.1.0-SNAPSHOT does seem to fix this, so that part of the issue should be resolved. However, I can attest that reading the file is painfully slow. I've tried both the 'iterator' and 'stream' methods. Using scala.io.Source.fromFile(..) is several times faster.

tototoshi commented 9 years ago

https://github.com/tototoshi/scala-csv/commit/d02621232db4830d1698a3f1ccfcc6e7788a6bb6 I stopped using the parser combinators and added a new implementation. I'm still not satisfied, but it seems to be much faster than before.
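
To illustrate the general idea of parsing without combinators, here is a minimal sketch of a hand-written, single-pass field splitter. It is not the code from the commit above: `LineParser` and `parseLine` are hypothetical names, and the sketch assumes one record per line (no newlines inside quoted fields) with quotes escaped by doubling.

```scala
object LineParser {
  // Hypothetical single-pass parser for one CSV line.
  // Tracks only one bit of state: whether we are inside a quoted field.
  def parseLine(line: String, delimiter: Char = ',', quote: Char = '"'): List[String] = {
    val fields = List.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == quote) {
          if (i + 1 < line.length && line.charAt(i + 1) == quote) {
            // A doubled quote inside a quoted field is a literal quote character.
            current.append(quote)
            i += 1
          } else {
            inQuotes = false // closing quote
          }
        } else {
          current.append(c) // delimiters inside quotes are ordinary characters
        }
      } else if (c == quote) {
        inQuotes = true // opening quote
      } else if (c == delimiter) {
        fields += current.toString // end of field
        current.clear()
      } else {
        current.append(c)
      }
      i += 1
    }
    fields += current.toString // last field, possibly empty
    fields.result()
  }
}
```

Each character is visited once and no backtracking is needed, which is the main reason a loop like this tends to be cheaper than a combinator-based parser.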

tototoshi commented 9 years ago

Released 1.1.0.