picnicml / doddle-model

:cake: doddle-model: machine learning in Scala.
https://picnicml.github.io
Apache License 2.0

Optimize performance of CSVLoader #105

Open inejc opened 5 years ago

inejc commented 5 years ago

The current implementation is very slow. I think a better approach would be to implement a custom solution rather than rely on a third-party library.

plokhotnyuk commented 5 years ago

@inejc feel free to pick routines and tricks from the jsoniter-scala-core module.

Here are benchmark results for estimating the possible throughput and allocations.

inejc commented 5 years ago

@plokhotnyuk thanks for the pointers! I will look at your solution. Are you perhaps aware of any existing, efficient CSV loading libraries on the JVM?

plokhotnyuk commented 5 years ago

There are a lot of solutions for Java: https://github.com/uniVocity/csv-parsers-comparison

But a custom codec based on jsoniter-scala-core outperforms them greatly when numbers and strings are represented as JSON values. That requires wrapping all string values in " characters, using UTF-8 encoding or hexadecimal escaping for non-ASCII characters, and not using numbers with leading zeroes.

If an implementation that is locked to the JSON representation of strings and numbers is not acceptable, you can fork it and replace that part with one for other rules and encoding formats, using the same approaches and hacks.
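
To make the constraints above concrete, here is a minimal sketch of a hand-written, single-pass row parser in plain Scala. It is not code from jsoniter-scala-core or doddle-model: `JsonStyleCsvRow`, `parseRow` and `hasLeadingZero` are hypothetical names, and the sketch assumes that every field is either a string wrapped in " characters or an unquoted number. It does not handle escape sequences inside strings or empty fields, and it has not been tuned or benchmarked.

```scala
object JsonStyleCsvRow {

  // Hypothetical helper: true for tokens such as "007" or "-01",
  // which JSON does not allow as numbers.
  private def hasLeadingZero(token: String): Boolean = {
    val digits = if (token.startsWith("-")) token.substring(1) else token
    digits.length > 1 && digits.charAt(0) == '0' && digits.charAt(1).isDigit
  }

  // Parses one row into Left(string) or Right(number) fields.
  // Assumption: strings contain no escaped quotes, so the next '"'
  // always closes the string.
  def parseRow(line: String): Vector[Either[String, Double]] = {
    val fields = Vector.newBuilder[Either[String, Double]]
    val n = line.length
    var i = 0
    var expectField = true
    while (expectField) {
      if (i < n && line.charAt(i) == '"') {
        // Quoted string: commas inside the quotes belong to the value.
        val end = line.indexOf('"', i + 1)
        require(end >= 0, s"unterminated string starting at index $i")
        fields += Left(line.substring(i + 1, end))
        i = end + 1
      } else {
        // Unquoted number: runs up to the next comma or the end of the line.
        val comma = line.indexOf(',', i)
        val end = if (comma < 0) n else comma
        val token = line.substring(i, end)
        require(!hasLeadingZero(token), s"number with leading zero: $token")
        // Note: toDouble is more permissive than the JSON number grammar.
        fields += Right(token.toDouble)
        i = end
      }
      if (i < n) {
        require(line.charAt(i) == ',', s"expected ',' at index $i")
        i += 1 // skip the separator and read another field
      } else {
        expectField = false
      }
    }
    fields.result()
  }
}
```

For example, `JsonStyleCsvRow.parseRow("\"a,b\",1.5,-2")` returns `Vector(Left("a,b"), Right(1.5), Right(-2.0))`, while a row containing `007` is rejected with an `IllegalArgumentException`.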

inejc commented 5 years ago

I merged https://github.com/picnicml/doddle-model/pull/106 but I'm keeping this issue open, as we still want to improve the current solution, preferably by looking into the examples given by @plokhotnyuk.