picnicml / doddle-model

:cake: doddle-model: machine learning in Scala.
https://picnicml.github.io
Apache License 2.0
137 stars, 23 forks

Optimize CSV loading #106

Closed inejc closed 5 years ago

inejc commented 5 years ago

Addresses https://github.com/picnicml/doddle-model/issues/105 and removes a third-party dependency. A ~1M x 513 matrix now loads in a few minutes (previously I couldn't even measure the time). The downside is that we simply use row.split(","), so we cannot parse string values that contain commas (commas that don't separate columns). I'm happy to accept this limitation for the performance benefit, and we can improve it later if needed.
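
To make the limitation concrete, here is a minimal sketch of what a plain comma split does to a row that contains a quoted value. The object name `NaiveCsvSplit`, the helper `parseRow`, and the sample rows are hypothetical and only for illustration; they are not the code in this PR.

```scala
object NaiveCsvSplit {

  // Hypothetical helper: splits a row on every comma.
  // It has no notion of quoting, so commas inside a quoted value also split.
  def parseRow(row: String): Array[String] = row.split(",")

  def main(args: Array[String]): Unit = {
    // A purely numerical row splits into the expected three fields.
    val numeric = parseRow("1.0,2.5,3.0")
    println(numeric.length) // 3

    // A row with a quoted value containing a comma yields four fields
    // instead of three, because the comma inside the quotes is treated
    // as a column separator.
    val quoted = parseRow("1.0,\"Ljubljana, Slovenia\",3.0")
    println(quoted.length) // 4
    quoted.foreach(println)
  }
}
```

Datasets whose string columns never contain commas are unaffected by this behaviour.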

Loading of a dataset with only numerical features should be faster than the loading of a dataset with categoricals.

inejc commented 5 years ago

I'll merge this but keep https://github.com/picnicml/doddle-model/issues/105 open, as we want to improve the current solution.