go-gota / gota

Gota: DataFrames and data wrangling in Go (Golang)
Other
2.98k stars 277 forks source link

Allow encoding of certain columns to reduce memory usage #56

Open alrpal opened 6 years ago

alrpal commented 6 years ago

Memory consumption is an issue when dealing with large data sets.

Similar in memory columnar stores like python's panadas and microsoft's proprietary vertipaq engine for it's ssas products have the ability to minimize memory usage by using techniques such as:

  1. Value encoding - for numbers, vertipaq will calculate a number it can subtract from every row to lower the requirement of bits needed.

  2. Categorical encoding (dictionary encoding) - for strings, pandas and vertipaq will create a lookup table and use integers to represent the data therefore reducing number of bits.

More info: https://www.microsoftpressstore.com/articles/article.aspx?p=2449192&seqNum=3

This is a feature request for similar functionality.

kniren commented 4 years ago

I would love to have a better solution. To be honest the current implementation is not how I would do things anymore. I tried to be idiomatic and extensible, but I should have focused more on memory efficiency and performance. You live and learn I suppose, but dealing with this would require an entire architecture of the project. If someone is up for the challenge, I'll love to see what they come up with.