go-gota / gota

Gota: DataFrames and data wrangling in Go (Golang)

Add support for GroupBy and Summarize #33

Closed kyle-hamlin closed 6 years ago

kyle-hamlin commented 6 years ago

A fundamental feature of dataframes is grouping by one or more columns and summarizing other columns (mean, median, max, min, etc.). Are you thinking about implementing this functionality?

kniren commented 6 years ago

I agree that this is a very interesting feature to have and it is on the roadmap.

Unfortunately, I have not had much time lately to implement big improvements to Gota. If you come up with a good solution, we can discuss it here or via a PR. Otherwise, I will get to it when I have more free time.

In the meantime, Rapply and Capply let you apply functions to rows and columns, so you could split the groups yourself and then join them back together.
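As a rough illustration of the "split the groups yourself and join them back together" workaround, here is a minimal sketch that uses only the standard library. It deliberately does not call Gota: the row type and the splitByKey, mean and summarize helpers are hypothetical names made up for this example, and a real version would read the values out of a DataFrame instead of a plain slice.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a hypothetical stand-in for one record; it is not a Gota type.
type row struct {
	Key   string
	Value float64
}

// splitByKey partitions rows into one slice per distinct key.
func splitByKey(rows []row) map[string][]row {
	groups := make(map[string][]row)
	for _, r := range rows {
		groups[r.Key] = append(groups[r.Key], r)
	}
	return groups
}

// mean returns the arithmetic mean of the Value field.
// The caller must pass a non-empty slice.
func mean(rows []row) float64 {
	sum := 0.0
	for _, r := range rows {
		sum += r.Value
	}
	return sum / float64(len(rows))
}

// summarize applies f to every group and joins the results back into a
// single slice with one row per key, sorted by key for stable output.
func summarize(rows []row, f func([]row) float64) []row {
	groups := splitByKey(rows)
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	out := make([]row, 0, len(keys))
	for _, k := range keys {
		out = append(out, row{Key: k, Value: f(groups[k])})
	}
	return out
}

func main() {
	rows := []row{{"a", 1}, {"b", 10}, {"a", 3}, {"b", 20}}
	for _, r := range summarize(rows, mean) {
		fmt.Printf("%s %.1f\n", r.Key, r.Value)
	}
}
```

Passing the aggregation in as a function keeps the split and join steps reusable for median, max, min and so on.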

Make sure to check and comment on issue #13 for future developments regarding GroupBy, etc.

Best, Alex

kyle-hamlin commented 6 years ago

I've actually never written any Go, but I'm an avid pandas user. I was thinking this could be a good project for me to get my feet wet. If you have any starting design ideas or pointers that could help guide me, I would love to hear them. I will go over your code, think about how to implement GroupBy, and share my thoughts here as I work.

kniren commented 6 years ago

Awesome, I would love to get this implemented for sure. For a start, check issue #13, where I talk about this concept.

Go is a wonderful and sensible language, best of luck getting into it!

Essentially, GroupBy should create an internal index of the groups of rows that belong together. We could then extend existing functions to take these groups into account, so that, for example, sorting or function application is done on a per-group basis.
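To make the idea of an internal group index more concrete, here is a minimal sketch under stated assumptions. It is not Gota's design or API: groupIndex, buildGroupIndex and maxPerGroup are hypothetical names, the key column is modelled as a plain []string, and the value column as a []float64 of the same length.

```go
package main

import "fmt"

// groupIndex maps each distinct key to the positions of the rows that
// carry it. The type and its fields are hypothetical, not part of Gota.
type groupIndex struct {
	Keys []string         // distinct keys, in order of first appearance
	Rows map[string][]int // key -> row positions, in original row order
}

// buildGroupIndex scans the key column once and records the row
// positions belonging to each key.
func buildGroupIndex(keys []string) groupIndex {
	idx := groupIndex{Rows: make(map[string][]int)}
	for i, k := range keys {
		if _, seen := idx.Rows[k]; !seen {
			idx.Keys = append(idx.Keys, k)
		}
		idx.Rows[k] = append(idx.Rows[k], i)
	}
	return idx
}

// maxPerGroup is one example of a group-aware operation built on the
// index: it visits only the rows of each group and keeps the largest
// value. values must have the same length as the indexed key column.
func maxPerGroup(idx groupIndex, values []float64) map[string]float64 {
	out := make(map[string]float64, len(idx.Keys))
	for _, k := range idx.Keys {
		positions := idx.Rows[k] // never empty for a key in idx.Keys
		best := values[positions[0]]
		for _, p := range positions[1:] {
			if values[p] > best {
				best = values[p]
			}
		}
		out[k] = best
	}
	return out
}

func main() {
	city := []string{"Oslo", "Rome", "Oslo", "Rome", "Oslo"}
	temp := []float64{3, 18, 7, 21, 5}

	idx := buildGroupIndex(city)
	for _, k := range idx.Keys {
		fmt.Println(k, idx.Rows[k])
	}
	fmt.Println(maxPerGroup(idx, temp))
}
```

Storing row positions rather than copies of the rows means the underlying columns stay untouched, and other operations such as per-group sorting could reuse the same index.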

I encourage you to start contributing small, since that also makes my life much easier when reviewing the code. For a start, submit just the creation of the group index as a PR.

To contribute to the project, make sure to work on the dev branch and submit your PRs there. All code for major features should have at least a sensible amount of unit testing using Go's testing capabilities. Furthermore, go test, golint and go vet should not report any errors; the linter will also enforce the preferred documentation practices for exported functions. gofmt is mandatory as well, so you should probably just run it automatically on save.

I urge you to comment on issue #13 instead of this one, which I closed to avoid duplicate issues.

Thanks for the interest and let me know your thoughts!