go-gota / gota

Gota: DataFrames and data wrangling in Go (Golang)

Discussion/Thoughts: Mutation of series or dataframes #108

Open mannharleen opened 4 years ago

mannharleen commented 4 years ago

This is a placeholder to discuss how we should treat operations on series and dataframes that amend the underlying data.

For example, for series we have the following functions:

Question: Is there merit in discussing how the project should treat such operations for series and dataframes? Or is there already an understanding?

typeless commented 4 years ago

What are the motivations/goals of this issue? Are you concerned about, say, performance or API design?

mannharleen commented 4 years ago

I started off from an API design point of view, but I believe the bigger question is performance. For instance, having a Map operation for Series is great, but should it return a new Series or map in place? What works for gota, and why?
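To make the two options concrete, here is a minimal sketch of both styles. `FloatSeries`, `Map`, and `MapInPlace` are hypothetical names chosen for illustration and are not gota's actual API.

```go
package main

import "fmt"

// FloatSeries is a hypothetical stand-in for a series type.
// It is NOT gota's series.Series; it exists only to illustrate the question.
type FloatSeries struct {
	values []float64
}

// Map returns a new FloatSeries and leaves the receiver untouched.
// Cost: one allocation of len(s.values) elements per call.
func (s FloatSeries) Map(f func(float64) float64) FloatSeries {
	out := make([]float64, len(s.values))
	for i, v := range s.values {
		out[i] = f(v)
	}
	return FloatSeries{values: out}
}

// MapInPlace overwrites the receiver's elements and allocates nothing.
// Cost: the original values are lost unless the caller copied them first.
func (s *FloatSeries) MapInPlace(f func(float64) float64) {
	for i, v := range s.values {
		s.values[i] = f(v)
	}
}

func main() {
	double := func(v float64) float64 { return v * 2 }

	s := FloatSeries{values: []float64{1, 2, 3}}
	mapped := s.Map(double)
	fmt.Println(s.values, mapped.values) // s is unchanged here

	s.MapInPlace(double)
	fmt.Println(s.values) // s itself has now been modified
}
```

The copying variant pays for an allocation on every call, while the in-place variant shifts the burden of preserving the original onto the caller.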

typeless commented 4 years ago

Having immutable data has some benefits. For instance, when chaining multiple operations over a series, we don't have to manually clone the operands in advance. However, I don't oppose the idea of supplementary APIs for in-place updates.
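As a rough illustration of the chaining point, the sketch below contrasts the two styles. The `Series` type and its methods are hypothetical and assume both operands have the same length; they are not taken from gota.

```go
package main

import "fmt"

// Series is a hypothetical integer series used only for illustration.
// All binary operations assume both operands have the same length.
type Series []int

// Add returns a new Series; neither operand is modified.
func (s Series) Add(o Series) Series {
	out := make(Series, len(s))
	for i := range s {
		out[i] = s[i] + o[i]
	}
	return out
}

// Scale returns a new Series with every element multiplied by k.
func (s Series) Scale(k int) Series {
	out := make(Series, len(s))
	for i := range s {
		out[i] = s[i] * k
	}
	return out
}

// AddInPlace adds o to s element by element, overwriting s.
func (s Series) AddInPlace(o Series) {
	for i := range s {
		s[i] += o[i]
	}
}

// Clone returns an independent copy of s.
func (s Series) Clone() Series {
	return append(Series(nil), s...)
}

func main() {
	a, b := Series{1, 2, 3}, Series{10, 20, 30}

	// Immutable style: chain freely; a and b remain usable afterwards.
	sum := a.Add(b).Scale(2)

	// In-place style: clone first, otherwise a would be overwritten.
	tmp := a.Clone()
	tmp.AddInPlace(b)

	fmt.Println(a, sum, tmp)
}
```

With the in-place variant, forgetting the `Clone` call silently changes `a` for every later use, which is the manual bookkeeping the immutable design avoids.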

Regarding performance, I propose that we should make the individual elements of a series unexposed. So, we can store the elements in flat memory layout (except for strings), rather than a slice of interfaces pointing to heap values.
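A minimal sketch of the two layouts, to show what "unexposed elements in flat memory" could look like. All type and function names here (`element`, `boxedSeries`, `flatSeries`, and so on) are hypothetical and do not reflect gota's internals or the experimental PR.

```go
package main

import "fmt"

// element is a hypothetical boxed value accessed through an interface.
type element interface {
	Float() float64
}

type floatElement struct{ v float64 }

func (e *floatElement) Float() float64 { return e.v }

// boxedSeries holds a slice of interfaces, each pointing to its own heap value.
type boxedSeries struct{ elems []element }

func newBoxed(vs ...float64) boxedSeries {
	elems := make([]element, len(vs))
	for i, v := range vs {
		elems[i] = &floatElement{v: v} // one separate allocation per element
	}
	return boxedSeries{elems: elems}
}

// Sum goes through an interface method call and a pointer for every element.
func (s boxedSeries) Sum() float64 {
	total := 0.0
	for _, e := range s.elems {
		total += e.Float()
	}
	return total
}

// flatSeries keeps its values contiguous and unexported, so callers can only
// reach them through methods and the layout can change without breaking them.
type flatSeries struct{ vals []float64 }

func newFlat(vs ...float64) flatSeries {
	return flatSeries{vals: append([]float64(nil), vs...)} // one allocation in total
}

// At returns the i-th value without exposing the backing slice.
func (s flatSeries) At(i int) float64 { return s.vals[i] }

// Sum walks a single contiguous block of memory.
func (s flatSeries) Sum() float64 {
	total := 0.0
	for _, v := range s.vals {
		total += v
	}
	return total
}

func main() {
	fmt.Println(newBoxed(1, 2, 3).Sum(), newFlat(1, 2, 3).Sum())
}
```

Both types compute the same result; the difference is that the flat version needs one allocation for the whole series and avoids per-element indirection, at the price of a separate concrete type per element kind.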

Edit: I have an experimental PR in my local repo with some preliminary refactoring for the aforementioned proposal. I thought it would break the APIs so much that I didn't expect to upstream it.

kniren commented 4 years ago

Immutability was a conscious decision during the API design. I understand the potential benefits of mutating in place in terms of performance and memory usage. However, this library is not necessarily focused on extracting the maximum amount of performance, but rather on providing a somewhat safe API for data manipulation.

If there is a real need for more performance, a lot more thought should be put into the memory layout of the data and other operations. As @typeless mentions, performance is bottlenecked by the way that the current memory model works. I initially designed for 'code reusability', but after a lot more experience with low-level programming, I'm not sure it was the right call.