Add DataFrame.Describe for reporting summary statistics

danicat commented 7 years ago

This PR adds some summary statistics helper functions to the Series struct type and the DataFrame.Describe function to obtain summary statistics for the given DataFrame. This is intended to replicate the behavior of the Pandas' DataFrame.describe() function in python.

kniren commented 7 years ago

Hello,

Thank you for your contribution, this looks great so far!

I've been checking your tests and the first thing that comes to mind is that the result from the Describe method outputs all columns as string columns:

[8x5] DataFrame

    column   A        B        C        D
 0: count    4        4        4        4
 1: mean     NaN      3.250000 6.050000 0.500000
 2: stddev   NaN      0.957427 0.818535 0.577350
 3: min      a        2        5.100000 false
 4: 25%      NaN      2.000000 5.100000 0.000000
 5: 50%      NaN      3.000000 6.000000 0.000000
 6: 75%      NaN      4.000000 6.000000 1.000000
 7: max      c        4        7.100000 true
    <string> <string> <string> <string> <string>

Wouldn't it make more sense that these statistics return float columns instead? This would mean that things like min and max will not return the actual min (In case of String or Bool elements), but a float element or NaN. We can discuss this, since I agree that having it as you programmed has its merits.

I'm not entirely sold on having a count description row, since it will be the same for all columns.

Another thing I'm thinking is that you implemented the Mean, Stdev, etc. as methods but I think if we are going to include this functions it makes more sense to have it using this prototype function: function(series.Series) series.Series. This would make it so that we can utilise these funcs on Rapply/Capply, for example.

The implementation of these statistic functions seem simple enough that maybe we should try to have our own implementation here instead of relying on gonum. This would remove the dependency of using an external library. However, I also believe that we might end up needing more functions from gonum in the future so for now I think we can leave the dependency here and adjust this for the next release.

Finally, make sure that you are working on the dev branch of the library, since I already noticed that trying to merge this in dev creates some conflicts and the tests throw errors.

Thank you very much for the hard work, I would gladly merge this when these issues are addressed.

Best, Alex

danicat commented 7 years ago

Hi Alex, following the order of your considerations:

The idea behind printing as a string was exactly to preserve the value of min/max.

Regarding the count column, yes, it does feel a bit strange at first glance, but I believe it's mostly because of the repetition. There is the scenario where the Describe function could be called on a DataFrame with a single series and then the output would not be so awkward. (Please see the Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)

I've implemented mean, stddev, quantile, etc as func(Series) value because they all return a single value for a Series. Are you proposing to encapsulate the single return value into a Series?

About the local implementation of the mean, stddev, etc... I believe we should stick with gonum for the same reasons we stick with numpy in python implementations. To avoid rework and also to make the community stronger... the more users gonum has the better support it will get from the community. I believe it's a win/win situation.

I'm sorry about using the master branch as a reference, I was not aware of the dev branch. I'm updating my code at this moment. I'm actually wondering, what are the benefits of creating the abstraction "Elements" over the underlying collection? That's the thing that broke my code, because I was using a range loop over s.elements. That's a simple fix, but I'm wondering about the design reasoning behind that change.

Best regards,

Daniela

kniren commented 7 years ago

I think that the best compromise for the count row, is that instead of showing the number of elements in a column, it shows the non NaN elements. This will justify including it, in my opinion.

Again, I agree with leaving the gonum implementation of the stat functions. It just seems a bit excessive to include a whole library just for those (This was at least my rationale when including the library just to be able to access the gonum.Mat interface, which I remove on the dev branch for the same reason). Once we get to more advanced summary functions, including gonum will be essential, so it is possible that I'm overthinking this. In any case, lets leave it there for now and we can debate this in the future on a separate issue.

Regarding the implementation of the series functions, yes I was suggesting encapsulating it so that they would return Series with a single element even if only a value is returned. This is thinking that if we are going to expand this library to include this kind of functions, people will want to use these for their own code. Let me know if you think this is also overthinking it!

Regarding the creation of the extra abstraction, I don't really remember what was my rationale at the time, it could have been optimization purposes (See issue #16), but this could have just been flawed thinking on my part.

Best, Alex

danicat commented 7 years ago

Hi Alex,

I think that the count of non-NAs should be controlled by a parameter (ie.: removeNAs or removeMissing or rmNA) that defaults to false. This is the same approach used by most statistical functions in R (see https://stat.ethz.ch/R-manual/R-devel/library/stats/html/sd.html for an example). I believe this could also be expanded to the other functions besides count.

And we could have some special treatment for series that are numeric and series that are factors. Here's an example of the summary output from R:

screen shot 2017-07-27 at 00 16 17

You can see that there are different behaviors depending on the underlying type of the Series. Please correct me if I'm wrong, but we don't have this concept of categorial variables (factors) implemented currently, right?

Also I'm wondering if we shouldn't make a distinction between NaN (not a number) and NA (missing value). Since NaN can be the result of a computation and NA implies that the data was not collected at all.

Regarding returning the values as a series, I think there is a bit of over engineering right now. I'd prefer to keep things simple at this moment because we would be adding extra complexity without a real world use case. Kind of in the same mindset of the evolution of the Go language, we need experience reports to make this kind of decision.

I'm also worried that returning a single value as a Series could be a little misleading. What if the developer applies a single value function like StdDev to one column of the DataFrame and a mapping function in the formyn = f(xn) to another column? How would we handle the different number of rows between two series of the same DataFrame?

So instead of taking a big bang approach, I'm advocating to keep the implementation simple at this moment and keep tabs on the usage pattern to properly model it's evolution in the future. Does that make sense to you?

Best, Dani

kniren commented 7 years ago

I get what you are saying about the removeNA and such, but I think we are getting a bit off topic here. If we are to include the count row on the summary, we can choose to do so as you proposed on your original implementation or we can treat it as a count of number of non NaN elements as I proposed. Neither of this implementation is necessarily better than the other, and to be honest I still believe that this is not so important, as you can always use DataFrame.Dims if you really need the dimensions of the DataFrame, so I would rather not include the count row for this method. If you really have a compelling argument for keeping the count row, then so be it, but I really don't see how is this better than just checking the Dims, Nrow or Ncol methods for that if you actually need it.

Regarding the summary example in the same fashion as R, this was my idea as well when thinking about summarising the data. This doesn't mean that we can not implement a DataFrame.Summary method in the future that will give us a string in the same fashion as R does, but I can see the usefulness of also having your Describe method that treats the variables as numeric when appropriate and returns a new DataFrame that can then be used for further filtering. So in summary, don't bother with separating different types of variables for now and keep it as you originally planned. Also, there is no concept of factors yet in the library, and I'm debating if they should be included, but this is a case for a different issue.

Regarding the use of NA or NaN, I can understand the use case of this, but to be honest it seems just to add complexity for very little value added. In so many real life occasions I've found that values that should have been read as NaN are treated just as NA or viceversa that unless you actually understand your data and how are you are getting these measurements it is not useful to have this distinction at all. Plus this is getting very off-topic, we can continue this discussion on a separate issue if you feel so inclined.

I agree with you on keeping things simple. Let's see what the users demand from the library before taking this kind of decision for now, so keep the way that you were returning single values for Series.

Best, Alex

danicat commented 7 years ago

Yes, I agree that this conversation is drifting from the subject, but still it's an important conversation so we should have it anytime. :)

Anyway, I removed the count from the current implementation of Describe.

kniren commented 7 years ago

PR merged, thank you very much and please do keep writing your insights about how can we improve the library!

Best, Alex

danicat commented 7 years ago

Definitely! Do you think there's anything missing in order to close #12 ? Do you need a hand in any other issue?

go-gota / gota

Add DataFrame.Describe for reporting summary statistics #36