semio / ddf_utils

Utilities for working with DDF datasets
https://open-numbers.github.io/
MIT License

ddf diff cannot handle multiple on_key options #131

Open semio opened 3 years ago

semio commented 3 years ago

Some diff stats, such as rval, require grouping the data by one dimension (usually geo) before computing the stats. But sometimes there is more than one possible grouping dimension. For example, in SG we have datapoints by geo and by global/regions, so some datapoint files should be grouped by global/region first. But I can only supply one value to the on_key option of ddf diff, which causes an error.

jheeffer commented 3 years ago

Isn't it always grouped by all keys except time, if time is in the key? So within each group only time changes in the key?

I guess rval only makes sense when the data is a time series?

semio commented 3 years ago

No, rval can be used to compare other types of data. Our goal is to tell how different the new and old datapoints are, so in fact we just need to ensure that we are comparing the same observation for each datapoint, which means there is no need to do any grouping at all.
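A minimal sketch of that idea, assuming rval is the Pearson correlation coefficient (as the discussion linked in this thread suggests). The function names and the dict-based data layout are hypothetical and are not part of ddf_utils; each datapoint is identified by its full key, and old and new values are paired on that key with no grouping step.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant data
    return cov / math.sqrt(var_x * var_y)


def rval_without_grouping(old, new):
    """Compare old and new datapoints that share the same full key.

    old, new: dicts mapping a full key tuple to a value (hypothetical layout).
    Keys present on only one side are ignored here.
    """
    shared = sorted(old.keys() & new.keys())
    if len(shared) < 2:
        return float("nan")
    return pearson_r([old[k] for k in shared], [new[k] for k in shared])


# Keys mix country-level and global-level datapoints; no on_key is needed.
old = {("swe", 2000): 1.0, ("swe", 2001): 2.0,
       ("world", 2000): 10.0, ("world", 2001): 12.0}
new = {("swe", 2000): 1.0, ("swe", 2001): 2.0,
       ("world", 2000): 10.0, ("world", 2001): 12.0,
       ("nor", 2000): 5.0}  # new-only datapoint, skipped in the comparison

r = rval_without_grouping(old, new)
```

Because the pairing uses the whole key, it does not matter whether a file is keyed by geo or by global/region.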

I guess I grouped them by country and calculated the average rval to show the average diff across all countries. But I am not sure whether the average is a better indicator than the rval over all datapoints. Also, it seems that an average rval is not meaningful, see https://www.researchgate.net/post/average_of_Pearson_correlation_coefficient_values
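To illustrate how the two indicators can disagree, here is a small sketch with made-up numbers, again assuming rval is the Pearson correlation coefficient. The row layout and helper names are hypothetical, not the ddf_utils implementation.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant data
    return cov / math.sqrt(var_x * var_y)


# Made-up rows of (group, old_value, new_value).
rows = [
    ("a", 1.0, 1.0), ("a", 2.0, 2.0), ("a", 3.0, 3.0),              # unchanged
    ("b", 100.0, 300.0), ("b", 200.0, 200.0), ("b", 300.0, 100.0),  # reversed
]

# Indicator 1: rval per group, then averaged over the groups.
by_group = defaultdict(list)
for group, old_value, new_value in rows:
    by_group[group].append((old_value, new_value))
group_rvals = {
    group: pearson_r([o for o, _ in pairs], [v for _, v in pairs])
    for group, pairs in by_group.items()
}
average_rval = sum(group_rvals.values()) / len(group_rvals)

# Indicator 2: a single rval over all datapoints, with no grouping.
overall_rval = pearson_r([o for _, o, _ in rows], [v for _, _, v in rows])
```

Here group a is unchanged (rval 1) and group b is reversed (rval -1), so the averaged rval is 0, while the rval over all datapoints is about 0.49. Neither number alone says that one group matches exactly and the other is inverted.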

I suggest we remove the grouping for now and, if necessary, check with our statistician to see which indicators should be used.