Open ShukriChiu opened 2 years ago
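Hi @lwhite1 @ShukriChiu, it looks like the underlying method implemented by Apache does return an interpolated value. Apache's Percentile class enumerates 10 formulae in total (LEGACY plus R_1 through R_9) for calculating the index of the percentile. If we change the estimationType field from LEGACY (the default) to R_7, the returned interpolated value will be 4.82. Apache's implementation: https://github.com/apache/commons-math/blob/master/commons-math-legacy/src/main/java/org/apache/commons/math4/legacy/stat/descriptive/rank/Percentile.java#:~:text=public%20enum-,EstimationType,-%7B An intro from Wikipedia: https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample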
I would prefer adding a new method rather than changing the current one in a way that would change the result. That's a subtle and significant breaking change.
@lwhite1 Yes, I totally agree. Do we add another method with the estimation type R_7 hard-coded in it? Or do we give users a parameter to pick which estimation type they would like to use (R_1 through R_9)? I think the latter would require exposing Apache's EstimationType enum to users (i.e., users would have to pass EstimationType.R_1 as an argument if they want to use it).
I think the ideal approach might be to wrap the estimation types in a tablesaw enum.
Hi @lwhite1, I created a PR earlier this week that adds another percentile method, letting users choose a specific estimation type from a newly created tablesaw enum. However, after some further thought, do you think it would be better to just add 9 new percentile methods, one for each estimation type (R_1 through R_9)?
For example:

```java
/** Returns the given percentile of the values in the argument with R1 as estimation type */
public static Double percentileR1(NumericColumn<?> data, Double percentile) {
  return new Percentile()
      .withEstimationType(EstimationType.R_1)
      .evaluate(removeMissing(data), percentile);
}

/** Returns the given percentile of the values in the argument with R2 as estimation type */
public static Double percentileR2(NumericColumn<?> data, Double percentile) {
  return new Percentile()
      .withEstimationType(EstimationType.R_2)
      .evaluate(removeMissing(data), percentile);
}
...
```
This way, users don't need to pass something like a tablesaw Enum.R1 as an argument, since neither the new tablesaw enum nor Apache's EstimationType enum needs to be exposed to them.
For example, given the data (3.8, 4.5, 4.6, 4.7, 4.9), the tech.tablesaw.aggregate.AggregateFunctions.percentile function returns 4.9 for the 90th percentile. If the percentile function supported linear interpolation, the 90th percentile would be 4.82, which is the result given by most other programming languages.
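To make the 4.9 versus 4.82 difference concrete, here is a minimal, self-contained sketch of the two estimates computed by hand. `PercentileSketch` and its methods are hypothetical names for illustration only and are not part of tablesaw or Apache Commons Math. The R_7 position formula follows the Wikipedia quantile article, and the LEGACY-style formula is my reading of Apache's documented default, so treat both as assumptions rather than the library's exact code.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of tablesaw or Apache Commons Math. */
final class PercentileSketch {

    /** R_7 estimate: linear interpolation at the 1-based position h = (n - 1) * p + 1. */
    static double r7(double[] values, double percentile) {
        double[] sorted = sortedCopy(values);
        int n = sorted.length;
        double h = (n - 1) * (percentile / 100.0) + 1; // fractional 1-based rank
        int lower = (int) Math.floor(h);
        if (lower >= n) {
            return sorted[n - 1]; // the 100th percentile is the maximum
        }
        double fraction = h - lower;
        return sorted[lower - 1] + fraction * (sorted[lower] - sorted[lower - 1]);
    }

    /** LEGACY-style estimate (assumed): position p * (n + 1), clamped to the min and max. */
    static double legacy(double[] values, double percentile) {
        double[] sorted = sortedCopy(values);
        int n = sorted.length;
        double pos = (percentile / 100.0) * (n + 1);
        if (pos < 1) {
            return sorted[0];
        }
        if (pos >= n) {
            return sorted[n - 1]; // positions past the last element return the maximum
        }
        int lower = (int) Math.floor(pos);
        double fraction = pos - lower;
        return sorted[lower - 1] + fraction * (sorted[lower] - sorted[lower - 1]);
    }

    /** Returns a sorted copy so the caller's array is left untouched. */
    private static double[] sortedCopy(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted;
    }
}
```

With the data above and a percentile of 90, the R_7 position is h = 4 * 0.9 + 1 = 4.6, so the estimate is 4.7 + 0.6 * (4.9 - 4.7), which is approximately 4.82. The LEGACY-style position is 0.9 * 6 = 5.4, which is past the fifth element, so the maximum 4.9 is returned.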