hnolCol / instantclue

Instant Clue - Interactive Data Analysis
http://www.instantclue.de
GNU General Public License v3.0

Subtract by median unexpected behaviour #3

Closed pdcharles closed 6 years ago

pdcharles commented 6 years ago

The behaviour of the operation 'Row & column calculations -> Basic -> Substract by... -> Column Median' produces unexpected results on the Mac release of v0.4.9.

Reproduction: Using the "fixed acidity" column from "winequality-white.csv", perform two sets of operations (all within 'Row & column calculations -> Basic'):

A (divide-then-log):

  1. 'Divide by... -> Column Median'
  2. 'Logarithmic -> log2'

B (log-then-subtract):

  1. 'Logarithmic -> log2'
  2. 'Substract by... -> Column Median'

Expected outcome: the results of A and B should be identical. The medians of A and B should be 0.

Observed outcome: B ≈ A - 4.03
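The expectation follows from log2(x / m) = log2(x) - log2(m), plus the fact that log2 is monotonic, so the median of the logged values equals the log of the median whenever the median is an actual data point. A minimal sketch in plain Python, using made-up values rather than the wine data:

```python
import math
import statistics

# Made-up sample values (odd count, so the median is an actual data point).
values = [6.3, 7.0, 8.1, 7.2, 6.8]

# A: divide by the column median, then take log2.
median_raw = statistics.median(values)
a = [math.log2(v / median_raw) for v in values]

# B: take log2, then subtract the median of the logged column.
logged = [math.log2(v) for v in values]
median_logged = statistics.median(logged)
b = [x - median_logged for x in logged]

# Largest element-wise difference between the two sequences;
# it should be no larger than floating-point rounding error.
max_diff = max(abs(x - y) for x, y in zip(a, b))
```

With an even number of rows the median is an average of two middle values, so the two sequences can differ slightly; a constant offset of about 4 would not be explained by that.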

Scatter plot: image

When I perform the log-then-subtract sequence but manually input the median value (obtained from 'Summary Statistics -> 50%') using 'Substract by... -> Value' rather than 'Substract by... -> Column Median', the result is identical to A.

It looks like you might have a bug in analyze_data.py:1164 - you do not subset the list of median values before zipping with selectedColumns, so it will always pair the first value from selectedColumns with the median of the first column in the entire data frame.
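To illustrate the kind of mis-pairing described here, the following is a hypothetical sketch, not the actual code from analyze_data.py. The column names, the dict standing in for the data frame, and the variable names are all assumptions made for the example:

```python
import statistics

# Hypothetical stand-in for the data frame: column name -> values.
data = {
    "first": [1.0, 2.0, 3.0],
    "second": [10.0, 20.0, 30.0],
    "third": [100.0, 200.0, 300.0],
}
selected_columns = ["third"]  # only the last column is selected

# Medians computed for every column, in data-frame order.
all_medians = [statistics.median(col) for col in data.values()]

# Mis-pairing: zip stops at the shorter input, so "third" is paired
# with the median of "first" instead of its own median.
buggy = dict(zip(selected_columns, all_medians))

# One possible fix: compute medians only for the selected columns.
fixed = {name: statistics.median(data[name]) for name in selected_columns}

buggy_result = [v - buggy["third"] for v in data["third"]]
fixed_result = [v - fixed["third"] for v in data["third"]]
```

In this sketch the mis-paired version subtracts 2.0 (the median of "first") from every value, which shifts the whole column by a constant, the same symptom as the constant offset observed between A and B.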

P.S. 'Substract' is (typically considered) a typo, with the correct spelling being 'Subtract'.

hnolCol commented 6 years ago

Thank you very much for your feedback and pull request. You are absolutely right, and the typo report is correct as well.