x axis wrong? - Githubissues

tejelt commented 5 years ago

In the plot that Lena made on the plane, the lambda values of at least a couple of clusters seem to be wrong. Most noticeable are two clusters with Tx>10 which appear in the plot with lambda ~ 35, but in the catalog actually have lambda ~ 55. Noner2500_temperature-lambda.pdf

sweverett commented 4 years ago

@jjobel @paigemkelly now that we are moving on to the plotting functions, this is something to keep in mind. I haven't double-checked yet, but I assume this has to do with an issue with the pivot. For example, here's what plotting the simple output of the scaled data looks like:

[plot of the scaled data]

These lambdas don't make sense because it's really plotting ln(lambda) - pivot(x) = ln(lambda) - median(ln(lambda)). For the final plots I reverse this scaling in plotlib.py, but there may be a bug that does this incorrectly. Something to keep an eye out for.

sweverett commented 4 years ago

Actually, just looking at that it seems wrong. Isn't the pivot supposed to be the median of lambda, not the median of the scaled lambda @tejelt ?

sweverett commented 4 years ago

Ok now I'm unsure again. In my mind, the idea of the pivot is to choose the "center" of the data that you are fitting to minimize correlation between your fitted slope & intercept. In that case, it would seem that the choice of median of the scaled lambda is correct.
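A minimal sketch of why a pivot helps, assuming a plain unweighted least-squares line fit with np.polyfit (not the fitting method CluStR itself uses) and synthetic log-normal data that are purely illustrative. It compares the slope/intercept correlation with and without subtracting the median of x:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical, not from the catalog): x plays the role of ln(lambda)
x = rng.normal(4.0, 0.5, size=200)
y = 1.5 * x + 0.3 + rng.normal(0.0, 0.2, size=x.size)

def slope_intercept_corr(x, y):
    """Correlation coefficient between the fitted slope and intercept."""
    _, cov = np.polyfit(x, y, 1, cov=True)
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

corr_raw = slope_intercept_corr(x, y)                  # no pivot
corr_piv = slope_intercept_corr(x - np.median(x), y)   # pivot at the median of x

print(f"no pivot: {corr_raw:+.3f}   median pivot: {corr_piv:+.3f}")
```

Without a pivot the two parameters are almost perfectly anti-correlated, because the intercept is extrapolated far from the data. Centering on the median (or mean) of the fitted x removes most of that.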

tejelt commented 4 years ago

What do you mean by scaled lambda? You want ln(lambda/pivot) where pivot is the median lambda. The ln(median(lambda)) should be (roughly) the same as median(ln(lambda)). The plot axes are definitely weird above. Also not sure of the y-axis. This must be scaled by something.
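A quick check of the "roughly the same" statement, using made-up lambda values (an assumption, not catalog data). Since ln is monotonic, the two quantities agree for an odd number of points. For an even number, np.median averages the two middle values, so they differ slightly:

```python
import numpy as np

# Hypothetical lambda values, chosen only for illustration
lam_odd = np.array([12.0, 35.0, 55.0, 80.0, 140.0])
lam_even = np.array([12.0, 35.0, 55.0, 80.0, 140.0, 200.0])

def pivot_gap(lam):
    """ln(median(lambda)) minus median(ln(lambda))."""
    return np.log(np.median(lam)) - np.median(np.log(lam))

gap_odd = pivot_gap(lam_odd)    # middle element is the same in both spaces
gap_even = pivot_gap(lam_even)  # arithmetic vs geometric mean of the middle pair

print(gap_odd, gap_even)
```

For large samples the two middle values are close together, so the even-count gap becomes negligible.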

sweverett commented 4 years ago

Ignore the y, this was just a test plot the rewrite branch made to make sure it was running. Don't take the numbers seriously.

I went through the math and think I've discovered my confusion. In my mind, the whole point of a pivot is to shift the data distribution to the center to minimize the correlation between slope & intercept. So I visualized it like this:

y = m * (x' - x_0) + b

where x_0 is the pivot. With this definition, x_0 = pivot = med(x') = med(ln(lambda)), where x' = ln(lambda) is the lambda in the scaled space (what I meant by "scaled lambda"). However, I could not get this to work consistently with ln(lambda/pivot). It looks like what people instead do is the following:

x_0 = med(ln(lambda)) = ln(lambda_p) = ln(pivot)

where lambda_p is the lambda corresponding to the median in the scaled ln space, x'. So I think this all came down to me thinking that x_0 was the pivot (which is normally the convention when you're just dealing with a linear fit outside of any transformations), whereas here it is lambda_p. Does any of that make sense?

sweverett commented 4 years ago

Here's a shorter version of my argument, starting from the usual definition:

L = a * (lambda / pivot) ^ b
ln(L) = ln(a) + b*[ln(lambda) - ln(pivot)]
y = intercept + slope*(x' - x_0)
y = intercept + slope*x

Thus x_0 is not the pivot referenced by the usual equation, and so the pivot that clustr.py computes:

# Log-x before pivot
xlog = np.log(data.x)

# Set pivot
if piv_type == 'median':
    piv = np.median(xlog)

# Scale log_x by pivot
log_x = xlog - piv

is inconsistent with the usual equation.
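To make the difference in vocabulary concrete, here is a small sketch with made-up lambda values (an assumption, not catalog data). It follows the same steps as the snippet above and shows that subtracting the log-space pivot gives the same numbers as ln(lambda / lambda_p), as long as lambda_p = exp(piv). The names lam_p and log_x_usual are hypothetical and only used for this illustration:

```python
import numpy as np

# Hypothetical lambda values, chosen only for illustration
lam = np.array([12.0, 35.0, 55.0, 80.0, 140.0])

xlog = np.log(lam)
piv = np.median(xlog)    # log-space pivot, i.e. x_0 in the notation above
lam_p = np.exp(piv)      # pivot in the sense of L = a * (lambda / pivot) ^ b

log_x = xlog - piv                  # scaling as done in the snippet above
log_x_usual = np.log(lam / lam_p)   # scaling as written in the usual equation

print(lam_p)
print(np.allclose(log_x, log_x_usual))
```

So the two conventions describe the same scaled x values. They only disagree about which number gets called "the pivot": piv (log space) or lam_p (lambda space).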

sweverett commented 4 years ago

Now this difference in definition may not actually matter. Here is the unscale() function in the plotting code, which takes the data in the fitted (x,y) space to the original (lambda, L) space:

def unscale(x, y, x_err, y_err, x_piv):
    ''' Recover original data from fit-scaled data '''
    return (np.exp(x + x_piv), np.exp(y), x_err * x, y_err * y)

This transformation is completely consistent with my definition of x_piv from above, as:

ln(L) = ln(a) + b*[ln(lambda) - ln(lambda_p)]
-> y = intercept + slope * [ln(lambda) - pivot]
-> y = intercept + slope * x
-> lambda = e ^ (x + pivot)

So I don't see how the pivot would cause an incorrect lambda in the plots. But we can easily double check this with some tests.
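One possible shape for such a test, as a sketch only: it copies unscale() as quoted above, scales some made-up (lambda, L) values the way the clustr.py snippet does, and checks that the central values survive the round trip. The data and the zero-valued error arrays are placeholders, not catalog values.

```python
import numpy as np

def unscale(x, y, x_err, y_err, x_piv):
    ''' Recover original data from fit-scaled data (copied from the comment above) '''
    return (np.exp(x + x_piv), np.exp(y), x_err * x, y_err * y)

# Hypothetical catalog values, chosen only for illustration
lam = np.array([12.0, 35.0, 55.0, 80.0, 140.0])
L = np.array([0.8, 2.1, 4.0, 6.5, 13.0])

# Forward scaling, following the clustr.py snippet (log-space median pivot)
x_piv = np.median(np.log(lam))
x = np.log(lam) - x_piv
y = np.log(L)

# Error arrays are placeholders here; only the central values are checked
zeros = np.zeros_like(x)
lam_back, L_back, _, _ = unscale(x, y, zeros, zeros, x_piv)

print(np.allclose(lam_back, lam), np.allclose(L_back, L))
```

This only exercises the x and y terms. The returned error terms are not checked, since how they should transform depends on how x_err and y_err are defined in the pipeline, which this thread does not state.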

sweverett commented 4 years ago

The plot_scatter function takes the data from the loaded catalog and plots it directly with errorbar(); it doesn't even interact with any of the scaling or unscaling functions. So any bug in displayed lambda values would come from the catalog reading itself, which I find to be much less likely.

The place I was worried about this was if I was displaying lambda / lambda_piv incorrectly on the plots by using x_piv. However, it looks like I was lazy and just had it print out x / x_piv.

Thus, so far I don't see any bugs or inconsistencies other than vocabulary.

sweverett / CluStR

x axis wrong? #28