manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License

Passing only one weights array for cross-correlations of unequal sizes #180

lgarrison closed this issue 5 years ago

lgarrison commented 5 years ago

General information

Issue description

Passing only one weights array when doing a cross-correlation of particle sets of different sizes causes a RuntimeError.

The cause is that the Python wrappers create a uniform weights array for the second set of particles when only the first is provided, but that array is created with the size of the first particle set.

Minimal failing example

import numpy as np
from Corrfunc.theory.DD import DD

N_d = 10000
Lbox = 250.
N_r = N_d * 5

x_DM_jack = np.random.uniform(0., Lbox, N_d)
y_DM_jack = np.random.uniform(0., Lbox, N_d)
z_DM_jack = np.random.uniform(0., Lbox, N_d)

x_rand_jack = np.random.uniform(0., Lbox, N_r)
y_rand_jack = np.random.uniform(0., Lbox, N_r)
z_rand_jack = np.random.uniform(0., Lbox, N_r)

w_DM_jack = np.random.uniform(1., 2., N_d)

bins = np.logspace(-1., 1., 25)

autocorr = 0
results = DD(autocorr, nthreads=16, binfile=bins,
             X1=x_rand_jack, Y1=y_rand_jack, Z1=z_rand_jack,
             X2=x_DM_jack, Y2=y_DM_jack, Z2=z_DM_jack, weights2=w_DM_jack,
             boxsize=Lbox, periodic=False)
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
error: ERROR: the last dimension of `weights` must match the number of positions.  Instead found n_weights=10000, nx=50000

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-fac2f1ff5530> in <module>
     23                            X1=x_rand_jack, Y1=y_rand_jack, Z1=z_rand_jack,
     24                            X2=x_DM_jack, Y2=y_DM_jack, Z2=z_DM_jack,weights2=w_DM_jack,
---> 25                            boxsize=Lbox,periodic=False)

~/anaconda3/lib/python3.6/site-packages/Corrfunc-2.3.0-py3.6-linux-x86_64.egg/Corrfunc/theory/DD.py in DD(autocorr, nthreads, binfile, X1, Y1, Z1, weights1, periodic, X2, Y2, Z2, weights2, verbose, boxsize, output_ravg, xbin_refine_factor, ybin_refine_factor, zbin_refine_factor, max_cells_per_dim, enable_min_sep_opt, c_api_timer, isa, weight_type)
    246     if extn_results is None:
    247         msg = "RuntimeError occurred"
--> 248         raise RuntimeError(msg)
    249     else:
    250         extn_results, api_time = extn_results

RuntimeError: RuntimeError occurred

Thanks to @boryanah for finding this and providing the failing example!
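A possible stopgap until the wrapper is fixed (a sketch, not official guidance): pass an explicit unit-weights array for the unweighted catalog, so that each weights array matches the length of its own particle set.

```python
import numpy as np

N_d, N_r = 10000, 50000
w_DM_jack = np.random.uniform(1., 2., N_d)   # real weights for the data
w_rand_jack = np.ones(N_r)                   # explicit unit weights for the randoms

# Each weights array now matches its own catalog's length, so the wrapper's
# size check passes:
assert w_rand_jack.shape[0] == N_r and w_DM_jack.shape[0] == N_d

# The call would then look like (commented out; mirrors the example above):
# results = DD(autocorr, nthreads=16, binfile=bins,
#              X1=x_rand_jack, Y1=y_rand_jack, Z1=z_rand_jack, weights1=w_rand_jack,
#              X2=x_DM_jack, Y2=y_DM_jack, Z2=z_DM_jack, weights2=w_DM_jack,
#              boxsize=Lbox, periodic=False, weight_type='pair_product')
```

With pair_product weighting, unit weights on one side reduce each pair weight to the other particle's weight, so this should reproduce the intended broadcast behavior.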

manodeep commented 5 years ago

What is the expected behaviour here? If there was only a scalar passed, then I can understand numpy-style dimension broadcasting. But if there is only one weights array passed, then we should raise a ValueError and request that the user pass another weights array of shape (num_weights, ND2) with the values all set to 1.0 for PAIR_PRODUCT (or whatever would be equivalent for their weight type for custom weighting schemes). The num_weights for the second weights array should match that of the first array.

lgarrison commented 5 years ago

The expected behavior is that the Python wrapper automatically provides such an array. Indeed, this is the current documented behavior (but only works as intended when the two sets are of equal size).

I think it's sensible to raise a ValueError if the weighting scheme is not pair_product.


manodeep commented 5 years ago

I am not sure how to automatically provide the array for the second set of particles. Say, the weights1 is full of uniform random numbers - how do you generate an equivalent array for the second dataset? Are you saying to generate an array of the correct shape only when weights1 contains only 1 element (i.e., weights1 is a scalar)?

lgarrison commented 5 years ago

I just meant generate an array filled with 1s.


boryanah commented 5 years ago

Hi!

A related question: I am writing my own estimator to convert DD, DR, and RR (for non-periodic boundary conditions) into a correlation function, and I wonder whether there are any normalization factors when weights are involved. I thought there shouldn't be, but without them I don't seem to get the right answer. Such factors also don't seem to be applied in convert_3d_counts_to_cf(), though I may be wrong.

Thank you for helping.

Best, Boryana


manodeep commented 5 years ago

@lgarrison Ahh okay. So a matrix of the correct shape, filled with 1's, for the PAIR_PRODUCT case, and a ValueError saying "Please supply weights" for all other weighting schemes...
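A minimal sketch of what that wrapper-side check might look like (the function and argument names here are hypothetical, not Corrfunc's actual internals):

```python
import numpy as np

def fill_missing_weights(weights1, weights2, ND2, weight_type):
    """Hypothetical helper: if only weights1 was supplied, broadcast unit
    weights for the second particle set under pair_product, else raise."""
    if weights1 is not None and weights2 is None:
        if weight_type == 'pair_product':
            # num_weights rows, one column per particle in set 2
            num_weights = np.atleast_2d(weights1).shape[0]
            weights2 = np.ones((num_weights, ND2))
        else:
            raise ValueError("Please supply weights2 of shape "
                             "(num_weights, ND2) for weight_type=%r"
                             % (weight_type,))
    return weights2

w2 = fill_missing_weights(np.random.rand(100), None, 250, 'pair_product')
assert w2.shape == (1, 250) and np.all(w2 == 1.0)
```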

@boryanah Yeah - we really don't have a utility to compute the weighted correlation function (or even to check that weights were used and warn the user). This has been open for a while now as issue #139

boryanah commented 5 years ago

Thanks for the reference!

I am really sorry if this is very obvious, but which pair counts should be multiplied by the average weight before being fed to the convert function, in the case of different weights for R and D and pair_product weighting? I tried:

    cf = convert_3d_counts_to_cf(N_d, N_d, N_r, N_r,
                                 DD * w_avg,
                                 DR * w_avg,
                                 DR * w_avg,
                                 RR)

where w_avg is the mean weight of the data (the mean weight of the randoms is 1), but it didn't give me the right answer.


manodeep commented 5 years ago

@boryanah You first have to compute the average weight of dataset 1 (and of dataset 2 for cross-correlations) as avgweight := sum(weights)/N_particles. In each radial bin, the total sum of weights is weightsum := weightavg * num_pairs_in_that_bin (where weightavg is itself an average of the products of the two weights). Now you have to figure out what the weightsum would have been if the points were randomly distributed - so multiply the number of points in one catalog by the density, volume, and average weight of the other catalog.
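The recipe above can be sketched numerically (variable names are illustrative; weightavg stands for the per-bin average of the pair products that DD returns):

```python
import numpy as np

def weighted_normed_counts(npairs, weightavg, w1, w2):
    """Sketch: turn the per-bin average pair weight into a weight sum,
    then normalize by the expected random-case scale, which goes as
    N1 * N2 * mean(w1) * mean(w2)."""
    weightsum = weightavg * npairs                    # total weight per bin
    norm = len(w1) * len(w2) * w1.mean() * w2.mean()  # random expectation scale
    return weightsum / norm

# toy check: with unit weights this reduces to npairs / (N1 * N2),
# the usual unweighted normalization
npairs = np.array([10., 20.])
w1, w2 = np.ones(5), np.ones(8)
out = weighted_normed_counts(npairs, np.ones(2), w1, w2)
assert np.allclose(out, npairs / (5 * 8))
```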

@lgarrison One thing I realise is that our rpavg is currently completely unweighted. We should probably also have an option, say return_weighted_rpavg, where the math would be rp += weight*r within the kernels (instead of rp += r), divided by weightavg * num_pairs_in_bin at the end. (I have a vague memory of us discussing this previously and deciding against weighted rpavg for backwards compatibility.)
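A toy illustration of the proposed weighted rpavg (plain numpy, not kernel code; the arrays stand in for the pairs falling into a single bin):

```python
import numpy as np

r = np.array([0.5, 0.7, 0.9])   # pair separations in one bin
w = np.array([2.0, 1.0, 1.0])   # pair weights (w_i * w_j under pair_product)

unweighted_rpavg = r.mean()               # current behavior: plain mean
weighted_rpavg = (w * r).sum() / w.sum()  # proposed: sum(w*r) / sum(w)

# the heavily weighted close pair pulls the average inward
assert weighted_rpavg < unweighted_rpavg
assert np.isclose(weighted_rpavg, (2 * 0.5 + 0.7 + 0.9) / 4.0)
```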

manodeep commented 5 years ago

@boryanah And then use the different weightsum for each DD, DR, and RR and combine with Landy-Szalay to get the weighted average. @lgarrison Will you please check that I have not omitted steps (or completely mis-stated) ? :)
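For completeness, the final combination step would look like this (a sketch; dd_norm etc. denote the per-bin weighted counts, each divided by its expected random weight sum as described in the previous comment):

```python
import numpy as np

def landy_szalay(dd_norm, dr_norm, rr_norm):
    """Landy-Szalay estimator, xi = (DD - 2*DR + RR) / RR,
    applied to already-normalized (weighted) pair counts."""
    return (dd_norm - 2.0 * dr_norm + rr_norm) / rr_norm

# sanity check: identical normalized counts give xi == 0 (no clustering)
x = np.array([0.1, 0.2, 0.3])
assert np.allclose(landy_szalay(x, x, x), 0.0)
```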

lgarrison commented 5 years ago

That sounds right to me! It seems like we should be returning weightsum instead of weightavg; it was a poor choice on my part. I guess it couldn't hurt to add the latter to the results struct.

And good point about rpavg. One option would be to return both, although that would be extra work in the kernels.


boryanah commented 5 years ago

Dear Manodeep,

Thanks a lot! This is what I understood (I think I got 2 wrong)

Say 1 is Data; 2 is Random. We are trying to compute the factor that should multiply RD before it gets into the LS estimator. This factor will be F

1) N1,2 = number of particles; Np1,2 = number of pairs per bin (vector); w_avg1,2 = "weight average" as returned by the DD function when computing DD(XYZ1, XYZ2, weights1, weights2) (also a vector); sum1,2 = w_avg1,2 * Np1,2 (vector), the sum of the products of weights in a given bin.

2) We also have mean weights: w_mean1,2 = sum(weights1,2)/N1,2. To compute the sum of weights in the random case for dataset 1 with weights for 2, we do sum_rand1 = N1 * [N2/Vol * Vol * w_mean2] = N1 * N2 * w_mean2 (I think I misunderstood this; maybe it is w_mean1 * w_mean2 * N1 * N2 for the same volume?)

3) Here you say to multiply the sum of weights per bin, sum1, by what would have been the sum of weights for the catalog if random, which is sum_rand1, so F1 = sum_rand1 * sum1.

4) We substitute DR1 and DR2 in LS (or convert_3d_counts_to_cf) with DR1 * F1 and DR2 * F2.

This doesn't seem symmetric, so I guess this is DR1 and there is also a DR2 that should go into the final estimator.

Best, Boryana


boryanah commented 5 years ago

I suspect that for 2) it is sum_rand1 = N1 * [N2/Vol * Vol_bin * w_mean2], and it is thus a vector, where Vol is the total volume and Vol_bin is the volume of the bin only,

and for 3) maybe we want F1 = sum_rand1 / sum1.

I am really sorry for bothering you.


manodeep commented 5 years ago

Hey @boryanah - no worries at all. Unfortunately, I have a deadline right now but if I have not responded by Monday, please ping me again!

And thank you for catching the bug in the first place!

boryanah commented 5 years ago

Thanks a lot! Good luck!! I appreciate it!


manodeep commented 5 years ago

@boryanah Thanks for your patience. Would it be possible to add in some lines of code to demonstrate what you are doing?

boryanah commented 5 years ago

Hi,

I will try to write it up soon, but I have switched to something else for the moment!

Best, Boryana

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/manodeep/Corrfunc/issues/180?email_source=notifications&email_token=AKJFGGBM2D3QMQ6QLKRI7F3PWNQKVA5CNFSM4HNPNDWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODV2TLWI#issuecomment-494220761, or mute the thread https://github.com/notifications/unsubscribe-auth/AKJFGGFCJU7JSTMNGEJHUU3PWNQKVANCNFSM4HNPNDWA .