xcf covariance matrix of transmission field is not positive definite

londumas commented 6 years ago

Running do_xcf.py on transmission files: d_i(lambda) = T_i(lambda)/<T(lambda)>-1 for enough forests: 500,000, I get a covariance matrix that is Warning: Matrix is not positive definite. This is not true for do_cf.py where with already 100,000 forests I get a covariance matrix that is positive definite. This issue does not appear when the mocks are fully expanded by quickquasar.

The following plot compares the covariance matrix from transmission (Lya) and from the full expanded mocks (Lya+Continuum) for the diagonal and the off-diagonal terms, for the cross and the auto-correlation. These plots show precisely an issue for the cross-correlation, and not for the auto-correlation:

the diagonal terms of the "Lya" mocks are stronger than for the "Lya+Continuum", this is abnormal, it should be the contrary
the correlation between pixels doesn't drop at large separation.

Conclusion:

The correlation in each sub-sample might not go to zero. I will have a look.

xcf_covariance_matrix_covar

xcf_covariance_matrix_corr

cf_covariance_matrix_covar

cf_covariance_matrix_corr

londumas commented 6 years ago

If I leave the projection in the computation of the correlation function, I get a positive definite matrix:

xcf_covariance_matrix_corr_projected

xcf_covariance_matrix_covar_projected

londumas commented 6 years ago

Sadly, the reason is linked to something I was expecting: the mean of delta in each sub-sampling is not random. To investigate this I ran the regular xcf on the transmission files, without projection for the catalog of quasar and the random catalog.

the first plot gives the mean all the bins of the cross-correlation in each sub-sample (1214 points): in the randoms versus in the data. We can see the correlation between the mean delta
the second plot gives the correlation-matrix for "data", "random" and "data-random". We can see that the "data-random" is what we are used to and is positive definite
the third plot gives a part of the correlation-matrix for "data"
the fourth plot gives a part of the correlation-matrix for "data-random"

The reason might be that: xi = \<delta eta> - \<delta>\<eta> Basically in do_xcf.py, we compute only xi = \<delta eta>, the random acts as computing \<delta>\<eta>. What do you think @ngbusca?

correlation_rand_data

correlation_xcf_xcfrand_xcfminus

correlationmatrix_xcf_trans

correlationmatrix_xcfminus_trans

ngbusca commented 6 years ago

@londumas what happens if you do the projection on the transmission field? Is it still not positive definite?

londumas commented 6 years ago

@ngbusca, With the project on the transmission filed, I get posititive definite matrices, with the correlation matrix we are used to, see comment https://github.com/igmhub/picca/issues/431#issuecomment-390805569

ngbusca commented 6 years ago

I see, thanks!

londumas commented 6 years ago

Here is the same plot as the first one but for xcf calculated on projected transmission, the correlation between random and data has totally disappeared and the values of \<delta> are way smaller.

correlation_rand_data_projected

andreufont commented 6 years ago

This thread is really interesting. Thanks Helion!

We should think more about different estimators, and test them in mocks. Having picca is great, but it does make it harder to do code comparisons and test new estimators... In the past we had some discussion on whether we should use quasar randoms in the cross-correlation or not, and whether that made estimators more optimal.

I still think that what we do now, i.e., stack as a function of quasar position is a fine thing to do, but it is possible that it makes the covariance matrix more difficult to handle...

By the way, what happened to the Wick covariances? I understand that you are currently computing the covarinces from sub-sampling, right? Are we still able to compute covariances based on 4-point functions? It would be interseting to see whether those are also pathological.

ngbusca commented 6 years ago

@andreufont the Wick covariance code is there but we haven't used it in a while. I plan to take a look at that again for our final round of eBOSS papers.

andreufont commented 6 years ago

Sounds great. Thanks!

andreufont commented 5 years ago

@londumas - The issues that @jfarr03 and I are having is that we have non-positive definite matrices when trying to use picca to measure the correlation function of the Gaussian field, the log-normal density field, the optical depth or the redshift-space optical depth.

It would be really interesting to run these, since it would allow us to do non-trivial tests. For instance:

in the case of the log-normal density field we should measure (b_delta = 1, b_eta = 0)
in the case of the non-RSD optical depth we should measure (b_delta = 1.6, b_eta = 0)
in the case of the RSD optical depth we should measure (b_delta = 1.6, b_eta = 1)

londumas commented 5 years ago

@andreufont and @jfarr03, do you have that only for the cross or also for the auto? On the raw transmission files I don't get it for the auto, only the cross has issues.

londumas commented 5 years ago

The following message is an update of the previously posted ones, using the latest simulations. The plots show the following:

the first plot show the covariance matrix of the cross-correlation, we can see that it is very non-diagonal for both the catalog and a random shuffle of the redshift. The covariance is not positive definite.
the second plot shows the correlation of \<xi> in each healpix pixel of the correlation. Each point is one healpix and the value is the mean of the cross in this healpix. This plots shows that the mean delta is not-zero in each healpix pixels. (This bring back bad memories from my thesis.). What is nice is that using a shuffle of the position of the quasar catches this correlation.
The third plot gives what we get if we remove the mean delta using the shuffled catalog from the true cross-correlation function. The covariance is diagonal and is positive definite.

correlation_matrix_d_shuffle

correlated_delta

new_covariance_matrix

Sending the following:

picca_xcf.py --drq /project/projectdirs/desi/mocks/lya_forest//london//v6.0/v6.0.0/eboss-0.0/zcat_desi_drq.fits --in-dir /project/projectdirs/desi/mocks/lya_forest/picca/london//v6.0/v6.0.0/eboss-raw/deltas/ --z-evol-obj 1.44 --out /project/projectdirs/desi/mocks/lya_forest/picca/london//v6.0/v6.0.0/eboss-raw/tests_no_projection/xcf_z_0_10.fits --nproc 32 --fid-Om 0.3147 --no-project --nspec 100000

picca_export.py --data /project/projectdirs/desi/mocks/lya_forest/picca/london//v6.0/v6.0.0/eboss-raw/tests_no_projection/xcf_z_0_10.fits --out /project/projectdirs/desi/mocks/lya_forest/picca/london//v6.0/v6.0.0/eboss-raw/tests_no_projection/xcf_z_0_10-exp.fits

picca_xcf.py --drq /project/projectdirs/desi/mocks/lya_forest//london//v6.0/v6.0.0/eboss-0.0/zcat_desi_drq.fits --in-dir /project/projectdirs/desi/mocks/lya_forest/picca/london//v6.0/v6.0.0/eboss-raw/deltas/ --z-evol-obj 1.44 --out /project/projectdirs/desi/mocks/lya_forest/picca/london//v6.0/v6.0.0/eboss-raw/tests_no_projection/xcf_z_0_10_shuffle_42.fits --nproc 32 --fid-Om 0.3147 --no-project --nspec 100000 --shuffle-distrib-obj-seed 42

picca_export.py --data /project/projectdirs/desi/mocks/lya_forest/picca/london//v6.0/v6.0.0/eboss-raw/tests_no_projection/xcf_z_0_10_shuffle_42.fits --out /project/projectdirs/desi/mocks/lya_forest/picca/london//v6.0/v6.0.0/eboss-raw/tests_no_projection/xcf_z_0_10_shuffle_42-exp.fits

andreufont commented 5 years ago

Hi @londumas - I remember dicussing about randoms in cross-correlation a while ago (actually, this was first discussed by Ross O'Connell back in 2012), it might be time to properly look at this with mocks and see what is the statistical gain / loss when you take into account the randoms or not. It looks like it helps with the covariance matrix (what is great!) but I'm curious to see the end effect on BAO uncertainties. It might be worth a paper, don't you think? I could ask a student to work on this.

londumas commented 5 years ago

@andreufont, yes a study with a student would be good. Maybe we should change our estimator.

statistical gain / loss when you take into account the randoms or not.

However there can't be any loss of statistical power because you can produce many realization of the random shuffles and get a near noiseless stack. I bet that 2 or 4 already are embedded into the real catalog errors.

londumas commented 5 years ago

This is fixed by using the random shuffle, see the following plot with 10 London mock realizations: raw and raw+continuum. Fixed via PR https://github.com/igmhub/picca/pull/521

igmhub / picca

xcf covariance matrix of transmission field is not positive definite #431