msimet / Stile

Stile: the Systematics Tests In Lensing pipeline
BSD 3-Clause "New" or "Revised" License
9 stars 6 forks source link

#18 Scatter plots #29

Closed HironaoMiyatake closed 9 years ago

HironaoMiyatake commented 10 years ago

This pull request is for adding a feature to generate scatter plots. It includes the following things.

1) A ScatterPlotSysTest class in sys_tests.py, which is a base class inherited by specific scatter plot classes for three quantitates for test (g1, g2, or sigma) times two quantities for y-axis (a star quantity or a residual quantity), in total 6 classes whose names are ScatterPlot[Star, Residual]VsPSF[G1, G2, Sigma]SysTest.

2) An interface to the HSC pipeline is added in hsc/sys_test_adapters.py. Each test can be run by adding ScatterPlotStarVsPSFG1, ScatterPlotStarVsPSFG2, ScatterPlotStarVsPSFSigma, ScatterPlotResidualVsPSFG1, ScatterPlotResidualVsPSFG2, or ScatterPlotResidualVsPSFSigma to sys_test in the config. Examples of a command line are as follows

StileCCD.py $SUPRIME_DATA_DIR --id visit=1228 ccd=49 --rerun [rerun_name]:[stile_rerun_name] --clobber-config -c "sys_tests.names=['ScatterPlotStarVsPSFG1', 'ScatterPlotResidualVsPSFG1']"
StileVisit.py $SUPRIME_DATA_DIR --id visit=1228 ccd=49^50 --rerun [rerun_name]:[stile_rerun_name] --clobber-config -c "sys_tests.names=['ScatterPlotStarVsPSFG1', 'ScatterPlotResidualVsPSFG1']"

. For multiple CCDs, mean and std is calculated for each CCD, and linear regression is performed on these quantities. One can disable this feature by adding an option scatterplot_per_ccd=False. I found that sometimes linear regression on the mean and std for each CCD and that on all stars (without using the mean and std) is significantly different. Which should we believe? I guess the latter, but I would like to hear what you think.

rmandelb commented 10 years ago

Hi Hironao -

I will go over the code, but first I have two questions/comments:

I found that sometimes linear regression on the mean and std for each CCD and that on all stars (without using the mean and std) is significantly different. Which should we believe? I guess the latter, but I would like to hear what you think.

I am guessing that what is going on has to do with outliers. Imagine if you have 1 weird star that shouldn't be in your sample (maybe it's a binary star?). When you do a simple mean over the stars on each CCD, the results will be very messed up for the CCD with that star, but it will work okay for the other CCDs. Then when you do the linear regression on the per-CCD means, the effect of that outlier has an influence that goes like 1/(number of CCDs). In contrast, if you do the linear regression over all stars, the impact of that outlier goes like 1/(number of stars), which should be smaller.

If it has something to do with outliers, then one thing that might help is to use the median instead of the mean.

HironaoMiyatake commented 10 years ago

Am I right in thinking that the way these scatter plot tests are set up is quite specific? What if we want to include information about other quantities, or mix-and-match? (example: PSF ellipticity residual vs. PSF size, which mixes ellipticity and size; or, PSF size residual vs. magnitude, which might be useful for studying the bright/fatter effect)

This is now being discussed in #30 .

To answer your question:

I found that sometimes linear regression on the mean and std for each CCD and that on all stars (without using the mean and std) is significantly different. Which should we believe? I guess the latter, but I would like to hear what you think. I am guessing that what is going on has to do with outliers. Imagine if you have 1 weird star that shouldn't be in your sample (maybe it's a binary star?). When you do a simple mean over the stars on each CCD, the results will be very messed up for the CCD with that star, but it will work okay for the other CCDs. Then when you do the linear regression on the per-CCD means, the effect of that outlier has an influence that goes like 1/(number of CCDs). In contrast, if you do the linear regression over all stars, the impact of that outlier goes like 1/(number of stars), which should be smaller.

If it has something to do with outliers, then one thing that might help is to use the median instead of the mean.

Okay, I look into if median works or not. This is using the HSC data, and we might not be able to avoid sending plots, so let's discuss through email, rather than through the public gitHub website.

HironaoMiyatake commented 10 years ago

I went through your comments except for the two above.

HironaoMiyatake commented 10 years ago

I went through the comment. The open issues are

If you agree that what I did or suggested is okay, I would like to merge this branch to master.

rmandelb commented 10 years ago

Hi Hironao - I think your solution for median vs. mean (median as default but user can override this choice) and your choice to set xlim and ylim is fine.

msimet commented 10 years ago

Looks good to me, too, and I'm fine with a merge (but remember to add the tests as defaults to base_tasks.py!)

HironaoMiyatake commented 10 years ago

I was about to merge, but I found an issue.

g1, g2, and sigma are calculated either in pixel coordinates or sky coordinates, depending on if there is a required column 'x' or 'y', which means pixel coordinates, in a certain test. In hsc/base_tasks.py, columns that already calculated are not updated when doing another test. In particular, if whisker plots, which requires pixel coordinates, are generated and then scatter plots, which requires sky coordinates, are generated, g1, g2, and sigma are not updated, and then scatter plots use there quantities in pixel coordinates, not sky coordinates. How should we fix this? I think the easiest fix is using column name (g1_sky and g1_pixel, etc...), but it is not clean...

msimet commented 10 years ago

Hi Hironao,

You're right. I think I implemented a fix for this (commit https://github.com/msimet/Stile/commit/1fad0fde3403b5ac648f1f6b03814842ec11b70f plus bugfixes in https://github.com/msimet/Stile/commit/db7c507dc4c8ab3670af6bbd3b8916b4a96f21a3) but would welcome input.

HironaoMiyatake commented 10 years ago

Hmm, I got the following error

Traceback (most recent call last):
  File "/tigress/HSC/products-20130212/Linux64/pipe_base/HSC-3.0.0/python/lsst/pipe/base/cmdLineTask.py", line 267, in __call__
    result = task.run(dataRef, **kwargs)
  File "/tigress/HSC/users/miyatake/Stile/stile/hsc/base_tasks.py", line 191, in run
    results = sys_test(self.config, *new_catalogs)
  File "/tigress/HSC/users/miyatake/Stile/stile/hsc/sys_test_adapters.py", line 310, in __call__
    return self.sys_test(*new_data, per_ccd_stat = per_ccd_stat)
  File "/tigress/HSC/users/miyatake/Stile/stile/sys_tests.py", line 1019, in __call__
    psf_g1, g1, g1_err = array['psf_g1'], array['g1'], array['g1_err']
ValueError: field named g1_err not found

. Also I need sky and chip quantities for sigma, sigma_err, and psf_sigma. Could you add these?

msimet commented 10 years ago

Hm. I will look into that error.

Are sigma, sigma_err, and psf_sigma different for sky and chip coordinates? They're g^2 quantities, so I wouldn't think the orientation of the shears matters. Are they normalized differently in the different systems?

HironaoMiyatake commented 10 years ago

Thanks! The sigma family has units, so it should be taken care of properly depending on if you are working on sky or chip coordinates.

msimet commented 10 years ago

Hey Hironao, sorry for the delay, but I believe I have fixed this. Probably the new code could use a look-over by somebody else though (or multiple somebody elses).

HironaoMiyatake commented 10 years ago

Thanks! I fixed some bugs. Please check if you are happy with this fix.

msimet commented 10 years ago

Yes, all of those edits look fine to me.

HironaoMiyatake commented 10 years ago

Now I am happy with the codes! Do we wait for somebody to review the code, or shall we merge this branch into master?

rmandelb commented 10 years ago

I was going to look at it this weekend.

msimet commented 10 years ago

Cool, that would be great!

HironaoMiyatake commented 10 years ago

Great! Thanks!