Feature request: create Catalog using derived quantities

theoschutt commented 1 year ago

As we discussed earlier, this is a feature request for creating a Catalog with quantities derived from the input catalog. This would save the user from having to make a separate catalog in advance with the derived quantities as columns.

One simple use case is subtracting the mean from an input column before running the correlation. This alone as a boolean flag would be useful. The more general use case would be allowing any operation on one or more columns. For example, G1_DATA - G1_MODEL or (G1_DATA - G1_MODEL)*(T_DATA - T_MODEL)/T_DATA, as we need to do for rho and tau statistics.
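For concreteness, here is a minimal sketch of the manual step this request would replace: computing the derived columns with NumPy before building the catalog. The `cols` dict and its values are made up for illustration; in practice the arrays would come from reading the input file.

```python
import numpy as np

# Illustrative stand-ins for columns read from the input catalog.
cols = {
    "G1_DATA": np.array([0.02, -0.01, 0.03]),
    "G1_MODEL": np.array([0.01, -0.02, 0.01]),
    "T_DATA": np.array([0.50, 0.40, 0.80]),
    "T_MODEL": np.array([0.45, 0.44, 0.76]),
}

# Simple case: subtract the mean from an input column.
g1_centered = cols["G1_DATA"] - cols["G1_DATA"].mean()

# General case 1: G1_DATA - G1_MODEL
dg1 = cols["G1_DATA"] - cols["G1_MODEL"]

# General case 2: (G1_DATA - G1_MODEL)*(T_DATA - T_MODEL)/T_DATA
dg1_dT = dg1 * (cols["T_DATA"] - cols["T_MODEL"]) / cols["T_DATA"]
```

Arrays like these then have to be either written to a new file or passed to the Catalog as in-memory arrays, which is the extra step the request aims to remove.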

Something like allowing:

func1 = lambda cols: cols['G1_DATA'] - cols['G1_MODEL']
func2 = lambda cols: cols['G2_DATA'] - cols['G2_MODEL']
qcat = treecorr.Catalog(filename="catalog.fits", g1_func=func1, g2_func=func2, ...)

where the column names used in the functions (G1_DATA, G1_MODEL, etc.) must be columns in catalog.fits. I'll keep thinking about it, and I'm happy to help out with implementing!

rmjarvis commented 1 year ago

Counter-proposal:

extra_cols = ['G1_DATA', 'G2_DATA', 'G1_MODEL', 'G2_MODEL']
qcat = treecorr.Catalog(filename="catalog.fits", g1_eval='G1_DATA-G1_MODEL', g2_eval='G2_DATA-G2_MODEL', ...)

The issue is that TreeCorr's I/O currently wants to know the names of all the columns to read in at the start. It only reads the columns it will actually use, mostly in case the input catalog has tons of columns. (fitsio and hdf5 can both be efficient at this.) I think this approach could be made to work within that constraint. We'd add these extra column names to the all_cols list, and then those variables would exist for the evals to use.
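A rough sketch of how that could work, assuming the requested columns are loaded into a namespace and the string is passed to Python's `eval`. This is not TreeCorr's actual implementation: `eval_column` and the `table` dict standing in for the file are hypothetical names used only for illustration.

```python
import numpy as np

def eval_column(expr, table, extra_cols):
    """Evaluate expr with each name in extra_cols bound to that column.

    Hypothetical helper: table is a dict standing in for the input file.
    """
    missing = [name for name in extra_cols if name not in table]
    if missing:
        raise KeyError(f"columns not found in input: {missing}")
    # Only the requested columns are loaded; all others are never touched.
    namespace = {name: np.asarray(table[name]) for name in extra_cols}
    namespace["np"] = np  # allow expressions such as np.mean(G1_DATA)
    # An empty __builtins__ limits what the string can reach, but eval is
    # still not safe for untrusted input.
    return eval(expr, {"__builtins__": {}}, namespace)

table = {
    "G1_DATA": [0.02, -0.01, 0.03],
    "G1_MODEL": [0.01, -0.02, 0.01],
    "UNUSED": [1, 2, 3],
}
extra_cols = ["G1_DATA", "G1_MODEL"]
g1 = eval_column("G1_DATA-G1_MODEL", table, extra_cols)
g1_centered = eval_column("G1_DATA - np.mean(G1_DATA)", table, extra_cols)
```

The point of listing the column names up front is that the reader knows exactly what to load before any expression is evaluated. A name that was not listed, such as UNUSED here, is simply not defined inside the expression.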

theoschutt commented 1 year ago

I see. This makes sense to me!

rmjarvis commented 7 months ago

Done on #173

rmjarvis / TreeCorr

Feature request: create Catalog using derived quantities #151