Update Picca for DESI: modify use of PLATE in many pieces of code

andreufont commented 4 years ago

Picca can now work with simulated DESI data, and @corentinravoux has added a branch to work with DESI mini-sv. However, we can only do that by hacking code, since Picca requires us to create a fake value of PLATE, and probably other SDSS-based concepts that should only be optional.

Now that the eBOSS analysis is finished, we should reconsider how to allow for a more natural way of working with both SDSS and DESI data. However, we should always make sure that we do not break any backward compatibility, since SDSS data will still be very relevant for years to come!

andreufont commented 4 years ago

One option here is to use "observation" and "spectrograph" instead of "plate" and "fiberid", since this is really what we care about in Picca. We were only storing "plate" and "fiberid" in order to figure out later on whether two pixels belonged to the same "half-plate", i.e., the same spectrograph in the same observation. This is important to make study correlated noise in the spectra, coming from sky subtraction or flux calibration errors.

"observation" could be a string, so that when using SDSS we can store PLATE-MJD, and when using DESI we can store TILE-NIGHT, or whatever. And then "spectrograph" could be an integer or a string, either "1" or "2" for SDSS, or 0-9 for DESI, whatever. We could discuss this with the data team, may be @julienguy has good inputs here.

julianbautista commented 4 years ago

This is my idea I started implementing on #745

1) Quasar catalog: read it with astropy.table.Table.read(), keeping the columns that are useful for us (easy to change), and pass it to the functions that read the spectra. Each of these functions will pick the metadata they need from the catalog, e.g., a read_from_desi will search for TARGETID, TILEID, PETAL_LOC, etc.

2) Saving this metadata as a dictionary in the class Forest. When reading and coadding data from multiple observations, one could add TILEID_0, TILEID_1, PETAL_LOC_0, PETAL_LOC_1, etc to the dictionary. This can be easily saved in the delta files as a binary table or new header keywords. This would change the constructor from

def __init__(self,
                 log_lambda,
                 flux,
                 ivar,
                 thingid,
                 ra,
                 dec,
                 z_qso,
                 plate,
                 mjd,
                 fiberid,
                 exposures_diff=None,
                 reso=None,
                 mean_expected_flux_frac=None,
                 abs_igm="LYA")

to

def __init__(self,
                 log_lambda,
                 flux,
                 ivar,
                 metadata,
                 exposures_diff=None,
                 reso=None,
                 mean_expected_flux_frac=None,
                 abs_igm="LYA"):

3) For the correlation functions, one can again easily read this metadata from the delta files and store them back as dictionaries. Other than RA, DEC, Z, the other metadata would only be used if one wants to exclude pairs of same wavelength/spectrograph/observation, in which case one option would be added.

andreufont commented 4 years ago

Thanks @julianbautista, I think this is very helpful.

I like the idea of hiding survey-specific information (fiber, plate, mjd...) in a metadata dictionary, specially if these are quantities that are not required for the main analyses.

If there are quantities that are required by most analyses (like RA, DEC or Z_QSO), it might make more sense for these to be passed explicitly in the constructor, right?

julianbautista commented 4 years ago

Yes for the Delta class used in the correlation function.

For the continuum fitting we only need Z_QSO to cut the forests in the good ranges and to compute rest-frame wavelengths. We can pass it directly.

andreufont commented 4 years ago

For the Forest object, we also need always RA, Dec because we always need to write this in the Delta files.

I think we should use metadata (or similar name) to collect information that is not always necessary, only when running with particular options that are not default.

julianbautista commented 4 years ago

Yes, but we will write everything in metadata anyway :)

On Wed, 30 Sep 2020 at 16:36, Andreu Font-Ribera notifications@github.com wrote:

For the Forest object, we also need always RA, Dec because we always need to write this in the Delta files.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/igmhub/picca/issues/700#issuecomment-701470238, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFPMWSYF2HHLEJG653XXY73SINGA3ANCNFSM4MITMYKA .

andreufont commented 4 years ago

DESI might have TARGET_RA while SDSS has RA or similar, so if there is a key piece of information that we need to have, by passing it as metadata we make it harder to access later on. If we pass it in the constructor of Forest, then we can store Forest.RA and it would work for all surveys.

julianbautista commented 4 years ago

Yes, in my branch I modify the column TARGET_RA and TARGET_DEC to be simply RA and DEC.

andreufont commented 4 years ago

You mean you do not pass the metadata from the original quasar catalog, but create your own version of the metadata?

I think it would be cleaner to either store the original metadata (using SDSS names or DESI names), or to move it from metadata to a standard member variable, like Forest.RA. Otherwise we would have to document our own metadata.

julianbautista commented 4 years ago

I keep all columns names from the catalogs for now, literally only changing the names of RA and DEC. Of course this is easy to change at any point.

andreufont commented 4 years ago

I think the fact that you had to change it for RA,Dec is telling you that this should not be some optional meta-data, but key variables that need to be always declared. I could open a new issue about this if you preferred.

julianbautista commented 4 years ago

It's fine, I'm happy with any implementation of this.

andreufont commented 2 years ago

@Waelthus , @iprafols - this is probably one of the next big things we should tackle in Picca. There is still left-over SDSS terminology spread all over. Maybe we should ask a volunteer to tackle this?

iprafols commented 2 years ago

This requires quite a bit of thought. From my point of view, it means doing something similar to the delta_extraction branch but for the computation of the correlation functions. There is currently much code repeated between the cf.py and xcf.py files that makes maintaining and updating the code troublesome. It would be great to find a volunteer to start thinking about how to do this and once we have a clearer picture of the changes needed, see how we should address them

andreufont commented 2 years ago

We could try to do a major reshuffling of the correlation measurements, but I was hoping that we can do something about the SDSS variables in a much shorter timescale. Let's talk about this at a Picca call.

Waelthus commented 2 years ago

We could try adding the new names (eg los_id) everywhere relevant in the shared backbone (e.g. io.py/data.py...) without removing the old names (e.g. thingid which is currently used at >200 positions in >20 files) just yet. That would allow us to still use all routines, but also enable us to change things gradually. Naively, I'd assume that a major rewrite of cf related business might take too long to be usable even for y1 analyses, so we should try to get things cleaned up before or during the rewrite...

andreufont commented 2 years ago

@Waelthus, I like this approach. We can progressively clean up and move towards the new naming convention without breaking the existing code.

Waelthus commented 9 months ago

this will be fixed with Lya2pt

igmhub / picca

Update Picca for DESI: modify use of PLATE in many pieces of code #700