lincc-frameworks / nested-dask

Connector project to enable Dask on Nested-Pandas
https://nested-dask.readthedocs.io/en/latest/
MIT License

Science Driver: Running a Periodogram over all ZTF Lightcurves #42

Open dougbrn opened 3 months ago

dougbrn commented 3 months ago

Describe the desired workflow.

Including details and what needs to be run, on what system (if relevant), and any technologies that will be used alongside Nested-Dask

This driver is straightforward: we simply want to run a common analysis function at the scale of ZTF (~4.5 billion lightcurves). A periodogram (likely just a single band, via Astropy's implementation) seems like a sensible choice for its ubiquity. It's preferred to run this on some kind of distributed system, like the PSC or Fornax, and to do this analysis with LSDB as well.

How will doing this driver create impact?

Does this enable scientific work that wasn't possible (or just difficult) before? Will this test the scalability and robustness of Nested-Dask/Nested-Pandas?

The sole impact of this is to assess the scalability limitations of our current implementation of Nested-Dask/Nested-Pandas. Will we be able to get through the full workflow and what issues will we encounter?

Does this require any new functionality to be added to Nested-Dask?

E.g. Are there API functions needed that are not present (to the best of your knowledge)? Independent tickets should be created for these features and linked back to this issue.

We should be able to do this with the current functionality.

Should this produce documentation?

Can we capture the result of this driver in some way? For example, as a tutorial or longer-form notebook (held in a different repository)

This should be kept, at a minimum, as a long-form notebook in a different repository (notebooks_lf), or directly within the Nested-Dask and/or LSDB docs. There is potential for this to inform some new best practices for working at scale in our main documentation.

dougbrn commented 2 months ago

The notebook that will be run for this is here: https://github.com/lincc-frameworks/notebooks_lf/blob/main/ztf_periodogram/ztf_periodogram_epyc.ipynb

A spreadsheet for run tracking is here: https://docs.google.com/spreadsheets/d/19-GexwAu1TBunGKkCNMU7c3uwhbcLwZXteLLvT3Q48w/edit?gid=0#gid=0

Timing for the above is measured by how long the histogram plot cell towards the end of the notebook takes to run.

dougbrn commented 2 months ago

@wilsonbb will be testing on epyc with the locally available ztf_axs dr14, with a switch to dr20 when available.

@hombit will be testing on psc after downloading ztf dr20 to it.

@dougbrn will be testing on a local macbook via https.

nevencaplar commented 2 months ago

Please also test with baldur (which can go "epyc").