thesofakillers / nowcastlib

🧙‍♂️🔧 Utils that can be reused and shared across and beyond the ESO Nowcast project
https://giuliostarace.com/nowcastlib
GNU General Public License v3.0

skyfield calculations may be overly accurate for requirements at the cost of computation #4

Open thesofakillers opened 3 years ago

thesofakillers commented 3 years ago

Have not analyzed big O performance but it is slow enough to be a nuisance for larger datasets, especially since this calculation needs to be repeated across train/test sets and perhaps even across folds.

The following lines need addressing: https://github.com/thesofakillers/nowcastlib/blob/b24e0d83e553fc8c438c619cb5f090f2f5ed70c3/nowcastlib/pipeline/process/postprocess/generate.py#L130-L133

thesofakillers commented 3 years ago

Upon closer look, while the lines pointed out above do need addressing, the skyfield library seems generally quite slow for what we are asking of it. Both the _t_since_sunset and _sun_elevation functions are slow for large datasets, with the latter taking the crown; these lines also cause speed issues:

https://github.com/thesofakillers/nowcastlib/blob/f054704f84a91c5f4054101a4f0ec31961712604/nowcastlib/pipeline/features/generate.py#L182-L186

Skyfield was selected because the popular library pyephem recommends it as its successor.

Further attention should be dedicated to determining whether skyfield remains the right choice, or whether a faster alternative is needed.

brandon-rhodes commented 3 years ago

I'll be interested to hear which component in that call has the higher cost. Is there a way you can produce that series of times, I wonder, without creating n separate Python datetime objects? If the times are evenly spaced on the clock or calendar there are other approaches in Skyfield for creating the time array.

In the meantime, if you split your call into two lines with a temporary variable:

dtseries = datetime_series.dt.tz_localize("UTC").dt.to_pydatetime()
result = skyfield_ts.from_datetimes(dtseries)

— then you could separately measure the cost of the Pandas operation from the Skyfield operation.
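One way to act on this suggestion is to wrap each stage in time.perf_counter() calls. The series below is a synthetic stand-in for the project's datetime_series (the dates and cadence are made up); the Skyfield stage is left as a comment since it needs a Timescale object:

```python
import time
import pandas as pd

# Hypothetical stand-in for the series used in nowcastlib.
datetime_series = pd.Series(pd.date_range("2021-01-01", periods=100_000, freq="min"))

t0 = time.perf_counter()
dtseries = datetime_series.dt.tz_localize("UTC").dt.to_pydatetime()
t1 = time.perf_counter()
print(f"pandas conversion: {t1 - t0:.3f} s")

# result = skyfield_ts.from_datetimes(dtseries)  # time this second stage the same way
```

Timing the two stages separately would show which conversion dominates at a given input size.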

thesofakillers commented 3 years ago

@brandon-rhodes Hi, thanks for commenting! Wasn't expecting the library creator to come and help 😅

The times are indeed evenly spaced so this could be done, as you say, with other approaches from skyfield. That was just a quick and dirty solution without having to dig too deeply into the documentation.

I've updated my second comment with the complete section of code causing problems. Evaluating line by line as you suggested sheds some light on which operations are more costly. You can see my experimentation in this notebook, where I use %lprun to profile the same function with different input time series.

While for small time series the pandas operation is more costly than skyfield's from_datetimes, this flips for larger time series. More importantly, however, the greatest slowdown is caused by the location.at() call, which takes between 60% and 80% of the execution time depending on the input size.

In the notebook, the time series I try are of length 1,000, 10,000 and 100,000. I wasn't able to try 1,000,000, since execution would not complete. For this particular project, time series of this size (millions of timestamps) are what we are concerned with.


Thanks again for the interest! I hope this answers your question :).

We would really appreciate if you have any suggestions on how to do this in a smarter way! We're trying to get the sun's altitude at each timestamp for millions of timestamps, with the timestamps being evenly spaced (regular sample rate basically).

We are also interested in determining the most recent sunset for each timestamp (code here), although this has been less problematic.
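As an aside, if the sunset times are available as a sorted array, the most-recent-sunset lookup can be vectorized with numpy.searchsorted instead of searching per timestamp. The values below are toy POSIX-second numbers purely for illustration:

```python
import numpy as np

# Toy data: sunset times and sample timestamps as POSIX seconds (hypothetical values).
sunsets = np.array([100.0, 200.0, 300.0])
stamps = np.array([150.0, 250.0, 350.0])

# Index of the most recent sunset at or before each timestamp.
idx = np.searchsorted(sunsets, stamps, side="right") - 1
t_since_sunset = stamps - sunsets[idx]
print(t_since_sunset)  # [50. 50. 50.]
```

Timestamps that fall before the first sunset would produce an index of -1 and need masking before use.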

brandon-rhodes commented 3 years ago

My guess is that most of the time is invested in computing precession and nutation — the precise angle at which the Earth's poles point at any given moment. The standard nutation series, in particular, has 687 terms of 5 coefficients each which need to be evaluated at every point on your timeline, so if you have one million times then that's something like 5×687×10⁶ multiplies and then all the necessary adds to put them together.
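Spelled out, that multiply count works out to:

```python
coeffs_per_term = 5
nutation_terms = 687
n_times = 1_000_000

multiplies = coeffs_per_term * nutation_terms * n_times
print(f"{multiplies:,}")  # 3,435,000,000 multiplies for a million-point timeline
```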

What kind of accuracy are you after? The full series is only useful if you're going for observatory-grade angles that are useful for pointing a radio telescope (which is the accuracy Skyfield aims for by default).

Since your times are so close together, my guess is that we could evaluate nutation and precession daily, or only a few times a day, and linearly interpolate between them, to get better speed. But that will introduce error, so we need first to know how exact an answer you need on the Sun's position in the sky.
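The coarse-grid-plus-interpolation idea can be sketched as follows. The expensive per-time computation is replaced here by a synthetic sine (a stand-in, not Skyfield's actual nutation series), and the 10-minute coarse step is an arbitrary choice for illustration:

```python
import numpy as np

def expensive_angle(t_seconds):
    """Stand-in for a slow per-time computation (synthetic daily sine here)."""
    return np.sin(2 * np.pi * t_seconds / 86_400.0)

# Dense timeline: one sample per second over a day (86,400 points).
dense_t = np.arange(0, 86_400, dtype=float)

# Evaluate the expensive function on a coarse grid only (every 10 minutes) ...
coarse_t = np.arange(0, 86_400 + 600, 600, dtype=float)
coarse_angle = expensive_angle(coarse_t)

# ... then linearly interpolate onto the dense timeline.
dense_angle = np.interp(dense_t, coarse_t, coarse_angle)

# Worst-case deviation from evaluating the function at every dense point.
err = np.max(np.abs(dense_angle - expensive_angle(dense_t)))
```

For a smooth daily signal, 144 expensive evaluations replace 86,400, and the interpolation error stays small; whether the analogous error in the Sun's position is acceptable depends on the tolerances discussed above.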

What will you be doing with the angles? That can help establish tolerances.

thesofakillers commented 2 years ago

What will you be doing with the angles? That can help establish tolerances.

We are using the angles as additional input data to provide to machine learning models to be used for turbulence forecasting.

What kind of accuracy are you after?

I have asked my colleagues what sort of accuracy we are after and will update this post.

Since your times are so close together, my guess is that we could evaluate nutation and precession daily, or only a few times a day, and linearly interpolate between them, to get better speed.

This seems like a good idea and, depending on our tolerance, it may be sufficient to go this route. I imagine this will be the case.

Thank you again for all the help; let us know if you have any more comments 😊