skyfielders / python-skyfield

Elegant astronomy for Python
MIT License

Data files are loaded at the first runtime #267

Closed brunobord closed 5 years ago

brunobord commented 5 years ago

Rationale

As a workalendar maintainer, I'd like to drop ephem from our requirements in favor of skyfield. Workalendar is a Python library and toolkit to compute holidays for more than 230 calendars around the world. To determine some of these holidays, we need to know the date of the Spring Equinox, for example.

But my main issue here is that the following snippet takes anywhere from several seconds to a few minutes on the first run, depending on your network speed:

from skyfield.api import load
load('de421.bsp')

Fortunately, on subsequent runs the corresponding files will be found on disk and won't be downloaded, so it'll run much faster... until the files expire, and then we're back to the download dance again.

And you're prone to various problems when relying on an online data source: the network can break, remote servers can be inaccessible, etc. When you're in the middle of running a program, this might be an issue.

I'm also wondering what would happen in a multi-user environment (a web application?): what happens if several threads request the same files? Wouldn't they be downloaded several times, once for each thread?

Ideas / solutions

1 - Include the files in workalendar's package

This has major drawbacks:

2 - Download the files at the setup step

Pros: the package archive would keep its modest size.

Cons: It would mean that our setup.py would include scripting. This is not necessarily a blocker here, but it's apparently not the direction the Python community is leaning, since "modern" projects prefer a declarative setup.cfg.

3 - Build a workalendar-data package

Cons:

Pros:

Related issues

I have no idea (yet) what license these files are under, or whether I'm authorized to embed them in a Python package and publish them on PyPI.

If it's possible, I think it could be an interesting opportunity to add this "data package" to the skyfielders organization: instead of publishing a workalendar-data, why not have a skyfield-data package? I reckon it might be tedious, because there are a ton of USNO files, and it would be complicated to know where to stop adding data files.

Conclusion

I'd like to know the skyfield maintainer's (@brandon-rhodes) opinion on this subject. As far as I can tell, the 3rd solution seems the best, but it raises issues, so it's not that simple.

Clear skies!

brandon-rhodes commented 5 years ago

Here are some initial thoughts:

  1. PyPI is not historically fond of Python packages with embedded binary data; it's not what their infrastructure is set up for, and is costly to them.
  2. My own initial attempts to package ephemerides as Python packages strained their infrastructure and were not a good fit for the pip infrastructure, I didn't think.
  3. So I pivoted to raw files.

I have recently suspected that it was a poor choice for me to include auto-download in the load() function. Maybe I should have just had folks grab the files on their own? The loading code does try to use file locks to prevent n threads from pulling n copies of the file, but, really, the best approach is to make sure the file is present before calling Skyfield.
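To illustrate the "n threads, one download" idea, here is a minimal sketch. It is not Skyfield's actual locking code: it uses an in-process threading.Lock as a stand-in for a file lock (so it only protects threads of one process), and ensure_file and its fetch argument are hypothetical names invented for this example.

```python
import os
import threading

# Assumption: a single in-process lock stands in for a real file lock.
_download_lock = threading.Lock()

def ensure_file(path, fetch):
    """Return `path`, calling `fetch(tmp_path)` at most once to create it.

    `fetch` is any callable that writes the file to the path it is given.
    """
    if os.path.exists(path):
        return path
    with _download_lock:
        # Re-check inside the lock: another thread may have finished meanwhile.
        if not os.path.exists(path):
            tmp_path = path + '.download'
            fetch(tmp_path)
            # Rename only once the download is complete, so that readers
            # never see a partially written file.
            os.replace(tmp_path, path)
    return path
```

Calling ensure_file() once at application startup, before any worker threads exist, sidesteps the question entirely, which is the "make sure the file is present before calling Skyfield" approach.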

Maybe I should work to re-expose Skyfield's logic as two pieces: one that would turn "I want DE421" into a URL, and then a second piece that loads the file from disk, and encourage people to do the download of the URL their own way?
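A rough sketch of what that two-piece split could look like. Everything here is hypothetical: url_for, open_local and BASE_URL are made-up names, and the URL is a placeholder rather than the location Skyfield really downloads from.

```python
import os

# Hypothetical base URL, for illustration only.
BASE_URL = 'https://example.org/ephemerides/'

def url_for(name):
    """Piece 1: turn a request like 'DE421' into a download URL."""
    filename = name.lower()
    if not filename.endswith('.bsp'):
        filename += '.bsp'
    return BASE_URL + filename

def open_local(path):
    """Piece 2: load a file that is already on disk; never touch the network."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            '%s is missing; download it first from %s'
            % (path, url_for(os.path.basename(path)))
        )
    with open(path, 'rb') as f:
        return f.read()  # a real loader would parse the SPK data instead
```

With this shape, the download step between the two pieces is entirely the caller's business: curl, a Dockerfile step, a company mirror, or anything else.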

In any case, my guess is that you'll want users to get the file themselves, or have a little script you offer them to grab it. None of your options is likely to be a good fit: #1 and #3 because they impose costs on PyPI, and #2 because, as you note, downloading and storing a file somewhere on disk isn't the job of the setup script.

I'd give the application writer the job of getting the file. Give them the URL, and have them call your init function where they activate your library with a path to where they downloaded the file. It's generally trouble to try to guess how the top-level app will prefer to have I/O take place (for example: the network might not be available during setup.py, depending on whether they're in a container, and what its network settings are).
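As a sketch of that "init function with a path" idea, under the assumption that the library keeps the path in module-level state. The names init and get_ephemeris_path are invented for this example and are not part of workalendar or Skyfield.

```python
import os

_ephemeris_path = None  # set once by the application, read by the library

def init(ephemeris_path):
    """Hypothetical activation hook: the application says where the file is."""
    global _ephemeris_path
    if not os.path.isfile(ephemeris_path):
        raise FileNotFoundError('no ephemeris file at %r' % ephemeris_path)
    _ephemeris_path = os.path.abspath(ephemeris_path)

def get_ephemeris_path():
    """Used internally by the library; fails loudly if init() was never called."""
    if _ephemeris_path is None:
        raise RuntimeError('call init(path_to_ephemeris) before computing dates')
    return _ephemeris_path
```

The library then does no I/O guessing of its own: a missing file surfaces as an error at startup instead of as a surprise download in the middle of a request.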

mworion commented 5 years ago

I would like to add some remarks as well:

As I am building a larger application on Skyfield, I need some more files besides de421.bsp: especially the time-related files like Leap_Second and the deltat files, and, since I would like to show some stars, the Hipparcos catalogue as well. Finally, my intention in using the Skyfield framework was to show satellite information, which needs additional data too. So in total, a lot of files need to be downloaded.

Some of the data goes out of date over time: deltaT and leap second data after some months, the satellite data after some days. From my point of view there is no real way around downloading them in the application.

Here is my approach: the download is done when the application is first started, which takes some time. After that, I show the user which data is going out of date. The user can decide whether to have expiry on or off, meaning you need connectivity for the first install but can then also work offline for as long as that is acceptable to you.
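A minimal sketch of such an "expiring on or off" check. The assumptions are mine: the file's modification time is used as a proxy for its age (this is not how Skyfield decides expiry), and usable, max_age_days and allow_expired are names made up for illustration.

```python
import os
import time

def usable(path, max_age_days, allow_expired=False, now=None):
    """Return True if the file at `path` may be used.

    A file older than `max_age_days` counts as expired; with
    `allow_expired=True` (offline mode) it is accepted anyway.
    """
    if not os.path.exists(path):
        return False  # nothing on disk: an initial download is unavoidable
    now = time.time() if now is None else now
    age_days = (now - os.path.getmtime(path)) / 86400.0
    return allow_expired or age_days <= max_age_days
```

Satellite data might use a max_age_days of a few days and the time files a few months, matching the different lifetimes mentioned above.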

I'm thinking of including an initial set of files in the application. This increases the size, but it is about 50MB anyway, so still acceptable. I distribute it under releases on GitHub, which currently supports that size. Before I start distributing the files, the license and other aspects have to be sorted out.

Michel

brunobord commented 5 years ago

PyPI is not historically fond of Python packages with embedded binary data; it's not what their infrastructure is set up for, and is costly to them.

Granted, yes. But I've also been looking into this and found that the package file size limit is allegedly 60MB. What workalendar needs is only about 17~18MB (de421.bsp is sufficient, IMO), so the package would weigh in well below the standard limit.

brandon-rhodes commented 5 years ago

Oh, they have a size limit now? Then it makes perfect sense for you to distribute binary data as long as you stay within their guidelines. Take a look at load_bundled_npy() in skyfield/functions.py if you want to see how Skyfield itself accesses a distributed data file.

So it sounds like Option #1 will be simplest? Or are you going to try Option #3?

Bernmeister commented 5 years ago

@brunobord In ticket #123 I gave a rundown of how to make a subset of de421.bsp, which may help.

brunobord commented 5 years ago

I've started to work on a skyfield-data package, but I've been held up by astronomical calculations. In workalendar, some Asian calendars use solar terms to determine holidays, and (py)ephem does a great job with them. Alas, with skyfield, I haven't been able to write a functionally equivalent solar term function (and yes, I've struggled a lot).

I think I'll open a separate issue here, but if I can't replace ephem, I'm afraid it'll slow down my "data" initiative.

brandon-rhodes commented 5 years ago

@brunobord I am not familiar with solar terms, but if you would like to share sample PyEphem code that computes the value you need, I might be able to suggest how to do the same in Skyfield.

brunobord commented 5 years ago

well... You'll see that the code I'm using is... yours :o)

brunobord commented 5 years ago

Introducing: Skyfield Data!!!!

See the usage in the README file (and the project page on PyPI). The next improvement(s) will be:

But... it looks like it's working on my side, so that's great.

brunobord commented 5 years ago

Closing... :o)

brandon-rhodes commented 5 years ago

Thanks for the update!