skyfielders / python-skyfield

Elegant astronomy for Python
MIT License

Downloaded files should (maybe) land in a per-user directory instead of CWD #128

Closed pkgw closed 6 years ago

pkgw commented 7 years ago

Skyfield currently downloads data files to the current directory. It might be better to download those files to a per-user cache directory, so that multiple applications using Skyfield can share data and so that you don't end up with a bunch of .bsp files lying around the filesystem if you run Skyfield-based code from a variety of starting directories.

The Python appdirs module is a standard way to determine per-user cache directories. Alternatively, astropy has freestanding code to come up with a cache directory path in its config.paths module.
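
As a rough illustration of this suggestion, the sketch below resolves a per-user cache directory. It calls `appdirs.user_cache_dir` when appdirs is installed and otherwise falls back to a simplified, hand-written approximation of the usual platform conventions. The application name "skyfield" and the helper name `per_user_cache_dir` are assumptions made for this example; Skyfield itself does not do this.

```python
import os
import sys


def per_user_cache_dir(appname="skyfield"):
    """Return a per-user cache directory path for appname (hypothetical helper)."""
    try:
        # appdirs is a third-party package; use it when it is available.
        import appdirs
        return appdirs.user_cache_dir(appname)
    except ImportError:
        pass
    # Simplified fallback that only approximates the common conventions.
    home = os.path.expanduser("~")
    if sys.platform == "win32":
        base = os.environ.get("LOCALAPPDATA") or os.path.join(home, "AppData", "Local")
    elif sys.platform == "darwin":
        base = os.path.join(home, "Library", "Caches")
    else:
        base = os.environ.get("XDG_CACHE_HOME") or os.path.join(home, ".cache")
    return os.path.join(base, appname)


print(per_user_cache_dir())
```

The function only computes a path; it does not create the directory or download anything.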

pkgw commented 7 years ago

There is some early discussion of this issue in #126, which was intended to be about a related but different topic. CC @davidmikolas @tritium21.

Re @davidmikolas's comment in #126: is the use case of "inexperienced computer user ... running Skyfield on a Raspberry Pi" really something that makes sense?

davidmikolas commented 7 years ago

While applications come with hardware requirements, it would be a shift in Skyfield's scope of users if it could potentially fill up 1 or 2 GB of hard drive space without the user doing it themselves, and without some users knowing about it. It's a non-trivial amount of permanent storage. If it's something that can be turned on by the user, then it's fine. If it just starts happening and the location isn't clearly defined and readable in a script, there could be problems for some.

Can this be done in a way that is easily understandable to all users of all levels of skill, and so that they all have an easy way to find out how much permanent storage this one python package is using?
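
One way a user could answer the "how much storage" question for themselves is to total the ephemeris files under a directory. The sketch below is a standalone helper and not part of Skyfield; the name `ephemeris_disk_usage` and the choice to match only the `.bsp` suffix are assumptions for illustration.

```python
import os


def ephemeris_disk_usage(root, suffix=".bsp"):
    """Return (total_bytes, paths) for files under root whose names end in suffix."""
    total = 0
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(suffix):
                path = os.path.join(dirpath, name)
                try:
                    total += os.path.getsize(path)
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                found.append(path)
    return total, sorted(found)
```

Calling `ephemeris_disk_usage(os.path.expanduser("~"))` would walk the whole home directory, which can be slow, so pointing it at a project or data directory is usually more practical.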

Side note about the partial quote: What does not "make sense" about running on smaller computers or controllers of astronomical or other hardware or even embedded devices?

pkgw commented 7 years ago

Some examples of packages that default to saving downloaded data in a per-user directory:

If you're concerned about it not being clear where the files land, the default message about files being downloaded could mention the target directory. Of course this doesn't help if that message is being suppressed.

It's not that it doesn't make sense to run Skyfield on embedded devices, it's that a user who's capable of setting up a Raspberry Pi with Skyfield and running their own code on it seems quite unlikely to be an inexperienced computer user.

davidmikolas commented 7 years ago

A message might work, except when it's not there, as you point out. But it might be a good idea. PyENCODE has a default value for cache_dir if you don't specify one. That behavior could be added to Loader(). Astronomy is one of those fields that attracts a lot of lay people and young people - people who inherit a RPi from someone, or just ask their programmer-friend to set it up for them. People who will never know of the astropy 'ecosystem'.

brandon-rhodes commented 7 years ago

I would be happy to expand the documentation to make more prominent mention of the optional argument to Loader('~/data-dir') so that people are more likely to take advantage of it to prevent duplicate files from being downloaded! Please simply suggest where in the docs that advice should be repeated (since you might think of places to put it that I won't), and I'll add it in.
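
For readers unfamiliar with the optional argument, here is a minimal sketch of how a script might use it. `Loader` and its directory argument come from Skyfield; the wrapper name `make_shared_loader` and the default directory `~/skyfield-data` are assumptions chosen for this example.

```python
def make_shared_loader(directory="~/skyfield-data"):
    """Return a Skyfield loader that keeps its files in directory."""
    # Imported lazily so the sketch can be read without Skyfield installed.
    from skyfield.api import Loader
    return Loader(directory)


# Typical use (the first call would download de421.bsp into the directory):
#   load = make_shared_loader()
#   planets = load('de421.bsp')
```

Every script that builds its loader this way reads and writes the same directory, so a given ephemeris is downloaded only once.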

I am not likely to make something like ~/data-dir the default, however, because even among professional programmers I have run into problems with files that are downloaded to unexpected locations. Even if you print the locations out, it's generally a month later that they run into a full or nearly-full disk, and by that point (a) they have forgotten where the files went and (b) they might even have forgotten that they ran a Skyfield command that used up lots of disk. I would prefer for them to be able to find the big files by searching the area where they have been making and running projects and scripts, instead of hiding the files off somewhere else where they will not remember having put them. I have seen too much confusion, even in the past few weeks, from programmers faced with that situation!

But more docs are a good idea, so please suggest all the places I should mention the idea!

Three more thoughts.

  1. Putting the files in a true cache directory would be a bad idea, because real caches get cleaned by both tools and users, and it's not clear that people want to have to re-download big ephemeris files at random moments because a cache was cleaned.
  2. I wonder what would happen if we had the rule that we auto-detect and use a directory, if you've opted in by creating one? Like "you can create ~/skyfield-data or ~/.skyfield-data and we'll auto detect and use that if you have it." That way it's opt-in, but doesn't require every Skyfield script they run to have to supply the argument to Loader().
  3. But: what happens when someone writes a script that says load('de421.bsp'); os.stat('de421.bsp') that works for them but then breaks, though only for a fraction of users, when run on other folks' systems? I am wary of the idea that a library should change its behavior based on information sitting around on a system that might vary between users. That can be okay for applications, but tends to go more poorly for libraries.
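
The opt-in rule from the second thought could look roughly like the sketch below. It is not Skyfield behavior. The directory names are the two floated in that item, while the function name `detect_data_dir` and the order in which the names are checked are assumptions.

```python
import os

# Directory names floated in the discussion; checked in this (assumed) order.
OPT_IN_NAMES = ("skyfield-data", ".skyfield-data")


def detect_data_dir(home=None, cwd=None):
    """Return an opted-in data directory if one exists, else the working directory."""
    home = os.path.expanduser("~") if home is None else home
    cwd = os.getcwd() if cwd is None else cwd
    for name in OPT_IN_NAMES:
        candidate = os.path.join(home, name)
        if os.path.isdir(candidate):
            return candidate
    return cwd
```

The sketch also makes the third thought concrete: the same script gets a different download location depending on whether one of these directories happens to exist, so code that later assumes the file is in the working directory would fail on some systems and not others.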

tritium21 commented 7 years ago

My suggestion: if the user loads with a relative path, and there is a magic directory they have opted in to, then it's a matter of creating a search path. Is it in the magic directory? Is it in cwd? Is it in $HOME? (I'm dubious on this one.) If the path given is absolute, you load that file, and only that file, and raise loudly if it's not there.
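
A minimal sketch of that lookup, using only the standard library, might look like this. The function name `resolve_data_file` is hypothetical, and the caller supplies the search order (for example the magic directory first, then cwd); nothing here reflects how Skyfield actually resolves files.

```python
import os


def resolve_data_file(filename, search_dirs):
    """Locate filename, searching search_dirs in order when it is relative."""
    if os.path.isabs(filename):
        # Absolute paths are never searched for: that exact file, or an error.
        if not os.path.isfile(filename):
            raise FileNotFoundError(filename)
        return filename
    for directory in search_dirs:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None  # not found anywhere; the caller decides whether to download
```

Returning `None` for a missing relative path leaves the download decision, and the choice of which directory to download into, to the caller.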

I would look at the appdirs module and see where it stores application data, to decide where a magic directory would go. Yes, Skyfield is a library, not an application, but I think it's a good starting place.

For this kind of thing, I think the golden rule is "What would annoy the sysadmin the least?" That is possibly the best solution.

JoshPaterson commented 6 years ago

What if the docs just encourage users to store their ephemerides in a central place manually and then reference them locally? When I need a new ephemeris I either let Skyfield download it to my cwd and then move it to my central ephemeris folder by hand, or I just download it manually with my browser. There aren't that many different ephemerides, so I don't find it to be a problem to do this.

Most of my load commands look something like this:

load(r'C:\Users\Josh\Scripts\Ephemerides\de430t.bsp')

The advantages of this are that the ephemerides can't take up a lot of space without me knowing, and if I send someone else a script it should be pretty clear that this line would need to be changed and why.

brandon-rhodes commented 6 years ago

Given that the current version of the documentation puts this idea in one of its first sections:

http://rhodesmill.org/skyfield/files.html#specifying-the-download-directory

— I am going to close this issue for now. Though I understand the reasons behind the idea, I do not plan to make a special data directory the default for all users.