nanograv / PINT

PINT is not TEMPO3 -- Software for high-precision pulsar timing
Other
118 stars 101 forks source link

Don't download datafiles in setup.py #191

Closed cdeil closed 5 years ago

cdeil commented 7 years ago

You're currently doing downloading data files in setup.py: https://github.com/nanograv/PINT/blob/master/setup.py#L43

# TODO: need to check if datafile have all the datafile inplace. We do not want
# to download those files everytime we reinstall PINT
# Download data files
data_urls = [
        "ftp://ssd.jpl.nasa.gov/pub/eph/planets/bsp/de405.bsp",
        "ftp://ssd.jpl.nasa.gov/pub/eph/planets/bsp/de421.bsp",
        "ftp://ssd.jpl.nasa.gov/pub/eph/planets/bsp/de430t.bsp",
        "http://naif.jpl.nasa.gov/pub/naif/generic_kernels/pck/earth_latest_high_prec.bpc",
        "http://naif.jpl.nasa.gov/pub/naif/generic_kernels/pck/pck00010.tpc",
        "http://naif.jpl.nasa.gov/pub/naif/generic_kernels/lsk/naif0012.tls",
        "http://naif.jpl.nasa.gov/pub/naif/generic_kernels/pck/de-403-masses.tpc"
        ]
data_files = []
data_dir = 'datafiles'
for u in data_urls:
    os.system("wget -N -c -P pint/%s %s" % (data_dir, u))
    data_files.append(os.path.join(data_dir,u.split('/')[-1]))

Like the TODO comment says, that's no good, and I'd like to help implement something better.

The question is how exactly to do it.

How about using a shell environment variable PINT_DATA instead that the user has to set, that points to the directory where the files shall be located. And then to provide a download_data function in pint, and load data via PINT_DATA from examples and the tests?

That's what we do in Gammapy for now. It has the advantage that it decouples the data from your software (setup.py), so it can also be done e.g. if you publish pint on PyPI some day and someone wants to use the 200 MB of data files with an installation coming from there.

Thoughts?

demorest commented 7 years ago

Hi @cdeil , we purposefully removed a PINT environment variable a while back, I don't think we want to put it back in (at least, I don't). For background see discussion in #115 . These are a few different types of data PINT needs. Some (like IERS data) are always being updated and we use the astropy download system for these. The ones currently in setup.py should only need to be downloaded once. Changing the download scheme for these would be fine, but I think it would be good to have a system-wide install (rather than per-user) by default.

cdeil commented 7 years ago

Any chance to get a change here shortly?

I think there was some other issue discussing a proper long-term solution for config handling in PINT, which is fine and probably will take quite some time (because it's hard to make it work for the many different use cases how people will install and use PINT and download and update data).

My main point here is that the file download should be taken out from setup.py. At the moment every time I come to PINT I have to open up setup.py and uncomment the lines that download the files (because they hang for me, because the FTP is blocked from my institute), which is annoying.

So maybe there could be a small change to setup.py soon, moving that file download part to a separate bash or Python script? I'd be willing to send a pull request if that sounds OK to you as a short-term solution.

paulray commented 7 years ago

Is this still an open issue? It works fine for me and it works in travis-ci. If there is some small change that will allow setup.py to run smoothly when you are offline or on a network that doesn't allow FTP downloads or just want to minimize network traffic, that would be useful.

cdeil commented 7 years ago

Is this still an open issue?

Yes.

Don't you see the same thing? Everytime you use setup.py (e.g. to build or install pint), it will re-download all these files and you sit there and wait for 5 min (or longer if you're on bad wifi), no?

So this isn't a problem for you while coding on pint because you never use setup.py?

paulray commented 7 years ago

I do get some downloads when I do setup.py, but it seems like it is mostly cached, and ever since you showed me python setup.py develop --user I rarely need to run setup.py. That develop mode is great!

luojing1211 commented 7 years ago

If the new Solar_system_ephemerides branch gets merged. This would not be downloaded.

-Jing Luo

On Tue, Mar 7, 2017 at 1:20 PM, Paul Ray notifications@github.com wrote:

I do get some downloads when I do setup.py, but it seems like it is mostly cached, and ever since you showed me python setup.py develop --user I rarely need to run setup.py. That develop mode is great!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/nanograv/PINT/issues/191#issuecomment-284826997, or mute the thread https://github.com/notifications/unsubscribe-auth/AFFAg8Lt3lh0VaPtSNbNYgNt_6SnfvAzks5rja3mgaJpZM4K83ZJ .

cdeil commented 7 years ago

@paulray - But we want to reach the point where users can just install pint via

pip install pint
conda install pint

and once you make stable releases, distros will pick up pint and there will be

apt-get install pint
yum install pint
port install pint

If I understand correctly, there data files you have aren't very small (PyPI has a tarball upload limit of 30 MB I think), so aren't appropriate to distribute with the pint package.

And also there are some files where users or sysadmins might want to update to newer versions.

That sounds to me a lot like coupling (requiring) data file download at pint install time (by putting it in setup.py) is not a good solution, it should be moved out somehow.

@paulray - Is that agreed?

If yes, we could brainstorm on minimal changes to make that possible here. One possible milestone could be that we get a travis-ci build working where we install pint without downloading the data files, and then by calling another script or function in pint download or update the data after.

cdeil commented 7 years ago

I just tried making a tarball for PINT for the first time:

$ python2 setup.py sdist

I don't know if it's working, I didn't try to install pint from it. But I think it has 180 MB of data in it:

$ ls -lh dist/pint-0.5.1+88.g7dcbc68.tar.gz 
-rw-r--r--  1 deil  staff   190M Mar  7 20:32 dist/pint-0.5.1+88.g7dcbc68.tar.gz
paulray commented 7 years ago

Unfortunately, we are never going to be able to pip install pint because of the package naming conflict you pointed out earlier. I think we gave up on that, figuring we'll have a relative small user base and they can just install from git or maybe a tarball.

The big files seem to be all pint/datafiles/*.bsp and those won't be downloaded once @luojing1211's PR is merged. Maybe that will be all we need?

cdeil commented 7 years ago

The big files seem to be all pint/datafiles/*.bsp and those won't be downloaded once @luojing1211's PR is merged. Maybe that will be all we need?

I don't see that PR. Can you link to it (maybe later if it's not opened yet)?

luojing1211 commented 7 years ago

PR #209

On Tue, Mar 7, 2017 at 3:56 PM, Christoph Deil notifications@github.com wrote:

The big files seem to be all pint/datafiles/*.bsp and those won't be downloaded once @luojing1211 https://github.com/luojing1211's PR is merged. Maybe that will be all we need?

I don't see that PR. Can you link to it (maybe later if it's not opened yet)?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nanograv/PINT/issues/191#issuecomment-284872675, or mute the thread https://github.com/notifications/unsubscribe-auth/AFFAg_oFGvbMaelukZwmCcMe_qBsv3Urks5rjdKggaJpZM4K83ZJ .

cdeil commented 7 years ago

@luojing1211 - Thanks for the link.

I had a quick look and also at the older discussions in #115.

It's pretty complicated. Is there a docs page describing to users how download / update of data files works, and which get picked up from where?

If you're still considering changes, my recommendations would be:

In any case, I'll try again after #209 is merged. And feel free to ignore me any time here, if you all are happy with the current scheme that does parts via setup.py.

paulray commented 7 years ago

PR #209 has been merged and now pint/datafiles is much smaller, at least. Running setup does check some files, but seems to be smart about not downloading files that are not new:

mcxr2 : 485>python setup.py install --user
Using cython...
URL transformed to HTTPS due to an HSTS policy
--2017-03-08 10:22:13--  https://naif.jpl.nasa.gov/pub/naif/generic_kernels/pck/earth_latest_high_prec.bpc
Resolving naif.jpl.nasa.gov... 137.78.232.95
Connecting to naif.jpl.nasa.gov|137.78.232.95|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘pint/datafiles/earth_latest_high_prec.bpc’ not modified on server. Omitting download.

Downloaded file 'pck00010.tpc' looks good.
Downloaded file 'naif0012.tls' looks good.
Downloaded file 'de-403-masses.tpc' looks good.

So, some progress. At some point a better scheme for data files should be worked out.

luojing1211 commented 7 years ago

Astropy Auto downloads the IERS data. Could we borrow astropy's scheme?

paulray commented 6 years ago

Maybe should revisit this issue now to see if we can remove file downloads from setup.py?

aarchiba commented 5 years ago

Re-revisit? Note #383 and #376

aarchiba commented 5 years ago

Fixed by #502