sdss / sdss_access

Product to dynamically build and download filepaths to SDSS data products
http://sdss-access.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
14 stars 4 forks source link

Environment / state issues when looking up paths from two different releases #34

Open andycasey opened 1 year ago

andycasey commented 1 year ago

I'm routinely looking up paths for SDSS-5 data and SDSS-4 data.

I use a dictionary to lookup instances of SDSSPath(...) so that I don't have to create a SDSSPath object every time I evaluate a path. It looks something like this:

_sdss_path_instances = {}

def path(release, filetype, **kwargs):
  try:
    p = _sdss_path_instances[release]
  except:
    p = _sdss_path_instances[release] = SDSSPath(release)
  finally:
    return p.full(filetype, **kwargs)

But I am encountering very weird behaviour, which I can only conclude must be because of either a strange state issue, or because sdss_access is changing environment variables behind the scenes.

Here is a minimum reproducible example:

import os
from sdss_access import SDSSPath

# create a DR17 instance
p = SDSSPath("dr17")

kwds = {'obj': '2M07591936+1734091', 'apred': 'dr17', 'field': '204+22', 'apstar': 'stars', 'prefix': 'ap', 'reduction': '2M07591936+1734091', 'telescope': 'apo25m'}

# This part works fine.
original_path = p.full("apStar", **kwds)
print(os.path.exists(original_path), original_path)

# Now let's evaluate a SDSS 5 path, but **we are using a different instance**
fake_path = SDSSPath("sdss5").full("mwmVisit", v_astra="0.2.6", cat_id=1, run2d="v6_0_9", apred="1.0")

# Now let's use the original instance to evaluate a DR17 path.
check_path = p.full("apStar", **kwds)
print(os.path.exists(check_path), check_path)

Here is the output:

True /uufs/chpc.utah.edu/common/home/sdss50/dr17/apogee/spectro/redux/dr17/stars/apo25m/204+22/apStar-dr17-2M07591936+1734091.fits
False /uufs/chpc.utah.edu/common/home/sdss50/sdsswork/mwm/apogee/spectro/redux/dr17/stars/apo25m/204+22/apStar-dr17-2M07591936+1734091.fits

Is this the intended behaviour?

I assumed each instance would evaluate paths for that particular release, and that evaluating one path on a different instance would not impact existing instances. Note that this happens even if the second SDSSPath(...) is created in a totally unrelated scope, which makes me think there is some state sharing within sdss_access, or environment variables being changed.

joelbrownstein commented 1 year ago

I can reproduce this issue, although I don't have an explanation:

from sdss_access import SDSSPath

path_dr17 = SDSSPath(release = "dr17")
kwd_dr17 = kwds = {'obj': '2M07591936+1734091', 'apred': 'dr17', 'field': '204+22', 'apstar': 'stars', 'prefix': 'ap', 'reduction': '2M07591936+1734091', 'telescope': 'apo25m'}
path_dr17_full0 = path_dr17.full("apStar", **kwd_dr17)
print(path_dr17_full0)

path_astra = SDSSPath(release = "sdss5")
kwd_astra = {'v_astra':"0.2.6", 'cat_id':1, 'run2d':"v6_0_9", 'apred':"1.0"}
path_astra_full = path_astra.full("mwmStar", **kwd_astra)
print(path_astra_full)

path_dr17_full1 = path_dr17.full("apStar", **kwd_dr17)
print(path_dr17_full1)

print("dr17 paths agree: %r" % (path_dr17_full0 == path_dr17_full1))

Output:

/uufs/chpc.utah.edu/common/home/sdss50/dr17/apogee/spectro/redux/dr17/stars/apo25m/204+22/apStar-dr17-2M07591936+1734091.fits
/uufs/chpc.utah.edu/common/home/sdss50/sdsswork/mwm/spectro/astra/0.2.6/v6_0_9-1.0/spectra/star/00/01/mwmStar-0.2.6-1.fits
/uufs/chpc.utah.edu/common/home/sdss50/sdsswork/mwm/apogee/spectro/redux/dr17/stars/apo25m/204+22/apStar-dr17-2M07591936+1734091.fits
dr17 paths agree: False
joelbrownstein commented 1 year ago

@havok2063 can you have a look at this? It looks like one instance is leaking into the other via the value of the tree version, so the first method call to path_dr17.full() correctly uses dr17 for the tree version, whereas the same instance (i.e. path_dr17) flips to using sdsswork in the second call to path_dr17.full().

havok2063 commented 1 year ago

Hmm, I think the tree instance is in the global namespace of the package but it should be re-generated on each instance of Path. So I'll take a look as soon as I can.

havok2063 commented 1 year ago

I can also reproduce the error. I also found a similar error with the Access class. I think it is an issue with the tree instance in the global namespace. I think moving the tree instantiation to inside the Path object will solve it. I'll work on it soon.

from sdss_access import SDSSPath
s=SDSSPath(release='dr17')
s.full('mangacube', drpver='v2_4_3', plate=8485, ifu=1901, wave='LOG')
'/Users/Brian/Work/sdss/sas/dr17/manga/spectro/redux/v2_4_3/8485/stack/manga-8485-1901-LOGCUBE.fits.gz'

s2=SDSSPath(release='sdsswork')
s2.full('mangacube', drpver='v2_4_3', plate=8485, ifu=1901, wave='LOG')
'/Users/Brian/Work/sdss/sas/mangawork/manga/spectro/redux/v2_4_3/8485/stack/manga-8485-1901-LOGCUBE.fits.gz'

s.full('mangacube', drpver='v2_4_3', plate=8485, ifu=1901, wave='LOG')
'/Users/Brian/Work/sdss/sas/mangawork/manga/spectro/redux/v2_4_3/8485/stack/manga-8485-1901-LOGCUBE.fits.gz'
havok2063 commented 1 year ago

So this is not as trivial as it seems. The way Tree works is to load the environment variable definitions into the user's Python os.environ, designed to mimic the way it works with the tree modules. This way the environment variables are set once in the global namespace, and can be used throughout other scripts. If you want to change your environment between releases, you need to use the replant_tree method on SDSSPath.

kwds = {'obj': '2M07591936+1734091', 'apred': 'dr17', 'field': '204+22', 'apstar': 'stars', 'prefix': 'ap', 'reduction': '2M07591936+1734091', 'telescope': 'apo25m'}

p = SDSSPath("dr17")
p.full("apStar", **kwds)
'/Users/Brian/Work/sdss/sas/dr17/apogee/spectro/redux/dr17/stars/apo25m/204+22/apStar-dr17-2M07591936+1734091.fits'

p.replant_tree('sdss5')
p.full("apStar", **kwds)
'/Users/Brian/Work/sdss/sas/sdsswork/mwm/apogee/spectro/redux/dr17/stars/apo25m/204+22/apStar-dr17-2M07591936+1734091.fits'

p.replant_tree('dr17')
p.full("apStar", **kwds)
'/Users/Brian/Work/sdss/sas/dr17/apogee/spectro/redux/dr17/stars/apo25m/204+22/apStar-dr17-2M07591936+1734091.fits'

I think to redesign this would require a fundamental change to how sdss-tree works in Python. I recommend we leave the functionality as it is currently.

andycasey commented 1 year ago

Thanks for looking into this. I understand, and it would explain the behaviour.

I'm fine for leaving the functionality as-is for a while, but long term I think this is something that should be changed. If we had each SDSSPath have their own internal Tree, and if each Tree expanded variables using some local dictionary of variables when evaluating paths instead of setting environment variables, then it would mean we'd be able to treat Trees (and SDSSPaths) as independent. My intuition is that a single Tree or SDSSPath instance should give reference to everything within that instance, not alter system-wide environment variables to evaluate a path.

At the moment it fails silently by giving the user an incorrect path. In many cases it might be obvious that the path is incorrect because no file exists, but I can easily imagine situations where a user is loading data from one data release (e.g., IPL-1) and another data release (e.g., IPL-2). In that situation, both files would exist and the wrong file would be silently accessed by the path. For anything use case where data are being accessed from different releases (e.g., DR17 processing in Astra), the current behaviour means either replanting Tree on every path evaluation, or reverting to hand-writing paths to make sure they are correct.

I'd be happy to help make changes to sdss-tree to alter this behaviour because I think I will encounter many of these silent cases. (In fact, I had encountered many of them before but I could never pin down the weirdness for why paths were working sometimes, and missing other times).

What do you think? Do you think we should leave the behaviour as-is indefinitely, or is this something that should be changed (on some unspecified timescale)?

havok2063 commented 1 year ago

I think what you suggest would actually create a tighter dependency between the tree and sdss-access packages, rather than the opposite. SDSSPath would be tied directly to a python instance of a specific Tree configuration, rather than using the os environment altogether. We have to be careful here to not discontinue the original uses of sdss_access which was designed to be used with modules managing the environment, in particular at Utah. Changing this behavior would cause a lot of issues in downstream packages. Using the environment variables also still allow people to specify custom locations in their local .bashrc.

There are use cases we need to retain:

The current implementation of sdss_access already allows switching between data releases. You don't need to replant the tree on every path evaluation, only once or twice, to generate all the paths you need in one place.

So I would suggest we leave the behavior as it is currently. However, I think we could perform a check when a path is generated to check the path matches the tree/release set and raise an exception when a mismatch occurs. We could also implement a flag on instantiation of SDSSPath that, if set, would do what you want and do a local environment path replacement rather than an os.expandvars. I think we could add that but make it opt in.