nanograv / pint_pal

A long-lived repository for NANOGrav Pulsar Timing workflows and analysis.
MIT License

Configuring the "data repository" location #58

Closed: rossjjennings closed this 1 month ago

rossjjennings commented 1 month ago

Currently, pint_pal interprets paths in YAML configuration files relative to the current working directory (i.e., the directory from which the notebook or script that uses pint_pal is run). This constrains how people can work: notebooks and scripts have to be run from a specific directory relative to the data files. It would be useful to be able to configure pint_pal to use a different "starting directory" when interpreting these paths, so that notebooks and scripts could be run from any location without having to change every YAML file.

There are three different use cases we need to think about here, which are related to the fact that pint_pal is designed to be used with an associated "data repository" containing .par, .tim, and .yaml files. These cases are:

  1. There is no data repository in use. You might want to use pint_pal this way if you are working on a single pulsar, or a few pulsars, in an ad-hoc way, independent of any dataset. Since pint_pal doesn't know anything about an associated data repository when it is installed, this is the case we have to assume we're in by default.
  2. There is a data repository in use, and the user is working from the "correct" location, so that paths in YAML files point to the right data files if they are interpreted as relative to the current working directory. This will probably be the case most of the time for people working on a project that uses a data repository.
  3. There is a data repository in use, but the user is working from an alternate location. In this case, the paths specified in the YAML files will point to the wrong locations if they are interpreted as relative to the current working directory, and it is tedious and error-prone to change the paths in each YAML file individually, especially when these changes might propagate to other users of the same data repository who aren't working from the same location.

Cases (1) and (2) work just fine with the current behavior of interpreting paths as relative to the current working directory. It's only in case (3) that you would want to configure pint_pal to interpret paths relative to a different folder, generally the root of the data repository. But in that case, it would be a useful feature to have.
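To make the difference between the cases concrete, here is a minimal sketch of the two ways a relative path from a YAML file could be resolved. It is not pint_pal code: the function name `resolve_config_path`, the `base_dir` parameter, and the example paths (including the placeholder pulsar name) are all assumptions made for illustration.

```python
from pathlib import Path

def resolve_config_path(relative_path, base_dir=None):
    """Resolve a path taken from a YAML file (hypothetical helper).

    With base_dir=None the current working directory is used, which matches
    the current behavior described above. An absolute relative_path is
    returned unchanged, because joining onto an absolute path discards the base.
    """
    base = Path.cwd() if base_dir is None else Path(base_dir)
    return (base / relative_path).resolve()

# Cases (1) and (2): no base given, so the working directory is used.
from_cwd = resolve_config_path("results/J0000+0000.par")

# Case (3): the user works elsewhere and names the repository root explicitly.
# "/path/to/data_repo" is a placeholder, not a real location.
from_root = resolve_config_path("results/J0000+0000.par", base_dir="/path/to/data_repo")
```

In cases (1) and (2) the first call already gives the right answer; only in case (3) does the explicit base matter.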

rossjjennings commented 1 month ago

There's an additional wrinkle here that arises in connection with #57, which proposes adding a configuration file that is specific to the data repository being used. If this is implemented, pint_pal will need to know where the data repository is in order to find the configuration file.

rossjjennings commented 1 month ago

Another reason pint_pal might need to know the data repository location is for testing. In that case, pint_pal currently works around not knowing where the data repository root is by assuming it is the parent directory of the directory containing the YAML configuration file being read.
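The workaround amounts to something like the following sketch. The function name `infer_data_root` and the `configs/` directory in the example path are hypothetical; the only thing taken from the description above is the "parent of the YAML file's directory" rule.

```python
from pathlib import Path

def infer_data_root(yaml_path):
    """Guess the data repository root from a YAML file's location.

    Assumes the YAML file sits exactly one directory below the root,
    e.g. <root>/configs/<pulsar>.yaml (the layout is an assumption).
    """
    yaml_dir = Path(yaml_path).resolve().parent  # directory containing the YAML file
    return yaml_dir.parent                       # its parent is taken as the root

# Placeholder path, for illustration only.
root = infer_data_root("/path/to/data_repo/configs/J0000+0000.yaml")
```

The guess breaks as soon as a YAML file is stored at a different depth, which is one reason an explicit setting would be more robust.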

rossjjennings commented 1 month ago

My proposed solution for this and #57 is to add a function set_data_root(), which could be used to specify the location of the data repository's root directory in a notebook or script. pint_pal would then look in that location for a configuration file specifying BIPM version, etc., for checks, and interpret paths in YAML files relative to that location. The default data root would be the current working directory.
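Roughly what I have in mind is sketched below. Only the name `set_data_root()` comes from the proposal; the module-level variable, the helpers `get_data_root()` and `resolve_in_data_root()`, the `None`-resets-to-default behavior, and the example paths are assumptions for illustration, and the eventual implementation may look different.

```python
from pathlib import Path

# None means "no data root configured": fall back to the working directory.
_data_root = None

def set_data_root(path):
    """Set the data repository root; pass None to restore the default."""
    global _data_root
    _data_root = None if path is None else Path(path).expanduser().resolve()

def get_data_root():
    """Return the configured root, or the current working directory by default."""
    return Path.cwd() if _data_root is None else _data_root

def resolve_in_data_root(relative_path):
    """Interpret a path from a YAML file relative to the data root (hypothetical helper)."""
    return get_data_root() / relative_path

# Usage in a notebook or script run from an arbitrary location.
# "/path/to/data_repo" and the file name are placeholders.
set_data_root("/path/to/data_repo")
par_file = resolve_in_data_root("results/J0000+0000.par")
```

Because the default is the current working directory, cases (1) and (2) would behave exactly as they do now unless `set_data_root()` is called.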

rossjjennings commented 1 month ago

Resolved by #63.