switch-model / switch

A Modern Platform for Planning High-Renewable Power Systems
http://switch-model.org/

Precise versioning with local branches #118

Open josiahjohnston opened 5 years ago

josiahjohnston commented 5 years ago

This enables clear records of local versions of software, which can be invaluable during R&D for customizations. For example, let's say I check out a current copy of the development branch, then add new modules and customize behavior to deal with edge cases and subtle bugs. Each commit I make may result in different solutions for the same dataset, but if every version is labeled as v2.0.4, I lack a clear record of which scenarios I need to re-execute, or how I generated a particular set of results.

PEP 440 explains the concept of local identifiers for this type of use case. In the development environment of my example, installing a copy of switch via pip install path/to/checkout will update the version from 2.0.4 to 2.0.4+[git_sha], or if I have uncommitted changes in the repository, it will be 2.0.4+[git_sha]+localmod. If the current git checkout is tagged as a release (having a git tag starting with 2 in our case), then the local modifier suffix is dropped.
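The naming rule described above can be sketched as a small pure function. This is an illustration only: `compose_version` and its parameters are hypothetical names, not taken from the Switch codebase, and the git SHA is passed in rather than looked up. The thread does not say what happens when a release-tagged checkout also has uncommitted changes; this sketch follows the text literally and drops the suffix whenever the checkout is tagged as a release.

```python
def compose_version(base_version, git_sha=None, is_dirty=False, is_release_tag=False):
    """Build a version string using the scheme described in the comment above.

    All names here are hypothetical; this is not the PR's implementation.
    """
    # A checkout tagged as a release keeps the plain base version.
    if is_release_tag:
        return base_version
    # Without git information there is nothing to append.
    if git_sha is None:
        return base_version
    version = f"{base_version}+{git_sha}"
    # Flag uncommitted changes, using the suffix quoted in the thread.
    if is_dirty:
        version += "+localmod"
    return version


print(compose_version("2.0.4", "abc1234"))
print(compose_version("2.0.4", "abc1234", is_dirty=True))
print(compose_version("2.0.4", "abc1234", is_release_tag=True))
```

One caveat: PEP 440 allows a single `+` and separates segments of the local label with periods, so a strictly conforming form of the uncommitted-changes case would be 2.0.4+[git_sha].localmod.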

This implementation should have no impact on "quickstart" instructions that install from pypi or conda repositories.

This implementation will try to find the precise local version (relies on git being installed), and write it into switch_model/data/installed_version.txt in the installed package directory. If the attempt to call a git subprocess fails, it will print a warning and provide the base version which is recorded in switch_model/version.py. version.py will attempt to load installed_version.txt from the data directory and will return that string if available; if unavailable, version.py will return the hard-coded version number. Finally, the version is written to the outputs directory to ensure a clear record for archival purposes. This version number is accessible via a) the pip catalog, b) switch --version, c) switch_model.__version__, and d) outputs/software_version.txt.
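The lookup-with-fallback behaviour described above might look roughly like the following. It is a minimal sketch under stated assumptions, not the code in this PR: `load_version` and `record_version` are hypothetical helpers, the data directory is passed in explicitly, and only the file names installed_version.txt and software_version.txt come from the description.

```python
import os

BASE_VERSION = "2.0.4"  # hard-coded fallback, matching the example in this thread


def load_version(data_dir, base_version=BASE_VERSION):
    """Return the stamped version if installed_version.txt exists, else the base version."""
    path = os.path.join(data_dir, "installed_version.txt")
    try:
        with open(path) as f:
            stamped = f.read().strip()
    except OSError:
        # File missing or unreadable: fall back to the hard-coded number.
        return base_version
    # An empty file is treated the same as a missing one.
    return stamped or base_version


def record_version(outputs_dir, version):
    """Write the version next to the model outputs for archival purposes."""
    os.makedirs(outputs_dir, exist_ok=True)
    with open(os.path.join(outputs_dir, "software_version.txt"), "w") as f:
        f.write(version + "\n")
```

Because a missing or empty file simply yields the base version, an install from PyPI or conda that never writes the file would behave exactly as before.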

I've used this pattern successfully in other software for scientific computing & medical devices, and it has been a life-saver. The code used here has worked effectively in Mac & Linux environments, and can be compatible with docker packaging. It could use validation in a Windows environment (minimally a basic sniff test), but since it is a nonessential add-on that fails gracefully, I expect it could be integrated even if it doesn't work seamlessly in all development environments.

Additionally, I think we would be better served if pre-release branches update the hard-coded version from 2.0.4 to 2.0.4+next_release, or a similar indication that it isn't a packaged release, and hasn't received the same degree of scrutiny.

josiahjohnston commented 5 years ago

One upshot of this commit is that installing in developer mode with --editable would be contraindicated if one wished to maintain clean records. That said, if you are doing quick edit/test cycles without committing, it wouldn't matter, since clean records would be superfluous and this won't help track snapshots of uncommitted code (it just flags them as uncommitted).

mfripp commented 5 years ago

Hmm, I definitely need to think a little about this. A few points:

  1. In practice, it seems like publicly published models can usually just use a particular released version of Switch. We should try to make those often, so new features don't languish in unreleased software for long, but this should be workable, certainly simpler than trying to peg a published model to a particular commit.
  2. On the other hand, if we do want to peg a published model to a particular git commit, it is possible to do that, either by giving instructions to checkout that commit, or by including a copy of Switch in the model's repository as a submodule.
  3. Your changes seem to be focused more on re-running models as you update pre-released versions of Switch.
  4. This seems like a lot of extra stuff in general, and especially a lot of stuff to support that specialized use case (e.g., a new data directory within switch_model, which won't even be writeable in many cases, as well as a whole new numbering system). I would expect most users to use a vanilla version of Switch, and create their own custom modules in the study directory that can be managed along with the rest of the study data.
  5. Can this particular use case be managed differently, e.g., by using make to run switch to recreate your outputs, with a dependency on various modules and data files?
  6. I think the best practice is to update the version number whenever a branch is created for the next version. That way we can write and test the data upgrade scripts along with the rest of the code for that version. However, this only really works with a linear commit path. I don't know how we would handle data upgrades within feature branches. It may automatically be OK, as long as they all branch off the next version branch, rather than the master branch.
  7. I think the master branch in the main repository should always correspond to the currently released version of Switch. Users should always beware that anything other than master is prerelease. Again, as long as we merge feature/next-version branches into master and release them often, this shouldn't inconvenience colleagues who depend on near-cutting-edge features. And if they need prerelease software, they can just checkout that particular feature branch (possibly from a forked repository), install as developer, and pull as needed.

I haven't really thought through how all this relates to what you're doing in this branch, but I at least wanted to share my initial reactions.

josiahjohnston commented 5 years ago

Thanks for the quick feedback.

Re: Points 1-3 Yup, the local version suffix is primarily focused on automatically and accurately tracking code and results during the course of active development. It's also applicable to custom branches that never make it into the master branch.

For people who stick to official releases, the only impact will be an unambiguous record of which version of Switch was used to make their results, and a clear indication of whether they accidentally wandered into a branch that diverged from an official release.

In other projects, I've found accurately tracking (and recording) local versions to be invaluable for expedient troubleshooting and retrospectively understanding how results change as code evolves. While local versioning can be helpful for releasing results for a study (like the pegged git checkout strategy you describe), I've primarily used it for maintaining good records internally.

As far as I can remember, every study I've done or collaborated on has required some code customizations, only a subset of which ever made it into a master branch. This is both with Switch v2 & v1. I expect the only exceptions to the need for custom branches will be if every edit that is needed for a particular study is accepted into the master branch and tested for backwards compatibility (easier to guarantee if working solo or unilaterally, harder if working on a shared codebase).

Even in cases where people primarily wished to adjust inputs of an established study (like the Rhodium Group's extension of a Hawaiian study), they still required custom exports and other tweaks. As the codebase evolves and matures, the need to push the boundaries may reduce, but I don't expect it to fully disappear.

Re: Point 4 Yes, this has some extra stuff, but it all conforms to Python standards. And yes, package data is intended to be write-once during install. Data directories are another concept I've come to value from other projects, and are great for things like this, default configuration files, test data, or other data assets that commonly accompany a software project. There are other styles of setting up python data directories, but this is by far the most stable that I've found after considerable research and testing.

For people who use official releases of Switch, this will have no impact on the version they see.

While Switch 2.0 makes it possible to do any customization by writing new modules outside of the switch_model package (including copy + edit of core modules), I generally recommend learning git and committing to a branch because:

Re: Point 5 No, using makefiles to peg specific versions is impractical. The goal is to have clear records of which version of a moving codebase I've used for a particular run (without making me do an extra step of copying & pasting a manual record), not to retroactively give a recipe for replicating results after I've finished everything.

Re: Point 6 Agreed, but I would use a base version number like 2.0.5-alpha for the pre-release branch that follows 2.0.4, rather than 2.0.5. In this use case, I also prefer either an automated local versioning system as implemented here, or other automated tools for version incrementing that will bump the alpha suffix from .0 to .1, etc. before every git commit. Although, as you pointed out, a sequential versioning system breaks down with non-linear branches and merges. The local versioning system in conjunction with a reasonable base version addresses these complexities better than other approaches I've read about to date.

Re: Point 7 That isn't my top choice, but I could live with that. I prefer to have master be the branch that is moving towards the next release, and rely on tags to specify which specific versions are releases. If people want released versions, they should install from conda/pypi repos, or nab a tagged checkout for a particular release. That's the pattern I've seen & worked with most in various github projects.

josiahjohnston commented 5 years ago

I forgot to respond to the data upgrade issue. I see support for data upgrades & backwards compatibility as strictly limited to official sequential releases. I don't see a need for data upgrades on side branches with any of the use cases I'm familiar with, and wouldn't be able to comment on the feasibility of that without understanding specific use cases.

mfripp commented 4 years ago

Finally getting back to this pull request, and I forgot we even had this much discussion of it. I'll check back to your comments above, but after looking at the code, I'm inclined to simplify this a lot:

This simpler version would support

mfripp commented 4 years ago

@josiahjohnston, I think we have two fairly different workflows for using Switch, so I'm looking for something that will work for both. To do that, it would help to know a little more about your workflow.

The code in this branch seems to assume that you will run python setup.py or pip from inside the local git repository for your development copy of Switch, and that this will make a copy of the switch_model package in another location. This makes sense if you are using a virtual environment — you'd first create and activate the environment, then cd to the Switch repository, then run pip or setup.py there. But I wanted to ask, do you use that same workflow somehow for your Docker containers? Those seem to be self-contained, so I'm wondering how you migrate code from the Switch repository into them. Do you mount the host file system inside the Docker container, then cd to the Switch repository and run pip or setup.py to copy the code into the container? Or do you have some other procedure?

By the way, my workflow is generally to have one environment that I use for most active models, and I use pip install --editable . to give it access to the main Switch repository. It's kind of fast and loose, but it allows rapid turnaround between revising and running the model. For older models, I can then install a matching release of Switch, or possibly even a matching commit.

In your workflow, the git repository is visible when you run setup.py, but it's not visible at runtime (and the code doesn't change after you run setup.py). So you need to stamp the installation with the git status. In my workflow, the code may change after I run setup.py, but my code can see the git repository at runtime (that's where it runs from). So I can/must check the git status at runtime, and I don't want the installation to be pre-stamped with a local version number. I think it's possible to reconcile these, but I need to be clearer on how you're using Docker.
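One way the two workflows might be reconciled is to ask git at runtime when a repository is visible and otherwise fall back to whatever was stamped at install time. The sketch below illustrates that idea only; `runtime_version` is a hypothetical helper, not part of this branch, and it assumes the `git` executable is on the PATH when a repository is present.

```python
import subprocess


def runtime_version(package_dir, stamped_version):
    """Prefer live git status from package_dir; fall back to the stamped version."""
    try:
        described = subprocess.check_output(
            ["git", "describe", "--tags", "--always", "--dirty"],
            cwd=package_dir,
            stderr=subprocess.DEVNULL,
            text=True,
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        # git is missing, the directory does not exist, or it is not a repository.
        return stamped_version
    return described or stamped_version
```

With a developer install, `package_dir` sits inside the Switch repository and the git description wins; with a copied install, the call fails and the stamped string is returned. One caveat is that a copied install living inside some unrelated git repository would report that repository's status, so a real implementation would need a guard against that.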

josiahjohnston commented 4 years ago

I used virtual environments instead of docker containers. Docker containers were the next step up. I tried offering to set those up while I was still working on this in a professional capacity (not sure if I communicated that intent well), but never got around to that. It's not clear to me if that would help usability with the target user base. Dockerfiles are easy enough to set up, and I might be able to pull one together if I stayed up late some night.

If you went with dockerfiles, then docker build would be analogous to pip install in a virtual environment, except you'd have archives of your prior builds. Each docker build could have its own uniquely tagged version, and you could keep as many of those as you wanted. If set up properly, they'd all share the same underlying layers, so additional builds wouldn't take up much disk space. I'd probably set up a script to encapsulate the docker build command and tag each image with the precise version number.
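A wrapper script of the kind mentioned above could start from something like this. It is a hedged sketch: the image name `switch` and the function `docker_build_command` are hypothetical, and the function only assembles the command rather than running it. Docker image tags may contain only letters, digits, underscores, periods and hyphens, so the `+` in a local version has to be replaced.

```python
import re


def docker_build_command(image_name, version, context_dir="."):
    """Assemble a `docker build` command that tags the image with the version."""
    # Replace characters that are not valid in a Docker tag (such as "+") with "_".
    tag = re.sub(r"[^A-Za-z0-9_.-]", "_", version)
    return ["docker", "build", "-t", f"{image_name}:{tag}", context_dir]


# Hypothetical image name and version, for illustration only.
print(docker_build_command("switch", "2.0.4+abc1234"))
```

The resulting list could be passed to `subprocess.run` from within the repository checkout.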

Yup, you are right about the impacts of --editable. That's fast and loose and has no way of tracking what code produced a given set of results. Fine for quick iterations where you are tracking a few things in your head. Bad for archiving results and reconstructing them later. This traceability is important for quality lab notebooks, research publications or public proceedings since tiny changes to formulations can lead to big changes in outputs. This is especially true for people who don't have PhDs in energy modeling and 15+ years experience writing & critically interpreting them.

Yup, data upgrade support wouldn't and shouldn't be applied to the precise versions that only differ in the git hash suffix. That functionality only applies if you bother bumping the version number.

All that being said, most people I've worked with are sloppy about git repos and traceability. I keep hoping people will up their game, possibly with the aid of data science curriculums and "Best Practices in Scientific Computing", but that's probably too optimistic. I regard this functionality as crucial for traceability & reproducibility for scientific computing. This is especially important in planning major long-term societal investments and the fate of our planet with global warming, since minor changes to models can produce wildly different results (whether by intention or accident), long-term models are often not numerically stable, and inputs have large uncertainties (both for present and long-term forecasts). But if most practitioners never bother to go through systematic processes, and most published policy papers on energy models decline to release their datasets or code, then I don't know if this feature matters from a practical perspective. And if your use cases involve releasing code and final runs with a single version of code, without needing traceability in your intermediate runs because you are that good, then maybe this isn't useful for you either.

I don't know what changes you are proposing or how that would impact things I used to use on a day-to-day basis to solve my pain points. I'm not working with this codebase in a professional capacity now and don't have the bandwidth to contribute in any real way, or get a deeper dive into how active or hypothetical energy modelers will use this software. If this PR seems useful to you or other users, then keep. If not, do whatever seems useful. If I manage to return to this in the future, I'll take a look at the outcome and can always restore portions that I need for my process & workflow.

mfripp commented 4 years ago

Thanks, that's good to know. I may postpone this for now because it's getting complicated. For later reference, I think there is a strategy that could meet both of our needs (stamping a copy of Switch with repository status while copying it into a virtual environment, and also retrieving repository status directly from a developer install of Switch):

I'm a little unsure how this fits with distributions though. PyPI uses wheels, which could potentially be stamped with repository info during the build process. If the repository info is then reported as part of the version number, it may prevent the wheel from uploading to PyPI (probably a good thing). If it isn't, then we can freely upload a dev or final version without worrying about whether it has been committed to the repository yet (maybe a good thing, maybe not). On the other hand, the conda-forge package builds from the source distribution on PyPI. I don't think this goes through a 'build' phase before it is uploaded, so I'd need to find some other hook to stamp the source distribution.