switch-model / switch

A Modern Platform for Planning High-Renewable Power Systems
http://switch-model.org/

Switch 2.0.5 (csv import/export) #119

Closed · mfripp closed this 5 years ago

mfripp commented 5 years ago

This pull request bumps some of the ideas from #115 down the road to 2.0.6, and fast-tracks the conversion to .csv as Switch 2.0.5. This is being moved ahead quickly because I will be doing training with a significant group of new users on Aug. 22, and it was really messy to explain which inputs and outputs should be .tab vs. .csv, .tsv or .txt. Now all inputs and outputs are .csv.

This also includes some of the code previously slated for 2.0.5 (up through 7/1/19). However, commits made to the next_version branch after that date will need to be merged later; this was done because some of those changes need further discussion before we can move ahead. Unfortunately, that leaves some uncontroversial changes out too. Maybe we can cherry-pick some of those into this release?

josiahjohnston commented 5 years ago

I just finished testing & pushing the implementation. Although this discussion isn't fully resolved, it's thankfully a minor edit for modules to call load_key_value_inputfile directly, rather than load_aug(..., singleton=True) calls that delegate to load_key_value_inputfile.
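(For reference, a minimal sketch of what a key/value loader along these lines could look like, assuming a tall csv with 'name' and 'value' columns; the signature and internals here are guesses based on this thread, not the actual implementation.)

```python
import csv

def load_key_value_inputfile(switch_data, path):
    # Sketch: read a tall csv of singleton parameters and store each
    # one in the DataPortal's data dict. Pyomo represents a scalar
    # (unindexed) param's data as {None: value}.
    with open(path) as f:
        for row in csv.DictReader(f):
            switch_data.data()[row["name"]] = {None: row["value"]}
```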

My point was that we're not using, nor do we plan to use, most of those behaviors for key/value files. If we were using those features, I'd agree with your general sentiment about avoiding duplicated code.

For .dat files, load_aug ignores all of its arguments except optional (i.e. it skips the auto_select/params arguments, auto-detection of optional params, and soon --input-aliases). Index-column auto-selection is also irrelevant in this context. Given that 90% of those behaviors don't apply here, and the goal of load_aug is to provide a convenient wrapper for underlying calls to load, it seems a less-than-ideal entry point.

I agree that the interface to load_aug could use streamlining & updating. I like your idea of having each param appear exactly once, in either mandatory_params or optional_params, and accepting the same style of arguments for both. auto_select defaulting to True sounds nice in theory, but I'd want to double-check the codebase for any exceptions.

I don't know if anyone else is using load_aug with custom modules outside our codebase. So, when we make those API changes to load_aug, I suggest implementing them in a new function with a better name, leaving load_aug the way it is, and having it issue a deprecation warning saying we'll be dropping load_aug in release 2.1. Just a reminder: semantic versioning recommends bumping the minor version number when we add new features and the patch version number for bugfixes. Given what's going into this release, I'm on the fence about calling it 2.0.5 or 2.1.
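A sketch of the deprecation shim suggested here (load_data is a placeholder for the hypothetical better-named function, not an agreed-upon name):

```python
import warnings

def load_data(switch_data, **kwargs):
    # Placeholder for the streamlined successor to load_aug.
    switch_data.load(**kwargs)

def load_aug(switch_data, **kwargs):
    # Keep the old entry point working, but signal that it is slated
    # for removal in release 2.1.
    warnings.warn(
        "load_aug is deprecated and will be removed in Switch 2.1; "
        "use load_data instead.",
        DeprecationWarning,
    )
    return load_data(switch_data, **kwargs)
```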

Re: more automation of documentation & input file mapping — those ideas sound nice. Putting input file paths into parameter & set specifications would limit people's ability to use other Pyomo DataPortal formats (like direct database connections, json files, etc.), but I haven't heard anyone asking for that, so probably not a big loss?

mfripp commented 5 years ago

I started replying to your earlier comment on commit e1f7999, but then had to go into a meeting.

I think we want singleton parameter files to support the auto_select and param arguments, auto-detection of optional params, and --input-aliases just as much as any other parameter files (although as I noted, I think auto_select and auto-detection of optional params should be the only behavior, not optional arguments, and param should eventually be implemented automatically by tagging each component with the name of the .csv it should be loaded from).

It is only by accident that we didn't support those capabilities with .dat files, and I don't see any argument for supporting them for most .csv files but not the singleton .csv files. I've certainly seen a mix of optional and mandatory parameters implemented in .dat files, and it's just as plausible for users to want to apply aliases to these files as to any others.

Really, the singleton .csv files are just data like the other .csvs, and should be treated the same way. We're just requiring them to be rotated, and adding code to support that, to address your concern about viewing wide vs. tall csvs. That makes the code, documentation and learning curve a little messier, but I'm OK with it, because we have to tell people something about singleton parameters anyway. So it might as well be "put singleton parameters in a vertical csv with headers of 'name' and 'value'" instead of "put singleton parameters in a csv with parameter names on the first row and values on the second row." But I don't want to go further and say, "by the way, you also have to use a different function to load them, and it doesn't accept the same arguments or offer the same options as the normal load_aug function."
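To make the two layouts concrete (parameter names are illustrative), the wide form puts parameter names on the first row and values on the second:

```
discount_rate,interest_rate
0.05,0.07
```

while the tall (rotated) form puts one name/value pair per row:

```
name,value
discount_rate,0.05
interest_rate,0.07
```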

josiahjohnston commented 5 years ago

This is increasingly a departure from the original .dat files, which did not require explicit a priori manifests. .dat files have built-in automatic detection and selection per their specification. AFAIK, DataPortal's support for .dat files doesn't allow requesting which parameters to look for. The limited arguments in load_aug for .dat files were a direct reflection of DataPortal's abilities and the file specification.

I was personally happy with .dat files and found them easy to read & write. I would have also been happy to concatenate them into a single long file of miscellaneous simple params instead of splitting them up into several different files.

I see one-line-per-argument as a convenience to both end-users and developers who have to read and debug input files. For developers who are writing new modules, I think it would be clearest to either provide a transparent wrapper around the standard DataPortal.load() function (like your initial implementation), or write a separate function to do simple parsing (like my current implementation). Mashing up the concepts adds confusion and is unappealing.

If you feel really strongly about this, then we can just revert my contributions and go with Pyomo's crappy format for stuffing simple parameters into wide and shallow csv files. I've seen other Pyomo models abuse this with dozens of parameters in an extremely awkward file, and I wanted to avoid that slippery slope into lower usability. So far, the only one of our modules that does this is your hydrogen model, which I have yet to use, so it won't be a direct encumbrance to me or the users I support. All of the core modules have only a handful of params, which aren't that bad in practice in wide format.

FTR, Pyomo supports usable and scalable specifications of simple parameters in .dat, .json and .yaml formats, but their csv style is bad for anything except simple cases.

mfripp commented 5 years ago

I don't mind departing from .dat files, because my goal here is to provide more simplicity, rather than more power. I want it to be as easy as possible to learn (and then predict) how to create and read input files. We are currently not there -- I've been using this for 4 years and I still have to copy and modify an existing load_aug() call every time I need a new one, and I still get it wrong half the time. And then I have to refresh my memory of AMPL's .dat language anytime I want to create a file for a singleton param (OK, I remember that much, but I still have to double-check by looking at an existing file), then go figure out what function to call to read it.
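(For reference, a scalar parameter in an AMPL-style .dat file looks roughly like this, with an illustrative name:)

```
param discount_rate := 0.05;
```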

So I am going for several principles here:

  1. Parameter values are read from columns with the same name in the relevant table.
  2. We ignore extra inputs that aren't needed (extra tables and extra columns, but usually not extra rows). I don't love this, but it allows flexibility when modules are turned on and off, while still enforcing some data consistency.
  3. Input columns (and tables) are optional if and only if the corresponding parameters have default values (there's really no need for users to explicitly specify which ones are optional)
  4. We may try using table aliases soon (applied in load_aug)
  5. As much as possible, singleton parameters are handled the same way as indexed parameters.

Those principles point to always using load_aug (or some other single thing) to read data, always using auto_select, never using optional_params, and making load_aug complain about missing tables if and only if it is trying to read a non-optional column from one and can't find the table (we may already be there, I can't remember).
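A rough sketch of how those rules could look in code (the function name and error handling are assumptions, not the actual load_aug internals; note that depending on the Pyomo version, a missing default may be reported as None or as Param.NoValue):

```python
import os

def load_table(switch_data, path, params):
    # Principle 3: a param (and its column) is optional iff it has a
    # default value.
    mandatory = [p for p in params if p.default() is None]
    if not os.path.isfile(path):
        if mandatory:
            raise IOError(
                "Missing input file {} needed for: {}"
                .format(path, [p.name for p in mandatory]))
        return  # every param has a default, so skip the table silently
    switch_data.load(filename=path, param=tuple(params))
```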

They also lead pretty directly to the wide csvs for singletons, but it's not a huge detour to use vertical csvs for singletons instead (or maybe even allow both, heaven help us). But using .dat files just doesn't fit in -- it's a whole separate data-specification and -loading scheme, just for parameters that happen to be singleton. It may make the inputs more human-readable, but that may not be a huge help since these are usually generated automatically and there are few very wide tables.

josiahjohnston commented 5 years ago

I'm hearing that you've had a lot of chronic frustration with input files, and that anything improving that process would have significant value to you. I'm a little surprised that you are bothered by copying+pasting+editing prior examples; to me that's an invaluable technique for speeding up development as well as for reducing cognitive load, repetitive stress from typing, and the rate of typo-based errors.

It sounds like reverting to the wide and flat tables is probably the least bad solution, given how much this bothers you and your perspective on API calls. The more I think about overloading load_aug with a custom parser for key/value csv files, the less I like it. The goal of load_aug was to provide a thin wrapper around DataPortal.load() while maintaining its core behaviors as much as possible, and overloading it with singleton=True to use a different parser is at odds with that goal. To me, a key coding principle is to use standard or pre-existing API calls and well-known libraries as much as possible, in order to maximize readability and minimize the learning curve and cognitive load for developers. Unfortunately, I came across that concept after drafting the early versions of this codebase, and shifting closer to that ideal is a slow process.

I appreciate your explanations of the principles you are working from. I have a different take on them, but it's helpful to understand your reasoning. For example, I appreciated that I didn't have to explicitly specify the contents of a key/value file when I wanted to load data from it, and specifying those contents (as in the wide format) is a little cumbersome. To me, key/value inputs are different from indexed tabular data; while mashing them together into a single format & DataPortal.load() parser is possible, it can rapidly get awkward and inconvenient. I also place a higher priority on human readability of input & output files than on developer convenience, especially since I expect many more labor-hours to be spent working with data files than with the calls that load them. To me, these principles pointed to either using a different DataPortal-supported format that is also human-readable (such as .dat, .yaml, or .json), or using a different parsing function for csvs that aren't directly parsable by DataPortal. But yeah, in practice our core key/value files only have a handful of params, so the wide format isn't a big deal for now. The slippery slope toward sacrificing human readability as these files grow is something we can worry about later.

Anyway, the least bad compromise solution seems to be wide csv format that DataPortal can directly parse. And moving forward, we'll be doing an overhaul of our DataPortal wrapper (load_aug) to make it more developer-friendly and hopefully give it a better name. Sound workable?

mfripp commented 5 years ago

Yes, that sounds workable.

I can't say the copy-paste caused me a whole lot of trouble, but having to do it was a sign that something wasn't right. Almost everything in these calls is invariant, so it should be as easy as load_inputs(filename="fuels.csv", params=[m.f_rps_eligible]) or load_inputs(filename="rps_targets.csv", index=m.RPS_YEARS, params=[m.rps_target]). Or, eventually just m.f_rps_eligible = Param(m.FUELS, within=Binary, default=False, input_file='fuels.csv'). Then there's no need for copy-and-paste, and less room for error.
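A sketch of how the component-tagging idea might eventually work (input_file is not an existing Pyomo keyword; this assumes a thin helper that stashes the file name on the component and a loader that walks the model):

```python
import os
from pyomo.environ import Param

def tag_input_file(param, filename):
    # Stash the csv name on the component for the loader below.
    param.input_file = filename
    return param

def load_tagged_inputs(model, switch_data, inputs_dir):
    # Load every Param that was tagged with an input file.
    for p in model.component_objects(Param):
        fname = getattr(p, "input_file", None)
        if fname is not None:
            switch_data.load(
                filename=os.path.join(inputs_dir, fname), param=p)

# Usage, mirroring the example above:
# m.f_rps_eligible = tag_input_file(
#     Param(m.FUELS, within=Binary, default=False), "fuels.csv")
```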

I currently create all the singleton input files from dictionaries with values assigned in my get_scenario_data.py scripts, so it doesn't make a difference to me whether they are vertical or wide. Eventually, some of these (e.g., the hydrogen parameters) might come from a database, where I might also have a wide table to hold them. Then it's an easy export to a wide .csv. Or I suppose people might want to create a tall table in the database, in which case it's the same debate again.

I can sort of see the argument for yaml or some other data format for key-value pairs, but then we're effectively back to the .dat files again. And I find it easier to say "if there's no index, just make a csv with one header row, one data row and no index column" rather than "if there's no index, put the key-value pairs in a .dat/.yaml/tall-csv file".

I think I'm going to release commit 6c3feb as 2.0.5 (after all my pre-release tests), so people can begin installing it today for the tutorial. I'll leave the tall-csv commits as a separate feature branch. And at some point we can rebase the next_release branch on the 2.0.5 release and then release that as 2.0.6, either before the tutorial (if it doesn't change the basic interface and data formats) or soon afterward.

I know semantic versioning says you should update minor version numbers for feature releases, but pretty much everything we release is a feature upgrade. So we would quickly move up to 2.5, 2.23, 2.71, etc. That may be fine, but it would require us to talk about "Switch 2" instead of "Switch 2.0". This may just be a consequence of (hopefully) having a faster release process than Python (where we'll be talking about "3.7" for quite a while yet). On the other hand, I don't think anyone thinks about which version of pandas, numpy or matplotlib they're using -- they just want the latest.

josiahjohnston commented 5 years ago

Sounds good.

I just pushed those rearrangements to various git branches.

I haven't deleted next_release yet, but we should plan on doing that shortly since it is superseded by 2.0.6.

Re: semantic versioning — projects that follow it and have active release schedules bump the minor version frequently, which is fine as far as I know. Agreed that this would prompt us to change our language from "Switch 2.0" to "Switch v2", but I don't see any benefit in delaying that transition. And personally, I usually have more confidence in software stability when the minor and patch versions are above 0.

I reviewed the commits since 2.0.0, and most of them were bug fixes, although there were a few new minor features thrown in. My 5 outstanding pull requests do qualify as new features. We could fudge things and say those features are too minor to bump the minor version, but I don't recommend doing that indefinitely.

I just re-reviewed https://semver.org/ and its section about deprecating functionality. Given that we are changing the input file specifications and effectively deprecating the .tab & .dat functionality, I think calling this version 2.1 instead of 2.0.5 would be a much clearer signal about the impacts, at least to people who understand semantic versioning and use it as shorthand. At the moment our audience of professional developers is small, so not following semantic versioning guidelines in this case probably isn't a big deal in practice.

josiahjohnston commented 5 years ago

PS. After the 2.0.6 branch passes a sniff test, I'll need to update several pull requests to that branch instead of the next_version branch.