switch-model / switch

A Modern Platform for Planning High-Renewable Power Systems
http://switch-model.org/

Allow extra data points in hydro & variable capacity factors so that … #109

Closed josiahjohnston closed 5 years ago

josiahjohnston commented 5 years ago

…scripts that make input files don't have to track when individual plants will be retired. Also, implement minimum data checking for hydro_simple.

This is one solution to issue #108

bmaluenda commented 5 years ago

This implementation is the most reasonable solution to issue #108. In my opinion, the user should not have to worry about trimming additional capacity factors that extend beyond a plant's life.

This looks fine at a glance. If I can find some time on Friday, I'll try testing it with different datasets.

mfripp commented 5 years ago

I'm not confident this is a great idea. At the least, users should be given a warning that they have provided data that won't be used. Otherwise it's easy for them to think they've properly specified project details for future years, and wonder why it's not working, or fail to notice it's not working. I assume this is why Pyomo and AMPL don't allow you to provide out-of-range parameter data. Also, users already need to be careful not to provide cost data for times when the project cannot be built (that's how we know it can't be built), so this requirement is just an extension of that.

More generally, I'm concerned that "give us whatever data you have and we'll take what we need" is a new paradigm that we haven't really thought through. Why not let them provide data for all days instead of just sample days, then implement the sampling in Switch? It's a great idea (maybe), but not the current approach. This recommended solution would fit better with that approach than with the current paradigm.

mfripp commented 5 years ago

For what it's worth, my first drafts of Switch 2.0 defined dispatch and commitment decisions for all timepoints for all projects, with constraints that forced them to zero outside the project life. That requires more memory, but would be tremendously simpler for users. But I deferred to Josiah who wanted to continue with the technique of reducing memory requirements by defining variables and constraints only for the active lifetime of each project.

Allowing/requiring users to provide variable capacity factors for all timepoints would make more sense if we were modeling all timepoints for all projects. But we already require users to be aware that we don't model all timepoints (i.e., understand the subtleties of GEN_TPS). Having them provide data only for the relevant timepoints is consistent with that.

josiahjohnston commented 5 years ago

I added some warning messages for extra datapoints.

I favored this approach because forcing the capacity factor input scripts to be aware of plant retirement dates and whether replacement capacity was included seemed beyond the scope of simple input scripts. It would also force the input script to independently calculate TPS_FOR_GEN (or at least PERIODS_FOR_GEN)... not exactly rocket science, but it's generally undesirable to duplicate code functionality, and that logic has several edge cases depending on whether or not you have an option to extend the life of the plant.
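A hypothetical sketch of the relaxed policy (plain Python, not actual Switch code; all names illustrative): keep only the capacity-factor rows that fall inside each project's active timepoints, and summarize what was ignored instead of raising an error.

```python
def filter_capacity_factors(rows, active_tps):
    """Keep rows inside each project's active life; warn about the rest.

    rows: iterable of (gen, timepoint, cap_factor) tuples.
    active_tps: dict mapping gen -> set of active timepoints.
    """
    kept, ignored = [], 0
    for gen, tp, cf in rows:
        if tp in active_tps.get(gen, set()):
            kept.append((gen, tp, cf))
        else:
            ignored += 1
    if ignored:
        # One summary line rather than one warning per dropped row.
        print(f"WARNING: ignored {ignored} capacity factor(s) "
              "outside active project lifetimes")
    return kept

rows = [("hydro_1", 2025, 0.6), ("hydro_1", 2035, 0.55), ("wind_1", 2025, 0.35)]
active = {"hydro_1": {2025}, "wind_1": {2025}}  # hydro_1 retires before 2035
kept = filter_capacity_factors(rows, active)
```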

I also thought this could make the software incrementally more usable. Many people get scared off while building input sets, and if we make that slightly more friendly and flexible, that could improve things.

I think this is different from gen build decisions because we don't use these timeseries to determine when a project can be built or operated. Hopefully that point isn't confusing to any users.

I don't know what you mean by "making users understand the subtleties of GEN_TPS". Can you elaborate?

Providing a large dataset and then sampling within Switch is a good idea, I think, but beyond the scope of this feature request. Many people struggle with sampling, and by pushing it outside of the codebase, we make everyone replicate solutions.

D just woke up; gotta run.

mfripp commented 5 years ago

OK, I can see that it could sometimes be difficult for users to generate the right variable capacity factors for existing plants, since whether to include them for a particular period depends on the lifetime specified for those plants. That means users have to learn and re-implement the logic about when those plants will be retired, and avoid providing variable capacity factors after the retirement date. On the other hand, I think it's pretty good discipline for users to know when the existing plants will be retired, and I don't see much problem with enforcing that on the front end. Otherwise, they may just get a much harder-to-debug (or unobserved) problem later, when those plants produce no output in the later years even though users have provided variable capacity factors.

This is not usually a problem with new facilities, because most of the time those can be built in any period, so users need to supply variable capacity factors for all periods (easy!).

My concern is that relaxing the rules here starts to create a mushy environment, where users can supply some extra data but not others. There's a reason AMPL and Pyomo refuse to accept unusable data; otherwise people may think they're setting up the model one way when it's really being interpreted differently.

I would welcome a more general solution, where we eliminate most of the specialized sets related to build years and retirement ages, and instead define generator properties and calculations for all timepoints. Then we restrict commitment and dispatch to zero before the first build date and after the last retirement date. This would make a lot of things a lot simpler (all the places where we use GEN_TPS or something similar, or check for membership in those sets, could just sum across all timepoints instead). However, we have not taken that approach. Instead we require users to know that there are all these sparse sets (GEN_TPS and others) that identify combinations of generator and timepoint that fall within the active life of projects, and then they have to rely heavily on those in all their calculations. Since users already have to do that, I don't think it's much of a stretch to say that they should only provide variable capacity factors that are within the active life of the projects they have specified.

On the other hand, I can see the converse points:

I'm not sure what the best way forward is for now. Warning people when they give unusable data seems like it may not be the best path. I would think we should either ignore the data entirely, or declare it an error (as we do now). We could give a more descriptive error if a lot of people are having a hard time figuring this out. Is this the most common or most confusing case where people get an error about excess data? Anyway, I'm not sure we want to start down the road where everything is accepted but we scold users about things we don't like. Then people will just have a messy experience running Switch. (Although maybe that's better than the current, somewhat unforgiving experience!)

mfripp commented 5 years ago

OK, after thinking about this a little more, I think I can live with this arrangement. We just have to think of variable_capacity_factors.tab as a weather file rather than a generator properties file. Or equivalently, it describes how much the plant could produce if it were set up to run during that timepoint.

In the long run, I am very tempted to move toward a simpler, more general approach, where we define dispatch and commitment variables for every project in every timepoint, and just constrain them to zero when the plant is retired. That would fit well with this.

I'd also like to get rid of the gen_is_variable flag, allow specifying gen_max_capacity_factor for any project, and give it a default of 1. Then, users can just specify gen_max_capacity_factor whenever it's relevant, i.e., for variable projects or cogen with fixed output. We can reduce memory requirements for that approach by skipping the constraint when gen_max_capacity_factor is 1.

I'm also tempted to rename all the input files based on their indexing set, so anything that is about a generation project goes in one file, anything that pertains to generation project and timepoint goes in another (e.g., max and min capacity factors, commitment requirements, etc.). I know it's (sort of) hard to merge data from different sources into a single table when generating the inputs, but we've de facto settled on this anyway, at least in part. Some additional factors get added to existing tab files, while others get their own separate tab file. I think it would be easier for users to predict where data should go if there's one tab file per type of entity, just like a normalized database.

But we can leave all that for later.

mfripp commented 5 years ago

But I think I'll roll back the warning about unneeded data. This should probably either be OK or not OK. No point setting people up to just keep seeing warnings every time they run it. Maybe in the longer run we could have different logging levels -- quiet, verbose, diagnostic -- and we could warn about stuff like this if they ask for diagnostic warnings.

mfripp commented 5 years ago

I merged the relaxed treatment of excess renewable and hydro data into the 2.0.3 branch and left the warnings on the extra_data_points branch. We can reopen this later if we want to allow users to request diagnostic data on their inputs. Or we can just be happy saying these files provide weather data, which must at least span the active timepoints for projects.

josiahjohnston commented 5 years ago

I like having some warning for this, but it was producing way too many lines in day-to-day testing in my current use case. I was thinking about summarizing warnings with one log line per module, and pushing the details to a separate file for offline examination and/or troubleshooting. Letting users specify logging levels might work for that too, and would simplify the question of where to store said file.

I've been using python's built-in logging module in other projects for the past year or two, and have been pretty happy with it and its shallow learning curve. It's pretty easy to set up a command line argument for verbosity. I'd suggest defaulting to a policy of showing warnings and above, and warnings should be limited to a summary line per issue. More detailed info can go under a lower logging level (probably INFO rather than DEBUG) that would be filtered out by default.
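A sketch of that setup using the standard argparse and logging modules (the module name and flag spelling are just examples, not anything Switch currently defines):

```python
import argparse
import logging

parser = argparse.ArgumentParser()
# Repeatable flag: default shows WARNING and above; -v adds INFO, -vv adds DEBUG.
parser.add_argument("-v", "--verbose", action="count", default=0)
args = parser.parse_args([])  # empty list for this demo; real code would omit it

level = {0: logging.WARNING, 1: logging.INFO}.get(args.verbose, logging.DEBUG)
logging.basicConfig(level=level, format="%(levelname)s: %(message)s")

# Hypothetical per-module logger name, following the package layout.
log = logging.getLogger("switch_model.generators.hydro_simple")
# One summary line at WARNING; per-row details only at INFO.
log.warning("ignored extra capacity factor rows (rerun with -v for details)")
log.info("detail lines filtered out at default verbosity")
```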

Would you be favorable to a pull request that implements that logging approach?

I know complete cross-product sets with zero constraints are a common pattern in mathematical programming, which could help lower the learning curve for some people. Personally, I find sparse sets easier to work with and think through, but I don't know how many others share my cognitive approach. From a pragmatic perspective, I thought we documented significant RAM savings from using sparse sets in general, which had important implications since high-resolution problems are frequently memory-limited.

mfripp commented 5 years ago

I think with this particular issue I was hung up on the question of "is it OK to provide extra data or not"? If it's OK and we don't mind if some users routinely do so, then we shouldn't emit a bunch of warnings about it. If we'd prefer users don't provide extra data, then we should probably prevent them from doing so. That's where I came up with the idea of "diagnostic" warnings, i.e., "we don't mind if you provide extra data, but if you really want to know, you'd be cooler if you didn't" (kind of like PEP-8 linting, I guess). But I'm inclined to leave that on the back burner unless we come up with more instances of user behavior that is allowed but not recommended.

We may at some point have been worried about the extra RAM from keeping projects active at all timepoints, but I don't really think it would be too bad, since most projects persist through all timepoints. I conceded on the basis that even if we took that simpler approach, we were still going to need to have a similar amount of complexity to do the online-capacity and fixed-cost calculations. But having worked with it for a few years, I think I'm back to my original view that it would be a lot easier if most elements were defined for every timepoint, and we just let online capacity go to zero before and after the project life.

There would be bigger RAM implications for setting min/max capacity factor for every project every timepoint, eliminating the variable/non-variable distinction (another pet project). But I think those can be addressed by using the default values (0/1) when not needed and skipping the constraints when the default values are encountered. Then users can just supply min/max capacity factors when and where they want, and the RAM requirements will be the same or lower than now.

I think we should do something better with logging in general, and it might be nice to introduce different levels of verbosity (1, 2, 3, like cplex?). I thought of updating the logging as part of the Python 3 upgrade, since we were cleaning up the print statements anyway. But it turned out cleaning up print statements was easy, and I wanted this version to have as few feature-changes as possible, to help us focus in on any Python 3 issues. But maybe we could do a better logging system for 2.0.5 or after? You might want to wait till 2.0.4 settles first, at any rate. And we should probably talk through the framework with a couple of examples before implementing the whole thing.