ukaea / PROCESS

PROCESS is a systems code at UKAEA that calculates in a self-consistent manner the parameters of a fusion power plant with a specified performance, ensuring that its operating limits are not violated, and with the option to optimise to a given function of these parameters.
https://ukaea.github.io/PROCESS/
MIT License

WIP: Resolve "Add solution comparison tool" - [opened] #2770

Open jonmaddock opened 1 year ago

jonmaddock commented 1 year ago

Merges 1823-add-solution-comparison-tool -> develop

Closes #1823

jonmaddock commented 1 year ago

added 1 commit

Compare with previous version

jonmaddock commented 1 year ago

added 1 commit

Compare with previous version

jonmaddock commented 1 year ago

added 1 commit

Compare with previous version

jonmaddock commented 1 year ago

added 2 commits

Compare with previous version

jonmaddock commented 1 year ago

added 3 commits

Compare with previous version

jonmaddock commented 1 year ago

added 1 commit

Compare with previous version

jonmaddock commented 1 year ago

added 1 commit

Compare with previous version

jonmaddock commented 1 year ago

added 2 commits

Compare with previous version

jonmaddock commented 1 year ago

added 1 commit

Compare with previous version

jonmaddock commented 1 year ago

@timothy-nunn I'd be grateful for your review. Thanks!

jonmaddock commented 1 year ago

added 2 commits

Compare with previous version

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 20, 2023, 15:43

Commented on requirements.txt line 26

Probably want to add this to the setup.py requirements too?

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 20, 2023, 16:33

Commented on process/io/plot_solutions.py line 376

I think https://docs.python.org/3/library/dataclasses.html#dataclasses.asdict is the "accepted" way of doing this, although that shouldn't matter here. Generally, though, I'm not a huge fan of raw-accessing dunder methods/attributes.
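
A minimal sketch of the suggestion, with a made-up dataclass (not one from plot_solutions.py): dataclasses.asdict goes through the public API, whereas reading __dict__ reaches into an implementation detail, even though both give the same result for a simple dataclass.

```python
from dataclasses import asdict, dataclass


# Hypothetical dataclass for illustration only
@dataclass
class Solution:
    tag: str
    objf: float


s = Solution(tag="baseline", objf=1.5)

# Public API: recursively converts the dataclass (and nested ones) to a dict
print(asdict(s))  # {'tag': 'baseline', 'objf': 1.5}

# Raw dunder access: same result here, but bypasses the documented interface
print(s.__dict__)  # {'tag': 'baseline', 'objf': 1.5}
```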

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 20, 2023, 16:35

Commented on process/data/scenarios line 1

I'm not sure I see the purpose of this file

jonmaddock commented 1 year ago

added 1 commit

Compare with previous version

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 20, 2023, 17:17

Commented on process/io/plot_solutions.py line 453

This section can be simplified by adding the var_names to the regex string. This can be done as follows, although this looks a bit hacky so we might want to find another solution:

    # Filter in tag by default
    # TODO Make tag in default df.filter
    if var_names is None:
        var_names = [TAG]

    # Filter for optimisation parameters (normalised to initial value
    # e.g. xcm001) values and names, objective function value and name
    filtered_results = results_df.filter(
        regex=f"{NORM_OPT_PARAM_VALUE_PATTERN}|{NORM_OPT_PARAM_NAME_PATTERN}|{NORM_OBJF_PATTERN}"
        + f'|{"|".join(var_names)}'.rstrip("|")
    )
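
A self-contained sketch of that suggestion, with hypothetical column names and a stand-in pattern constant (the real plot_solutions.py constants are not reproduced here): a single df.filter(regex=...) call can select both the pattern-matched columns and the explicitly named variables.

```python
import pandas as pd

# Stand-ins for the real module-level constants (illustrative only)
NORM_OPT_PARAM_VALUE_PATTERN = "xcm"
TAG = "tag"

df = pd.DataFrame(
    {"tag": ["run1"], "xcm001": [1.0], "xcm002": [0.5], "rmajor": [9.0]}
)

# Filter in tag by default, as in the suggestion above
var_names = None
if var_names is None:
    var_names = [TAG]

# One regex selects both the optimisation parameter columns and the named vars
filtered = df.filter(regex=f"{NORM_OPT_PARAM_VALUE_PATTERN}|{'|'.join(var_names)}")
print(list(filtered.columns))  # ['tag', 'xcm001', 'xcm002']
```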

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 20, 2023, 18:06

Commented on process/io/plot_solutions.py line 425

A few things:

  1. Indexing dataframes is something I tend to avoid; I think I have found a way to do this using just filters. See below.
  2. Pandas provides a pct_change method (sadly it makes the first row NaNs, hence the .fillna(0)).
from pandas import DataFrame, concat

data = [
    {"name": "entry1", "val1": 22.0, "val2": 1.0},
    {"name": "entry2", "val1": 44.0, "val2": 0.5},
]
columns_to_normalise = ["val1", "val2"]

df = DataFrame(data)

print(df)

df_to_normalise = df.filter(items=columns_to_normalise, axis="columns")
df_normalised = df_to_normalise.pct_change().fillna(0)

df_rest_of = df.filter(items=df.columns.difference(df_normalised.columns))

df_full_normalised = concat([df_normalised, df_rest_of], axis="columns")

print()
print(df_full_normalised)

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 20, 2023, 18:07

Halfway through, will continue tomorrow/next week!

jonmaddock commented 1 year ago

changed this line in version 12 of the diff

jonmaddock commented 1 year ago

added 4 commits

Compare with previous version

jonmaddock commented 1 year ago

Agreed, done.

jonmaddock commented 1 year ago

Quite right; I've corrected this.

jonmaddock commented 1 year ago

It's a symbolic link to the tests/regression/scenarios dir. This allows the test data outside the package dir to be packaged as package_data with process (as plot_solutions can require it), whilst avoiding keeping two copies of the data (one inside the package, one in tests) and having to keep them in sync. See the commit message for this change for clarification.
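
A minimal sketch of that layout (the scenario name here is made up, and the relative symlink target assumes the directory structure described above): a single copy of the data lives under tests/, reachable through the package path.

```shell
# Create a stand-in regression scenario outside the package dir
mkdir -p tests/regression/scenarios/demo_scenario
touch tests/regression/scenarios/demo_scenario/IN.DAT

# Symlink it into the package so setuptools can ship it as package_data
mkdir -p process/data
ln -s ../../tests/regression/scenarios process/data/scenarios

# The data is now reachable through the package path, with one copy on disk
ls process/data/scenarios/demo_scenario
```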

jonmaddock commented 1 year ago

Agreed, much better to use a single filter step rather than concatenating two. I can't think of a neater way of doing this: we need to filter on a load of different variable names, so the regex is going to look chunky. We could pass a regex string to _filter_vars_of_interest instead of the var_names list, but I'm not sure that's much better. I've made the change to use this approach.

jonmaddock commented 1 year ago

Thanks for pct_change(), I didn't know about that.

By indexing, do you mean using loc? Why is indexing dataframes considered bad practice? It's used extensively in pandas' own tutorials. Maybe I've missed something.

pct_change() is convenient, but we won't want to plot the NaN/0 row that it creates, so won't we need to exclude that row by indexing anyway? Also, I can't quite see how pct_change() could provide the percentage change relative to the first row only: it seems to provide a rolling percentage change between the current value and the value a fixed interval earlier, as is helpful for time-series data. We always want to compare percentage changes against the same row. For example, when we have 3 mfiles, mfiles 2 and 3 should be normalised relative to mfile 1. I'm not sure I see how pct_change() would do that.

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 24, 2023, 11:10

Commented on process/io/plot_solutions.py line 425

loc and iloc are things I've always been told to avoid, because they're generally only used in iteration (which is really slow in pandas). Generally, there are more obvious ways of indexing, e.g. filtering in the case of .loc[:, ...]. Although my observations above have maybe changed my mind: I'm not sure indexing by a boolean array will be any more obvious with filter; it would just achieve the same thing in fewer lines.

In the case where we have multiple mfiles (something I imagine I will realise as I continue this review), you are right that pct_change will probably not work very well. It just strikes me as something that should be possible in a much easier way; surely normalising a dataframe to a row is pretty common, right?
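
For what it's worth, normalising every row to the first row can be done without a loop or pct_change by dividing the frame by its first row. A sketch with made-up values, assuming all selected columns are numeric:

```python
import pandas as pd

# Three hypothetical mfiles; rows 2 and 3 should be normalised to row 1
df = pd.DataFrame({"val1": [22.0, 44.0, 11.0], "val2": [1.0, 0.5, 2.0]})

# Column-aligned division by the first row; the first row becomes all 1.0
normalised = df.div(df.iloc[0])
print(normalised)
```

Unlike pct_change, this compares every row against the same reference row, so it generalises to any number of mfiles.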

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 26, 2023, 10:01

Commented on process/data/scenarios line 1

So when the package is built it will copy the data (because of the line "process.data": ["scenarios/*/*"] in setup.py), and the data will be available at process/data/scenarios/<regression test>/IN.DAT, correct?

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 26, 2023, 10:09

Commented on process/io/plot_solutions.py line 475

Whilst I agree, I am also willing to concede that doing so over so few rows probably does not matter.

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 26, 2023, 10:27

Commented on process/io/plot_solutions.py line 411

Why do we not need an _np version of this row?

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 26, 2023, 10:28

Commented on process/io/plot_solutions.py line 395

iloc might be better here because we genuinely want the first row

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 26, 2023, 11:10

The y-axis of the plots should probably be labelled with something more descriptive than "Value".

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 26, 2023, 11:18

Hi Jon,

Other than the questions/comments above, I think this is ready to go in. The plots it produces are valuable and working. This problem is not a typical use of pandas, which might be the reason looping is required; I suggest we don't get hung up on this, as we are dealing with small amounts of data, so speed will not be an issue.

Tim