ukaea / PROCESS

PROCESS is a systems code at UKAEA that calculates in a self-consistent manner the parameters of a fusion power plant with a specified performance, ensuring that its operating limits are not violated, and with the option to optimise to a given function of these parameters.
https://ukaea.github.io/PROCESS/
MIT License

WIP: Resolve "Add solution comparison tool" - [opened] #2770

Open jonmaddock opened 1 year ago

jonmaddock commented 1 year ago

Merges 1823-add-solution-comparison-tool -> develop

Closes #1823

jonmaddock commented 1 year ago

added 1 commit

Compare with previous version

jonmaddock commented 1 year ago

added 1 commit

Compare with previous version

jonmaddock commented 1 year ago

added 1 commit

Compare with previous version

jonmaddock commented 1 year ago

added 2 commits

Compare with previous version

jonmaddock commented 1 year ago

added 3 commits

Compare with previous version

jonmaddock commented 1 year ago

added 1 commit

Compare with previous version

jonmaddock commented 1 year ago

added 1 commit

Compare with previous version

jonmaddock commented 1 year ago

added 2 commits

Compare with previous version

jonmaddock commented 1 year ago

added 1 commit

Compare with previous version

jonmaddock commented 1 year ago

@timothy-nunn I'd be grateful for your review. Thanks!

jonmaddock commented 1 year ago

added 2 commits

Compare with previous version

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 20, 2023, 15:43

Commented on requirements.txt line 26

Probably want to add this to the setup.py requirements too?

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 20, 2023, 16:33

Commented on process/io/plot_solutions.py line 376

I think https://docs.python.org/3/library/dataclasses.html#dataclasses.asdict is the "accepted" way of doing this, although that shouldn't matter here. Generally, though, I'm not a huge fan of raw-accessing dunder methods/attributes.
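
A minimal sketch of the suggestion, with a made-up dataclass (not one from plot_solutions.py): dataclasses.asdict goes through the public API, whereas reading __dict__ reaches into an implementation detail, even though both give the same result for a simple dataclass.

```python
from dataclasses import asdict, dataclass


# Hypothetical dataclass for illustration only
@dataclass
class Solution:
    tag: str
    objf: float


s = Solution(tag="baseline", objf=1.5)

# Public API: recursively converts the dataclass (and nested ones) to a dict
print(asdict(s))  # {'tag': 'baseline', 'objf': 1.5}

# Raw dunder access: same result here, but bypasses the documented interface
print(s.__dict__)  # {'tag': 'baseline', 'objf': 1.5}
```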

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 20, 2023, 16:35

Commented on process/data/scenarios line 1

I'm not sure I see the purpose of this file

jonmaddock commented 1 year ago

added 1 commit

Compare with previous version

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 20, 2023, 17:17

Commented on process/io/plot_solutions.py line 453

This section can be simplified by adding the var_names to the regex string. This can be done as follows, although this looks a bit hacky so we might want to find another solution:

    # Filter in tag by default
    # TODO Make tag in default df.filter
    if var_names is None:
        var_names = [TAG]

    # Filter for optimisation parameters (normalised to initial value
    # e.g. xcm001) values and names, objective function value and name
    filtered_results = results_df.filter(
        regex=f"{NORM_OPT_PARAM_VALUE_PATTERN}|{NORM_OPT_PARAM_NAME_PATTERN}|{NORM_OBJF_PATTERN}"
        + f'|{"|".join(var_names)}'.rstrip("|")
    )
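
A self-contained sketch of that suggestion, with hypothetical column names and a stand-in pattern constant (the real plot_solutions.py constants are not reproduced here): a single df.filter(regex=...) call can select both the pattern-matched columns and the explicitly named variables.

```python
import pandas as pd

# Stand-ins for the real module-level constants (illustrative only)
NORM_OPT_PARAM_VALUE_PATTERN = "xcm"
TAG = "tag"

df = pd.DataFrame(
    {"tag": ["run1"], "xcm001": [1.0], "xcm002": [0.5], "rmajor": [9.0]}
)

# Filter in tag by default, as in the suggestion above
var_names = None
if var_names is None:
    var_names = [TAG]

# One regex selects both the optimisation parameter columns and the named vars
filtered = df.filter(regex=f"{NORM_OPT_PARAM_VALUE_PATTERN}|{'|'.join(var_names)}")
print(list(filtered.columns))  # ['tag', 'xcm001', 'xcm002']
```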

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 20, 2023, 18:06

Commented on process/io/plot_solutions.py line 425

A few things:

  1. Indexing dataframes is something I tend to avoid; I think I have found a way to do this using just filters. See below.
  2. Pandas provides a pct_change method (sadly it makes the first row NaNs, hence the .fillna(0)).
from pandas import DataFrame, concat

data = [
    {"name": "entry1", "val1": 22.0, "val2": 1.0},
    {"name": "entry2", "val1": 44.0, "val2": 0.5},
]
columns_to_normalise = ["val1", "val2"]

df = DataFrame(data)

print(df)

df_to_normalise = df.filter(items=columns_to_normalise, axis="columns")
df_normalised = df_to_normalise.pct_change().fillna(0)

df_rest_of = df.filter(items=df.columns.difference(df_normalised.columns))

df_full_normalised = concat([df_normalised, df_rest_of], axis="columns")

print()
print(df_full_normalised)

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 20, 2023, 18:07

Halfway through, will continue tomorrow/next week!

jonmaddock commented 1 year ago

changed this line in version 12 of the diff

jonmaddock commented 1 year ago

added 4 commits

Compare with previous version

jonmaddock commented 1 year ago

Agreed, done.

jonmaddock commented 1 year ago

Quite right; I've corrected this.

jonmaddock commented 1 year ago

It's a symbolic link to the tests/regression/scenarios dir. This allows the test data outside the package dir to be packaged as package_data with process (as plot_solutions can require it), whilst avoiding keeping two copies of the data (one inside the package, one in tests) and having to keep them in sync. See the commit message for this change for clarification.
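
A minimal sketch of that layout (the scenario name here is made up, and the relative symlink target assumes the directory structure described above): a single copy of the data lives under tests/, reachable through the package path.

```shell
# Create a stand-in regression scenario outside the package dir
mkdir -p tests/regression/scenarios/demo_scenario
touch tests/regression/scenarios/demo_scenario/IN.DAT

# Symlink it into the package so setuptools can ship it as package_data
mkdir -p process/data
ln -s ../../tests/regression/scenarios process/data/scenarios

# The data is now reachable through the package path, with one copy on disk
ls process/data/scenarios/demo_scenario
```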

jonmaddock commented 1 year ago

Agreed, much better to use a single filter step rather than concatenating two. I can't think of a neater way of doing this: we need to filter on a load of different variable names, so the regex is going to look chunky. We could pass a regex string to _filter_vars_of_interest instead of the var_names list, but I'm not sure that's much better. I've made the change to use this approach.

jonmaddock commented 1 year ago

Thanks for pct_change(), I didn't know about that.

By indexing, do you mean using loc? Why is indexing dataframes considered bad practice? It's used extensively in pandas' own tutorials. Maybe I've missed something.

pct_change() is convenient, but we won't want to plot the NaN/0 row that it creates, so won't we need to exclude that row by indexing anyway? Also, I can't quite see how pct_change() could provide the percentage change relative to the first row only: it seems to provide a rolling percentage change between the current value and the value a fixed interval earlier, as is helpful for time-series data. We always want to compare percentage changes against the same row. For example, when we have 3 mfiles, mfiles 2 and 3 should be normalised relative to mfile 1. I'm not sure I see how pct_change() would do that.

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 24, 2023, 11:10

Commented on process/io/plot_solutions.py line 425

loc and iloc are things I've always been told to avoid, because they're generally only used in iteration (which is really slow in pandas). Generally, there are more obvious ways of indexing, e.g. filtering in the case of .loc[:, ...]. Although my observations above have maybe changed my mind: I'm not sure indexing by a boolean array will be any more obvious with filter; it would just achieve the same thing in fewer lines.

In the case where we have multiple mfiles (something I imagine I will realise as I continue this review), you are right that pct_change will probably not work very well. It just strikes me as something that should be possible in a much easier way; surely normalising a dataframe to a row is pretty common, right?
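
For what it's worth, normalising every row to the first row can be done without a loop or pct_change by dividing the frame by its first row. A sketch with made-up values, assuming all selected columns are numeric:

```python
import pandas as pd

# Three hypothetical mfiles; rows 2 and 3 should be normalised to row 1
df = pd.DataFrame({"val1": [22.0, 44.0, 11.0], "val2": [1.0, 0.5, 2.0]})

# Column-aligned division by the first row; the first row becomes all 1.0
normalised = df.div(df.iloc[0])
print(normalised)
```

Unlike pct_change, this compares every row against the same reference row, so it generalises to any number of mfiles.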

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 26, 2023, 10:01

Commented on process/data/scenarios line 1

So when the package is built it will copy the data (because of the line "process.data": ["scenarios/*/*"] in setup.py), and the data will be available at process/data/scenarios/<regression test>/IN.DAT, correct?

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 26, 2023, 10:09

Commented on process/io/plot_solutions.py line 475

Whilst I agree, I am also willing to concede that doing so over so few rows probably does not matter.

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 26, 2023, 10:27

Commented on process/io/plot_solutions.py line 411

Why do we not need an _np version of this row?

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 26, 2023, 10:28

Commented on process/io/plot_solutions.py line 395

iloc might be better here because we genuinely want the first row

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 26, 2023, 11:10

The y-axis of the plots should probably be labelled with something more descriptive than "Value".

jonmaddock commented 1 year ago

In GitLab by @timothy-nunn on Apr 26, 2023, 11:18

Hi Jon,

Other than the questions/comments above, I think this is ready to go in. The plots it produces are valuable and working. This problem is not a typical use of pandas, which might be the reason looping is required; I suggest we don't get hung up on this, as we are dealing with small amounts of data, so speed will not be an issue.

Tim