Fix function for results checking

jnnr commented 4 years ago

As described in #13, the assertion in check_if_csv_dir_equal does not fail as expected when one of the directories is empty. Also, log files whose names end in a date suffix such as .log.2020-04-05 cause the check to fail, see #24.

This PR fixes that by comparing the lengths of the two lists of files to be compared. If the lists do not have the same number of elements, the assertion fails. Also, the 'ignore' option has been removed, and the function now only checks csv files.
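The length check described above could look roughly like the following. This is only a minimal sketch, not the code from this PR: the helper names list_csv_files and assert_same_csv_listing are hypothetical, and it only looks at one directory level.

```python
import os
import tempfile


def list_csv_files(directory):
    """Return the sorted names of all .csv files directly inside `directory`."""
    return sorted(
        name for name in os.listdir(directory)
        if name.endswith(".csv")
    )


def assert_same_csv_listing(dir_a, dir_b):
    """Fail if the two directories do not hold the same number of csv files."""
    files_a = list_csv_files(dir_a)
    files_b = list_csv_files(dir_b)
    assert len(files_a) == len(files_b), (
        f"{dir_a} contains {len(files_a)} csv files, "
        f"{dir_b} contains {len(files_b)}."
    )
    return files_a, files_b


# Demo: dir_a holds one csv file and one rotated log file, dir_b is empty.
with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
    for name in ("load.csv", "run.log.2020-04-05"):
        with open(os.path.join(dir_a, name), "w") as f:
            f.write("x\n")
    # The log file is filtered out because only names ending in .csv are kept.
    demo_listing = list_csv_files(dir_a)
    try:
        assert_same_csv_listing(dir_a, dir_b)
        empty_dir_detected = False
    except AssertionError:
        empty_dir_detected = True
```

Because only names ending in .csv are collected, a file like run.log.2020-04-05 never enters the comparison, and an empty directory yields a list of length zero that no longer matches silently.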

Lessons learned: There is a Python library called filecmp that can probably do most of this. In particular, it offers directory comparison via filecmp.dircmp (https://docs.python.org/2/library/filecmp.html#the-dircmp-class). Had I known that earlier, I would have considered using it.
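For reference, a minimal sketch of what filecmp.dircmp reports. The wrapper summarize_dir_diff and the demo files are made up for illustration; left_only, right_only and diff_files are attributes of the standard-library dircmp class.

```python
import filecmp
import os
import tempfile


def summarize_dir_diff(dir_a, dir_b):
    """Summarize which files are missing on either side and which differ."""
    cmp = filecmp.dircmp(dir_a, dir_b)
    return {
        "only_in_a": sorted(cmp.left_only),
        "only_in_b": sorted(cmp.right_only),
        "differing": sorted(cmp.diff_files),
    }


def write_file(directory, name, content):
    """Write `content` to `directory/name` (demo helper)."""
    with open(os.path.join(directory, name), "w") as f:
        f.write(content)


# Demo: load.csv differs between the directories, pv.csv exists only in dir_a.
with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
    write_file(dir_a, "load.csv", "region,amount\nAT,1\n")
    write_file(dir_b, "load.csv", "region,amount\nAT,1000\n")
    write_file(dir_a, "pv.csv", "region,capacity\nAT,5\n")
    demo_summary = summarize_dir_diff(dir_a, dir_b)
```

Note that dircmp compares files on the byte level, so values that differ only by floating-point noise (e.g. 81731000.0 vs. 81731000.00000001) would count as different, and subdirectories are only compared when its subdirs attribute is traversed.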

jnnr commented 4 years ago

I adapted the function according to your suggestions and added some tests. Please take another look!

unndreay commented 4 years ago

I retested this with the steps I had to take to adapt to Scalars v0.02 (#32). I consider this a real-life application of the assertions.

When it comes to the point where output is generated for the first time (and is ready to be compared), I get the following output:

16:50:48-INFO-Creating wind-onshore profiles
16:50:49-INFO-Creating wind-offshore profiles
16:50:49-INFO-Creating solar pv profiles
DataFrame.iloc[:, 3] (column name="amount") are different

DataFrame.iloc[:, 3] (column name="amount") values are different (100.0 %)
[left]:  [93481000000.0, 148476000000.0, 81731000000.0, 87873000000.0, 815155000000.0, 52246000000.0, 795247000000.0, 528236000000.0, 9041000000.0, 196805000000.0, 210883000000.0]
[right]: [93481000.0, 148476000.0, 81731000.00000001, 87872999.99999999, 815155000.0, 52246000.0, 795247000.0, 528236000.0, 9041000.0, 196805000.0, 210883000.0]
Files /home/unndreay/Workspaces/oemo-flex/experiment_1/002_data_preprocessed/FlexMex1_10/data/elements/load.csv and /home/unndreay/Workspaces/oemo-flex/experiment_1/002_data_preprocessed/FlexMex1_10_default/data/elements/load.csv differ.
DataFrame.iloc[:, 3] (column name="amount") are different

DataFrame.iloc[:, 3] (column name="amount") values are different (100.0 %)
[left]:  [93481000000.0, 148476000000.0, 81731000000.0, 87873000000.0, 815155000000.0, 52246000000.0, 795247000000.0, 528236000000.0, 9041000000.0, 196805000000.0, 210883000000.0]
[right]: [93481000.0, 148476000.0, 81731000.00000001, 87872999.99999999, 815155000.0, 52246000.0, 795247000.0, 528236000.0, 9041000.0, 196805000.0, 210883000.0]
DataFrame.iloc[:, 5] (column name="capacity") are different

DataFrame.iloc[:, 5] (column name="capacity") values are different (100.0 %)
[left]:  [24377.0, 92063.0, 43094.0, 19142.0, 209918.0, 8852.0, 139893.0, 123724.0, 6222.0, 49091.0, 38604.0]
[right]: [36523.0, 82906.0, 43948.0, 24016.0, 144431.0, 6480.0, 102445.0, 139458.0, 5910.0, 37213.0, 40822.0]
Files /home/unndreay/Workspaces/oemo-flex/experiment_1/002_data_preprocessed/FlexMex1_10/data/elements/pv.csv and /home/unndreay/Workspaces/oemo-flex/experiment_1/002_data_preprocessed/FlexMex1_10_default/data/elements/pv.csv differ.
DataFrame.iloc[:, 5] (column name="capacity") are different

DataFrame.iloc[:, 5] (column name="capacity") values are different (100.0 %)
[left]:  [24377.0, 92063.0, 43094.0, 19142.0, 209918.0, 8852.0, 139893.0, 123724.0, 6222.0, 49091.0, 38604.0]
[right]: [36523.0, 82906.0, 43948.0, 24016.0, 144431.0, 6480.0, 102445.0, 139458.0, 5910.0, 37213.0, 40822.0]
DataFrame.iloc[:, 5] (column name="capacity") are different

DataFrame.iloc[:, 5] (column name="capacity") values are different (54.54545 %)
[left]:  [0.0, 5579.0, 0.0, 0.0, 71962.0, 592.0, 0.0, 0.0, 0.0, 25598.0, 0.0]
[right]: [0.0, 5579.0, 0.0, 0.0, 66363.0, 2913.0, 59438.0, 31498.0, 0.0, 21298.0, 13906.0]
Files /home/unndreay/Workspaces/oemo-flex/experiment_1/002_data_preprocessed/FlexMex1_10/data/elements/wind-offshore.csv and /home/unndreay/Workspaces/oemo-flex/experiment_1/002_data_preprocessed/FlexMex1_10_default/data/elements/wind-offshore.csv differ.
DataFrame.iloc[:, 5] (column name="capacity") are different

DataFrame.iloc[:, 5] (column name="capacity") values are different (54.54545 %)
[left]:  [0.0, 5579.0, 0.0, 0.0, 71962.0, 592.0, 0.0, 0.0, 0.0, 25598.0, 0.0]
[right]: [0.0, 5579.0, 0.0, 0.0, 66363.0, 2913.0, 59438.0, 31498.0, 0.0, 21298.0, 13906.0]
DataFrame.iloc[:, 5] (column name="capacity") are different

DataFrame.iloc[:, 5] (column name="capacity") values are different (54.54545 %)
[left]:  [21817.0, 6347.0, 14158.0, 27011.0, 116908.0, 13037.0, 219932.0, 123588.0, 686.0, 10638.0, 53518.0]
[right]: [27956.0, 6347.0, 14158.0, 24107.0, 100104.0, 6502.0, 140091.0, 123588.0, 686.0, 10638.0, 35035.0]
Files /home/unndreay/Workspaces/oemo-flex/experiment_1/002_data_preprocessed/FlexMex1_10/data/elements/wind-onshore.csv and /home/unndreay/Workspaces/oemo-flex/experiment_1/002_data_preprocessed/FlexMex1_10_default/data/elements/wind-onshore.csv differ.
DataFrame.iloc[:, 5] (column name="capacity") are different

DataFrame.iloc[:, 5] (column name="capacity") values are different (54.54545 %)
[left]:  [21817.0, 6347.0, 14158.0, 27011.0, 116908.0, 13037.0, 219932.0, 123588.0, 686.0, 10638.0, 53518.0]
[right]: [27956.0, 6347.0, 14158.0, 24107.0, 100104.0, 6502.0, 140091.0, 123588.0, 686.0, 10638.0, 35035.0]
DataFrame are different

DataFrame shape mismatch
[left]:  (8760, 13)
[right]: (8760, 12)
Files /home/unndreay/Workspaces/oemo-flex/experiment_1/002_data_preprocessed/FlexMex1_10/data/sequences/load_profile.csv and /home/unndreay/Workspaces/oemo-flex/experiment_1/002_data_preprocessed/FlexMex1_10_default/data/sequences/load_profile.csv differ.
DataFrame are different

DataFrame shape mismatch
[left]:  (8760, 13)
[right]: (8760, 12)
DataFrame.iloc[:, 1] (column name="AT-el-solar-pv-profile") are different

DataFrame.iloc[:, 1] (column name="AT-el-solar-pv-profile") values are different (56.81507 %)
[left]:  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0109259, 0.017753799999999997, 0.0505276, 0.096954, 0.1365485, 0.1583889, 0.15565079999999998, 0.1187806, 0.0614354, 0.0204775, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0109137, 0.0190981, 0.0491072, 0.0832056, 0.1077531, 0.1186593, 0.1104708, 0.0818264, 0.042275, 0.0163638, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0109016, 0.0136263, 0.0354269, 0.06676289999999999, 0.0885589, 0.0980915, 0.0885507, 0.0613015, 0.0326926, 0.0149834, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0108895, 0.0136112, 0.0326654, 0.0571617, 0.07485119999999999, 0.08301289999999999, 0.07348339999999999, 0.0503474, 0.0285742, 0.0149668, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]
[right]: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.031354417, 0.089813541, 0.163079606, 0.14563288900000002, 0.197453217, 0.171502419, 0.140423616, 0.068538007, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.052963529, 0.124203394, 0.276860954, 0.267273886, 0.235339495, 0.207965504, 0.17808451, 0.073552943, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.080478656, 0.128366004, 0.291226023, 0.316220569, 0.309829643, 0.28089302, 0.216071464, 0.101598264, 1.79e-06, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.03211066, 0.077264182, 0.123061144, 0.14792887, 0.147225807, 0.142740871, 0.096196357, 0.049089087, 0.00031336, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]
Files /home/unndreay/Workspaces/oemo-flex/experiment_1/002_data_preprocessed/FlexMex1_10/data/sequences/pv_profile.csv and /home/unndreay/Workspaces/oemo-flex/experiment_1/002_data_preprocessed/FlexMex1_10_default/data/sequences/pv_profile.csv differ.
DataFrame.iloc[:, 1] (column name="AT-el-solar-pv-profile") are different

From my point of view, the output is still hard to read. The blank lines group the messages in a way that makes them harder to understand, and output is repeated (DataFrame.iloc[:, 3] (column name="amount") are different and DataFrame.iloc[:, 3] (column name="amount") values are different (100.0 %)). Is it possible to clean this up a bit further?

EDIT: Sorry, the bulk output (all assertions at the same time) is due to another change on my side. The output with only the first exception raised looks like this:

Traceback (most recent call last):
  File "/home/unndreay/Workspaces/oemo-flex/oemoflex/helpers.py", line 122, in check_if_csv_dirs_equal
    check_if_csv_files_equal(file_a, file_b)
  File "/home/unndreay/Workspaces/oemo-flex/oemoflex/helpers.py", line 80, in check_if_csv_files_equal
    assert_frame_equal(df1, df2)
  File "/home/unndreay/.virtualenvs/oemo-flex/lib/python3.7/site-packages/pandas/_testing.py", line 1382, in assert_frame_equal
    obj=f'{obj}.iloc[:, {i}] (column name="{col}")',
  File "/home/unndreay/.virtualenvs/oemo-flex/lib/python3.7/site-packages/pandas/_testing.py", line 1191, in assert_series_equal
    obj=str(obj),
  File "pandas/_libs/testing.pyx", line 65, in pandas._libs.testing.assert_almost_equal
  File "pandas/_libs/testing.pyx", line 174, in pandas._libs.testing.assert_almost_equal
  File "/home/unndreay/.virtualenvs/oemo-flex/lib/python3.7/site-packages/pandas/_testing.py", line 915, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: DataFrame.iloc[:, 3] (column name="amount") are different

DataFrame.iloc[:, 3] (column name="amount") values are different (100.0 %)
[left]:  [93481000000.0, 148476000000.0, 81731000000.0, 87873000000.0, 815155000000.0, 52246000000.0, 795247000000.0, 528236000000.0, 9041000000.0, 196805000000.0, 210883000000.0]
[right]: [93481000.0, 148476000.0, 81731000.00000001, 87872999.99999999, 815155000.0, 52246000.0, 795247000.0, 528236000.0, 9041000.0, 196805000.0, 210883000.0]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/unndreay/Workspaces/oemo-flex/experiment_1/scripts/FlexMex1_10/FlexMex1_10_runall.py", line 12, in <module>
    FlexMex1_10_preprocessing.main()
  File "/home/unndreay/Workspaces/oemo-flex/experiment_1/scripts/FlexMex1_10/FlexMex1_10_preprocessing.py", line 290, in main
    check_if_csv_dirs_equal(new_path, previous_path)
  File "/home/unndreay/Workspaces/oemo-flex/oemoflex/helpers.py", line 124, in check_if_csv_dirs_equal
    raise AssertionError(f"Files {file_a} and {file_b} differ.\n{e}")
AssertionError: Files /home/unndreay/Workspaces/oemo-flex/experiment_1/002_data_preprocessed/FlexMex1_10/data/elements/load.csv and /home/unndreay/Workspaces/oemo-flex/experiment_1/002_data_preprocessed/FlexMex1_10_default/data/elements/load.csv differ.
DataFrame.iloc[:, 3] (column name="amount") are different

DataFrame.iloc[:, 3] (column name="amount") values are different (100.0 %)
[left]:  [93481000000.0, 148476000000.0, 81731000000.0, 87873000000.0, 815155000000.0, 52246000000.0, 795247000000.0, 528236000000.0, 9041000000.0, 196805000000.0, 210883000000.0]
[right]: [93481000.0, 148476000.0, 81731000.00000001, 87872999.99999999, 815155000.0, 52246000.0, 795247000.0, 528236000.0, 9041000.0, 196805000.0, 210883000.0]

It contains all necessary information, yet it is hard to read. For a cleaner output, I suggest shortening the paths to relative paths and raising only one exception.
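One way to get there, as a rough sketch: build the message from relative paths and re-raise with `from None`, which suppresses the "During handling of the above exception, another exception occurred" part of the traceback. The function compare_pair, its base_dir and check parameters, and the /work paths are hypothetical and only serve the illustration.

```python
import os


def compare_pair(file_a, file_b, base_dir, check):
    """Run `check` on two files and raise one short AssertionError on failure."""
    try:
        check(file_a, file_b)
    except AssertionError as e:
        # Report paths relative to base_dir instead of absolute paths.
        rel_a = os.path.relpath(file_a, base_dir)
        rel_b = os.path.relpath(file_b, base_dir)
        # `from None` hides the original exception, so only this one is printed.
        raise AssertionError(f"Files {rel_a} and {rel_b} differ.\n{e}") from None


def always_differs(file_a, file_b):
    """Stand-in for the real file comparison; always reports a difference."""
    raise AssertionError("values are different")


# Demo: capture the exception to inspect the message and the chaining flags.
try:
    compare_pair(
        "/work/new/elements/load.csv",
        "/work/default/elements/load.csv",
        "/work",
        always_differs,
    )
except AssertionError as e:
    demo_error = e
```

The details of the first difference are still part of the message, since the text of the original exception is embedded in the new one.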

I would even prefer the bulk output (above): it shows the full difference and lets me group the necessary changes for similar errors (the same error in all csv files can be fixed in one go), rather than working along a never-ending chain of exceptions (another exception is raised as soon as one is solved, and you never know when it stops). But that may be a matter of taste.

jnnr commented 4 years ago

Thanks for reviewing! I shortened the output. Now the error message just names the file without path.

> I would even prefer the bulk output (above): it shows the full difference and lets me group the necessary changes for similar errors (the same error in all csv files can be fixed in one go), rather than working along a never-ending chain of exceptions (another exception is raised as soon as one is solved, and you never know when it stops). But that may be a matter of taste.

I did not fully get your comment, but maybe I will when I continue with data version v0.02.

jnnr commented 4 years ago

To get a better overview when a lot of files are different, it would be better if the user just gets a list of the files that are different. To see the diff in detail, it makes more sense to use a dedicated diff tool like meld.
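Roughly what I have in mind, as a sketch only: collect the names of all differing csv files first and fail once at the end. The function names are hypothetical, and filecmp.cmp (a byte-level comparison) is used here as a stand-in for the pandas-based check in the actual helper.

```python
import filecmp
import os
import tempfile


def list_differing_csv_files(dir_a, dir_b):
    """Return the names of csv files that are missing on one side or differ."""
    names = sorted(
        {n for d in (dir_a, dir_b) for n in os.listdir(d) if n.endswith(".csv")}
    )
    differing = []
    for name in names:
        path_a = os.path.join(dir_a, name)
        path_b = os.path.join(dir_b, name)
        both_exist = os.path.isfile(path_a) and os.path.isfile(path_b)
        # shallow=False forces a comparison of the file contents.
        if not both_exist or not filecmp.cmp(path_a, path_b, shallow=False):
            differing.append(name)
    return differing


def assert_csv_dirs_equal(dir_a, dir_b):
    """Raise a single AssertionError that lists every differing file."""
    differing = list_differing_csv_files(dir_a, dir_b)
    assert not differing, "Files differ: " + ", ".join(differing)


def write_file(directory, name, content):
    """Write `content` to `directory/name` (demo helper)."""
    with open(os.path.join(directory, name), "w") as f:
        f.write(content)


# Demo: load.csv is identical in both directories, pv.csv is not.
with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
    write_file(dir_a, "load.csv", "amount\n1\n")
    write_file(dir_b, "load.csv", "amount\n1\n")
    write_file(dir_a, "pv.csv", "capacity\n2\n")
    write_file(dir_b, "pv.csv", "capacity\n3\n")
    demo_differing = list_differing_csv_files(dir_a, dir_b)
```

The message stays short no matter how many files differ, and the listed files can then be opened in a diff tool for the details.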

jnnr commented 4 years ago

I implemented the suggestion from my previous comment.

modex-flexmex / oemof-flexmex

Fix function for results checking #25