Deprecate outdated interfaces of pandas used in rank.py

FuryMartin commented 3 months ago

What would you like to be added/modified:

Some outdated pandas interfaces are used in rank.py.

We can replace them by:

Implementing _get_all() with a cleaner interface.
Using pd.concat() to merge the old DataFrame and the new DataFrame.

The modified method should remain compatible with the version of pandas corresponding to Python 3.6.

Why is this needed:

Ianvs was originally designed for pandas==1.1.5, but the latest version is now pandas==2.2.2. Because of the major version update, some pandas interfaces have been deprecated or removed in the new version.

Continuing to use these old interfaces causes errors on Python>=3.8.

pd.np has been removed in pandas>=2.0.0: https://github.com/kubeedge/ianvs/blob/f2352ce018f04f398b1be0f37d0fa3cd11476626/core/storymanager/rank/rank.py#L208

AttributeError: module 'pandas' has no attribute 'np'
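
A minimal sketch of the usual fix, assuming the code only needs NumPy's NaN constant: import numpy directly instead of reaching it through the pandas namespace. The column names below are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# pd.np was only an alias for the numpy module, so importing numpy
# directly works on both old and new pandas releases.
header = ["algorithm", "f1_score"]  # hypothetical columns for illustration

# np.nan (lowercase) is the spelling available in every NumPy release.
df = pd.DataFrame([[np.nan] * len(header)], columns=header)

all_missing = bool(df.isna().all().all())
print(all_missing)
```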

DataFrame.append has been removed in pandas>=2.0.0: https://github.com/kubeedge/ianvs/blob/f2352ce018f04f398b1be0f37d0fa3cd11476626/core/storymanager/rank/rank.py#L171
```
AttributeError: 'DataFrame' object has no attribute 'append'
```
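
As a rough illustration of the replacement, pd.concat() accepts a list of DataFrames and is available in both pandas 1.x and 2.x. The data here is made up, and ignore_index=True is an optional choice for this sketch rather than something the issue requires.

```python
import pandas as pd

old_df = pd.DataFrame({"algorithm": ["algo_a"], "f1_score": [0.85]})
new_df = pd.DataFrame({"algorithm": ["algo_b"], "f1_score": [0.87]})

# Removed in pandas 2.0:  merged = old_df.append(new_df)
# Works on old and new pandas alike:
merged = pd.concat([old_df, new_df], ignore_index=True)

print(merged)
```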

Initializing all_df with np.NAN causes str data to go missing in the output: https://github.com/kubeedge/ianvs/blob/f2352ce018f04f398b1be0f37d0fa3cd11476626/core/storymanager/rank/rank.py#L145

+------+-----------+-----+-----------+----------+-----------+----------+---------------------+----------------------+-------------------+------------------------+---------------------+------+-----+
| rank | algorithm | acc | edge-rate | paradigm | basemodel | apimodel | hard_example_mining | basemodel-model_name | basemodel-backend | basemodel-quantization | apimodel-model_name | time | url |
+------+-----------+-----+-----------+----------+-----------+----------+---------------------+----------------------+-------------------+------------------------+---------------------+------+-----+
|  1   |           | 0.6 |    0.6    |          |           |          |                     |                      |                   |                        |                     |      |     |
+------+-----------+-----+-----------+----------+-----------+----------+---------------------+----------------------+-------------------+------------------------+---------------------+------+-----+

Setting a value through chained indexing such as df.loc[row_index][column] triggers a chained-assignment warning. This pattern will stop updating the original DataFrame in pandas>=3.0. https://github.com/kubeedge/ianvs/blob/f2352ce018f04f398b1be0f37d0fa3cd11476626/core/storymanager/rank/rank.py#L148

Lines 151, 154, 158, 165, and 167 have the same issue.

./ianvs/core/storymanager/rank/rank.py:148: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  all_df.loc[i][0] = algorithm.name
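
A small sketch of the single-step form the warning recommends, using a made-up DataFrame. The row and column are passed to one .loc (label-based) or .iloc (position-based) call, so the assignment always targets the original object.

```python
import pandas as pd

df = pd.DataFrame({"algorithm": [None, None], "f1_score": [0.0, 0.0]})

# Chained form (may write to a temporary copy):
#   df.loc[0]["algorithm"] = "algo_a"

# Single-step, label-based assignment:
df.loc[0, "algorithm"] = "algo_a"
df.loc[0, "f1_score"] = 0.87

# Single-step, position-based assignment (row 1, column 0):
df.iloc[1, 0] = "algo_b"

print(df)
```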

Assigning values one by one reduces code readability and can be simplified. https://github.com/kubeedge/ianvs/blob/f2352ce018f04f398b1be0f37d0fa3cd11476626/core/storymanager/rank/rank.py#L145-L167
Using whitespace as the separator in a CSV (Comma-Separated Values) file is unconventional. https://github.com/kubeedge/ianvs/blob/f2352ce018f04f398b1be0f37d0fa3cd11476626/core/storymanager/rank/rank.py#L179
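
To illustrate the separator point, here is a sketch of a round trip with the default comma separator, written to an in-memory buffer instead of a real rank file. With commas, a value that contains a space (such as a timestamp) needs no quoting. The sample row is illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "algorithm": ["fpn_singletask_learning"],
    "time": ["2024-08-16 22:54:24"],  # value that contains a space
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # default sep=","
csv_text = buffer.getvalue()

buffer.seek(0)
restored = pd.read_csv(buffer)   # default sep=","

print(csv_text)
```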

FuryMartin commented 3 months ago

The current implementation in Ianvs is: https://github.com/kubeedge/ianvs/blob/f2352ce018f04f398b1be0f37d0fa3cd11476626/core/storymanager/rank/rank.py#L142-L173

This is a revised version of the implementation. All interfaces it uses are compatible with pandas==1.1.5:

    def _get_all(self, test_cases, test_results) -> pd.DataFrame:
        all_df = pd.DataFrame(columns=self.all_df_header)

        for i, test_case in enumerate(test_cases):
            algorithm = test_case.algorithm
            test_result = test_results[test_case.id][0]

            # add algorithm, paradigm, time, url of algorithm
            row_data = {
                "algorithm": algorithm.name,
                "paradigm": algorithm.paradigm_type,
                "time": test_results[test_case.id][1],
                "url": test_case.output_dir
            }

            # add metric of algorithm
            row_data.update(test_result)

            # add module of algorithm
            row_data.update({
                module_type: module.name
                for module_type, module in algorithm.modules.items()
            })

            # add hyperparameters of algorithm modules
            row_data.update(self._get_algorithm_hyperparameters(algorithm))

            # fill data
            all_df.loc[i] = row_data

        new_df = self._concat_existing_data(all_df)

        return self._sort_all_df(new_df, self._get_all_metric_names(test_results))

    def _concat_existing_data(self, new_df):
        if utils.is_local_file(self.all_rank_file):
            old_df = pd.read_csv(self.all_rank_file, index_col=0)
            new_df = pd.concat([old_df, new_df])
        return new_df

Compared with the current implementation, the revised one mainly:

Fills each DataFrame row in one step using a dict-typed row_data.
Uses pd.concat to merge old_df and new_df.
Extracts the DataFrame merging into a separate function to make the logic of _get_all() clearer.
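
For readers who want to try the dict-per-row idea in isolation, the following is a sketch of a closely related variant, not the code proposed above: it collects one dict per test case and builds the DataFrame in a single call. The header and values are hypothetical.

```python
import pandas as pd

header = ["algorithm", "f1_score", "paradigm", "time"]  # hypothetical header

# One dict per test case; keys absent from a dict become NaN in that row.
rows = [
    {"algorithm": "algo_a", "f1_score": 0.87, "paradigm": "singletasklearning"},
    {"algorithm": "algo_b", "f1_score": 0.85, "time": "2024-08-16 22:57:38"},
]

# columns=header fixes the column order regardless of dict key order.
all_df = pd.DataFrame(rows, columns=header)

print(all_df)
```

Because string values are placed directly into their columns, nothing has to be initialized with NaN first.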

Additionally, I removed sep=" " from all CSV read and write functions.

FuryMartin commented 3 months ago

After fixing PCB-AoI's dependencies, I conducted experiments on this example with Python==3.6.13 to prove that our revisions do not introduce new compatibility issues.

The experimental results are as follows:

Run once:

f1_score_avg: 0.8568
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+
| rank |        algorithm        | f1_score |      paradigm      | basemodel | basemodel-momentum | basemodel-learning_rate |         time        |                                           url                                            |
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+
|  1   | fpn_singletask_learning |  0.8694  | singletasklearning |    FPN    |        0.95        |           0.1           | 2024-08-16 22:54:24 | ./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c94-5bde-11ef-bf9b-755996a48c84 |
|  2   | fpn_singletask_learning |  0.8568  | singletasklearning |    FPN    |        0.5         |           0.1           | 2024-08-16 22:57:38 | ./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c95-5bde-11ef-bf9b-755996a48c84 |
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+

Run twice:

f1_score_avg: 0.8635
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+
| rank |        algorithm        | f1_score |      paradigm      | basemodel | basemodel-momentum | basemodel-learning_rate |         time        |                                           url                                            |
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+
|  1   | fpn_singletask_learning |  0.8707  | singletasklearning |    FPN    |        0.95        |           0.1           | 2024-08-16 23:59:15 | ./workspace/benchmarkingjob/fpn_singletask_learning/08e9a128-5be8-11ef-bf9b-755996a48c84 |
|  2   | fpn_singletask_learning |  0.8694  | singletasklearning |    FPN    |        0.95        |           0.1           | 2024-08-16 22:54:24 | ./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c94-5bde-11ef-bf9b-755996a48c84 |
|  3   | fpn_singletask_learning |  0.8635  | singletasklearning |    FPN    |        0.5         |           0.1           | 2024-08-17 00:02:22 | ./workspace/benchmarkingjob/fpn_singletask_learning/08e9a129-5be8-11ef-bf9b-755996a48c84 |
|  4   | fpn_singletask_learning |  0.8568  | singletasklearning |    FPN    |        0.5         |           0.1           | 2024-08-16 22:57:38 | ./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c95-5bde-11ef-bf9b-755996a48c84 |
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+

The resulting all_rank.csv is shown below:

rank,algorithm,f1_score,paradigm,basemodel,basemodel-momentum,basemodel-learning_rate,time,url
1,fpn_singletask_learning,0.8707,singletasklearning,FPN,0.95,0.1,2024-08-16 23:59:15,./workspace/benchmarkingjob/fpn_singletask_learning/08e9a128-5be8-11ef-bf9b-755996a48c84
2,fpn_singletask_learning,0.8694,singletasklearning,FPN,0.95,0.1,2024-08-16 22:54:24,./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c94-5bde-11ef-bf9b-755996a48c84
3,fpn_singletask_learning,0.8635,singletasklearning,FPN,0.5,0.1,2024-08-17 00:02:22,./workspace/benchmarkingjob/fpn_singletask_learning/08e9a129-5be8-11ef-bf9b-755996a48c84
4,fpn_singletask_learning,0.8568,singletasklearning,FPN,0.5,0.1,2024-08-16 22:57:38,./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c95-5bde-11ef-bf9b-755996a48c84

These results indicate that the revised version is functioning properly.

hsj576 commented 2 months ago

The overall change looks good to me. Please talk about it at the next community meeting and see if anyone else has any questions.

FuryMartin commented 2 months ago

> The overall change looks good to me. Please talk about it at the next community meeting and see if anyone else has any questions.

OK, thanks for the review.

Deprecate outdated interfaces of pandas used in rank.py #134