intel / p3-analysis-library

A library simplifying the collection and interpretation of P3 data.
https://intel.github.io/p3-analysis-library/
MIT License

Reject duplicate results when handling efficiencies #65

Closed Pennycook closed 2 weeks ago

Pennycook commented 3 weeks ago

Removing data from a user-supplied DataFrame might impact certain properties of the data (e.g., the order in which applications, platforms, and/or problems appear).

Rather than complicate our implementation with workarounds that might not address every possible use case, we can simply detect and reject problematic data.

Related issues

This effectively reverts #22. It's an alternative solution to the one proposed in #63.

Proposed changes

The upshot of the changes here is intended to be:

In my own offline testing of complex P3 workflows, I've found that I need to insert an additional line to prepare data the way I typically want it to be plotted:

```python
import p3

eff_df = p3.metrics.application_efficiency(projected_df, foms="higher")
eff_df = (
    eff_df.sort_values("app eff")
          .drop_duplicates(["application", "platform"], keep="last")
          .sort_index()
)
cascade = p3.plot.cascade(eff_df)
```

I don't think this is too bad, and it only shows up in complicated cases. If we wanted to simplify this workflow, we could consider introducing something like:

```python
eff_df = p3.metrics.application_efficiency(
    projected_df, foms="higher", keep="best"  # or keep="all", keep="latest"
)
cascade = p3.plot.cascade(eff_df)
```

...but I'd want to explore that separately, to make sure we design and test it properly.