intel / p3-analysis-library

A library simplifying the collection and interpretation of P3 data.
https://intel.github.io/p3-analysis-library/
MIT License

Reject duplicate results when handling efficiencies #65

Closed Pennycook closed 2 weeks ago

Pennycook commented 3 weeks ago

Removing data from a user-supplied DataFrame might impact certain properties of the data (e.g., the order in which applications, platforms, and/or problems appear).

Rather than complicate our implementation with workarounds that might not address every possible use case, we can simply detect and reject problematic data.

Related issues

This effectively reverts #22. It's an alternative solution to the one proposed in #63.

Proposed changes

The upshot of the changes here is intended to be:

In my own offline testing of complex P3 workflows, I've found that I need to insert an additional line to prepare data the way I typically want it to be plotted:

```python
import p3

eff_df = p3.metrics.application_efficiency(projected_df, foms="higher")
eff_df = (
    eff_df.sort_values("app eff")
          .drop_duplicates(["application", "platform"], keep="last")
          .sort_index()
)
cascade = p3.plot.cascade(eff_df)
```

I don't think this is too bad, and it only shows up in complicated cases. If we wanted to simplify this workflow, we could consider introducing something like:

```python
eff_df = p3.metrics.application_efficiency(
    projected_df, foms="higher", keep="best"  # or keep="all", keep="latest"
)
cascade = p3.plot.cascade(eff_df)
```

...but I'd want to explore that separately, to make sure we design and test it properly.