For this what-if analysis, we want to add support for CleanML data cleaning operations. For a list of operations, see here.
However, not all of CleanML's data cleaning methods are properly integrated into the framework, so only a subset of them is easy to use.
We can also add support for additional mislabel cleaning methods.
CleanLab has a very nice API that we can use to wrap the ML model.
We can implement training data cleaning in which the rows to clean are selected based on their Shapley values.
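To make the Shapley-based selection concrete, here is a minimal sketch that estimates a data Shapley value for each training row by Monte Carlo permutation sampling and then picks the lowest-valued rows as cleaning candidates. Everything in it is an assumption for illustration: the names `knn_accuracy` and `monte_carlo_shapley`, the 1-nearest-neighbour utility, the toy one-dimensional data, and the choice to give the empty training set a utility of 0. The real analysis would use the actual model and validation set as the utility.

```python
import random


def knn_accuracy(train, valid):
    """Utility: accuracy of a 1-nearest-neighbour classifier on `valid`."""
    if not train:
        return 0.0  # assumption: an empty training set has utility 0
    correct = 0
    for x, y in valid:
        # Predict the label of the closest training point.
        _, pred = min(train, key=lambda p: abs(p[0] - x))
        correct += pred == y
    return correct / len(valid)


def monte_carlo_shapley(train, valid, utility, n_permutations=200, seed=0):
    """Estimate each training row's Shapley value by permutation sampling."""
    rng = random.Random(seed)
    values = [0.0] * len(train)
    indices = list(range(len(train)))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        subset, prev = [], utility([], valid)
        for i in indices:
            # Marginal contribution of row i given the rows added before it.
            subset.append(train[i])
            current = utility(subset, valid)
            values[i] += current - prev
            prev = current
    return [v / n_permutations for v in values]


# Toy data: (feature, label). The row (1.5, 1) is deliberately mislabeled.
train = [(0.0, 0), (1.0, 0), (2.0, 0), (1.5, 1),
         (10.0, 1), (11.0, 1), (12.0, 1)]
valid = [(0.5, 0), (1.4, 0), (1.6, 0), (10.5, 1), (11.5, 1)]

values = monte_carlo_shapley(train, valid, knn_accuracy)

# Rows with the lowest values are the first candidates for cleaning.
candidates = sorted(range(len(train)), key=lambda i: values[i])[:2]
```

Permutation sampling is used here because exact Shapley values require evaluating every subset of the training data, which is infeasible beyond a handful of rows. The number of permutations and the number of candidates are tuning choices, not recommendations.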
The interface could be that users provide a set of (column name, error type) pairs; we then try out different cleaning techniques for each pair and output a report. For mislabels, no column name is needed.
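One possible shape for this interface is sketched below. It is only an illustration under stated assumptions: `ErrorSpec`, `CLEANERS`, `what_if_report`, the error type name `"missing_values"`, and the three cleaning methods are hypothetical and are not taken from CleanML. The `score` callable stands in for training and evaluating the actual model on the cleaned data.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional


@dataclass(frozen=True)
class ErrorSpec:
    """One user request: an error type, plus a column unless it is a mislabel."""
    error_type: str
    column: Optional[str] = None  # not needed for "mislabel"


def impute(rows, column, fill):
    """Replace missing values in `column` with fill(observed values)."""
    observed = [r[column] for r in rows if r[column] is not None]
    value = fill(observed)
    return [{**r, column: value if r[column] is None else r[column]}
            for r in rows]


def drop_missing(rows, column):
    """Delete rows whose value in `column` is missing."""
    return [r for r in rows if r[column] is not None]


# Hypothetical registry: error type -> {method name: cleaning function}.
# Mislabel cleaners would be registered here too and would ignore the column.
CLEANERS = {
    "missing_values": {
        "impute_mean": lambda rows, col: impute(rows, col, mean),
        "impute_median": lambda rows, col: impute(rows, col, median),
        "delete_rows": drop_missing,
    },
}


def what_if_report(rows, specs, score):
    """Try every registered cleaning method for each spec and collect results."""
    report = []
    for spec in specs:
        if spec.error_type != "mislabel" and spec.column is None:
            raise ValueError(f"{spec.error_type} requires a column name")
        for name, cleaner in CLEANERS.get(spec.error_type, {}).items():
            cleaned = cleaner(rows, spec.column)
            report.append({
                "column": spec.column,
                "error_type": spec.error_type,
                "method": name,
                "rows": len(cleaned),
                "score": score(cleaned),
            })
    return report


rows = [{"age": 20.0}, {"age": None}, {"age": 40.0}, {"age": 90.0}]
specs = [ErrorSpec("missing_values", "age"), ErrorSpec("mislabel")]

# Stand-in score: a real run would train and evaluate the model instead.
report = what_if_report(rows, specs, score=lambda cleaned: len(cleaned) / len(rows))
```

Keeping the cleaning methods in a registry keyed by error type means new techniques can be added without changing the reporting loop. The mislabel spec produces no report rows in this sketch only because no mislabel cleaner is registered.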
Potential set of data cleaning methods