squaredev-io / whitebox

[Not Actively Maintained] Whitebox is an open source E2E ML monitoring platform with edge capabilities that plays nicely with kubernetes
https://squaredev.io/whitebox/
MIT License
184 stars · 5 forks

Missing values count should be performed in unprocessed dataset #57

Open NickNtamp opened 1 year ago

NickNtamp commented 1 year ago

Since the missing-value count is an indicator of missing values in a feature of a dataset, it has much more value when computed on the unprocessed dataset rather than on the processed one: most likely, a missing-value handling procedure has already been applied to the processed dataset.

Also, for monitoring and alerting purposes, it is more useful to report the missing values as a percentage of the total entries of each feature (e.g. if a feature has 100 entries and 3 missing values, the missing-value count is 3%). This makes setting thresholds for this metric much easier.
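The percentage metric described above can be sketched with pandas in a few lines (the dataframe and column names here are illustrative, not part of the whitebox codebase):

```python
import pandas as pd

# Hypothetical unprocessed inference dataset with missing entries.
df = pd.DataFrame({
    "feature_a": [1.0, None, 3.0, None],
    "feature_b": ["x", "y", None, "z"],
})

# Percentage of missing values per feature (0-100).
missing_pct = df.isna().mean() * 100
print(missing_pct.to_dict())  # {'feature_a': 50.0, 'feature_b': 25.0}
```

A ratio in [0, 100] is comparable across features and across models, which is what makes a single alerting threshold practical.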

momegas commented 1 year ago

Let's tackle this if we have time this sprint.

stavrostheocharis commented 1 year ago

Let's do a quick analysis before the implementation.

NickNtamp commented 1 year ago

@stavrostheocharis and @sinnec, as you asked, here is a further description.

In `src/cron_tasks/monitoring_metrics.py` you can find the following function:

```python
async def run_calculate_feature_metrics_pipeline(
    model: Model, inference_processed_df: pd.DataFrame
):
    """
    Run the pipeline to calculate the feature metrics.
    After the metrics are calculated they are saved in the database.
    """
    logger.info(f"Calculating feature metrics for model {model.id}")
    feature_metrics_report = create_feature_metrics_pipeline(inference_processed_df)

    if feature_metrics_report:
        new_feature_metric = ModelIntegrityMetricCreate(
            model_id=model.id,
            timestamp=str(datetime.utcnow()),
            feature_metrics=feature_metrics_report,
        )

        crud.model_integrity_metrics.create(db, obj_in=new_feature_metric)
        logger.info("Feature metrics calculated!")
```

As you can see, we calculate the feature metrics (e.g. missing_value_count, average, min, max, etc.) for the processed inference data. I believe that the missing-value count specifically has to be computed on the unprocessed dataset. What do you think?
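One way the raw-data computation could look, as a sketch only (the function name and its place in the pipeline are hypothetical, not the actual whitebox API):

```python
import pandas as pd

# Sketch: a hypothetical companion to create_feature_metrics_pipeline that
# computes missing-value metrics on the raw (unprocessed) inference data,
# before any imputation or row dropping has happened.
def missing_values_metrics(inference_raw_df: pd.DataFrame) -> dict:
    counts = inference_raw_df.isna().sum()      # absolute count per feature
    pct = inference_raw_df.isna().mean() * 100  # percentage per feature
    return {
        col: {"missing_count": int(counts[col]), "missing_pct": float(pct[col])}
        for col in inference_raw_df.columns
    }

# Illustrative usage on a toy raw dataframe.
demo = pd.DataFrame({"age": [25, None, 31, None]})
report = missing_values_metrics(demo)
```

Such a function would be called with the unprocessed dataframe, while the existing pipeline keeps receiving `inference_processed_df` for the remaining metrics.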

stavrostheocharis commented 1 year ago

Yes, it should definitely be performed on the unprocessed dataset. The only open questions are where to do it and how to save the results into the database.

More specifically, let's assume that the function that handles the calculations computes some metrics on the processed dataset and others on the unprocessed one. We then have to define a way for these two groups of metrics to be viewed separately, because the current implementation saves them all together in the database.

So an adjustment to the schema may also be needed.
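For the schema discussion, one possible shape is simply to nest the report under two keys. This is illustrative only, not the actual whitebox schema:

```python
# Illustrative only: keep metrics computed on the processed and unprocessed
# datasets separate, so they can be queried and displayed independently.
feature_metrics_report = {
    "processed": {
        "feature_a": {"average": 1.2, "minimum": 0.0, "maximum": 3.4},
    },
    "unprocessed": {
        "feature_a": {"missing_count": 3, "missing_pct": 3.0},
    },
}
```

If the `feature_metrics` column already stores arbitrary JSON, this nesting could avoid a migration; a new top-level column per group is the stricter alternative.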

NickNtamp commented 1 year ago

@stavrostheocharis I suggest we discuss this all together on Monday. What do you think?