I have implemented a new EmptyInspector class in sdgx/data_models/inspectors/empty.py to identify columns in a DataFrame that have a high rate of missing values. This class tags these columns as empty and removes them during the training process, reinserting them into their original positions after the model sampling process is complete.
Additionally, I have added an EmptyTransformer class in sdgx/data_processors/transformers/empty.py to handle these empty columns by removing them during data conversion and restoring them during reverse conversion.
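The remove-then-restore behaviour described above can be sketched with plain pandas. This is a minimal illustration, not the code in this PR: the function names and the `missing_rate_threshold` parameter with its 0.9 default are assumptions made for the example, and the actual EmptyInspector and EmptyTransformer interfaces may differ.

```python
import numpy as np
import pandas as pd


def find_empty_columns(df, missing_rate_threshold=0.9):
    """Return columns whose share of missing values meets the threshold.

    The threshold value is an assumption for this sketch.
    """
    missing_rate = df.isna().mean()  # per-column fraction of NaN values
    return [col for col in df.columns if missing_rate[col] >= missing_rate_threshold]


def drop_empty_columns(df, empty_columns):
    """Remove the tagged columns before the data reaches the model."""
    return df.drop(columns=empty_columns)


def restore_empty_columns(sampled, original_columns):
    """Reinsert removed columns at their original positions, filled with NaN."""
    # reindex creates any column missing from `sampled` as all-NaN and
    # orders the result according to `original_columns`.
    return sampled.reindex(columns=list(original_columns))


raw = pd.DataFrame(
    {
        "age": [31, 45, 27],
        "notes": [np.nan, np.nan, np.nan],  # fully missing column
        "city": ["Oslo", "Lima", "Pune"],
    }
)

empty = find_empty_columns(raw)
trainable = drop_empty_columns(raw, empty)
restored = restore_empty_columns(trainable, raw.columns)
```

Here `trainable` stands in for both the training input and the sampled output; in the real pipeline the sampled frame would come from the model.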
Motivation and Context
This change is required to improve the handling of missing data in our data processing pipeline.
By identifying and handling empty columns explicitly, we can prevent issues related to missing data during model training and ensure that these columns are correctly reinserted after processing.
This enhancement ensures data integrity and improves the robustness of our models.
How has this been tested?
I have added a new test module tests/data_models/inspector/test_empty.py to verify the functionality of the EmptyInspector class. The test checks that the inspector correctly identifies columns with a high rate of missing values and that the inspection level is set correctly. The test uses a sample DataFrame in which specific columns are set to missing values to simulate the scenario.
Once all automated tests in this PR have passed, I will add test cases for the EmptyTransformer.
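The shape of such a test can be sketched as follows. This is a hedged, standalone illustration rather than the contents of test_empty.py: `make_sample_frame` and `columns_with_high_missing_rate` are hypothetical helpers that stand in for the fixture and the inspector, the 0.9 threshold is an assumption, and the inspection-level check is left out because it depends on the inspector's own interface.

```python
import numpy as np
import pandas as pd


def make_sample_frame(n_rows=10):
    """Build a frame in which two columns are entirely missing (hypothetical fixture)."""
    return pd.DataFrame(
        {
            "age": np.arange(n_rows, dtype=float),
            "notes": [np.nan] * n_rows,
            "score": np.linspace(0.0, 1.0, n_rows),
            "legacy_id": [np.nan] * n_rows,
        }
    )


def columns_with_high_missing_rate(df, threshold=0.9):
    """Stand-in for the inspector: names of columns at or above the threshold."""
    rates = df.isna().mean()
    return set(rates[rates >= threshold].index)


def test_detects_fully_missing_columns():
    df = make_sample_frame()
    assert columns_with_high_missing_rate(df) == {"notes", "legacy_id"}


def test_ignores_partially_missing_column():
    df = make_sample_frame()
    df.loc[0, "age"] = np.nan  # one missing value in ten rows is below the threshold
    assert "age" not in columns_with_high_missing_rate(df)
```

Both functions follow the pytest naming convention, so they would be collected automatically if placed in a test module.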
Types of changes
[ ] Maintenance (no change in code, maintain the project's CI, docs, etc.)
[x] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
Checklist:
[x] My code follows the code style of this project.
[x] My change requires a change to the documentation.