Improves how NAs are handled, to make data processing more robust and avoid common pitfalls when dealing with NAs.
Also improves the documentation.
Tasks:
Review the current strategy and agree on next steps
Check that the data contains a case that tests this change; if not, add one
Add na.rm=TRUE to all transformations and aggregations that require it
[x] Check that aggregating numerical arrays that contain NAs does not produce NAs.
Add dataframes for categorical encodings
Update docs
Run tests in CloudOS
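The na.rm=TRUE task and the aggregation check can be illustrated with a minimal R sketch. The data and the `safe_aggregate` helper are hypothetical and are not taken from this PR. The sketch shows one pitfall worth covering in the test case: with `na.rm = TRUE`, an input that is entirely NA gives `NaN` for `mean`, so the helper returns NA explicitly in that case.

```r
# Hypothetical per-participant measurements; p2 has no valid value at all.
measurements <- list(
  p1 = c(1.5, NA, 2.5),
  p2 = c(NA_real_, NA_real_)
)

# Without na.rm, a single NA makes the whole aggregate NA.
without_na_rm <- sapply(measurements, mean)

# With na.rm = TRUE, partial NAs are ignored, but an all-NA input yields NaN.
with_na_rm <- sapply(measurements, mean, na.rm = TRUE)

# Hypothetical helper (an assumption, not the PR's code): aggregate with
# na.rm = TRUE and return NA when no valid value is left.
safe_aggregate <- function(x, fun = mean) {
  if (all(is.na(x))) {
    return(NA_real_)
  }
  fun(x, na.rm = TRUE)
}

aggregated <- sapply(measurements, safe_aggregate)
print(aggregated)
```

Whether an all-NA input should stay NA or be handled differently is a design choice; keeping it as NA matches the approach of leaving continuous values as NA for the downstream pipeline.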
Comments:
After review, this is the strategy to be used:
For continuous variables, leave missing values as NA, so that each pipeline of interest can apply its own specific transformation when it encounters them: different software handles NAs in different ways.
For categorical variables, fill missing values with a class called Unknown and encode each class as a number; the mapping is stored in a .json file. If any filtering is needed, the pipeline of interest can use this file, provided it is plugged into the CB integration process.
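The categorical strategy can be sketched in R as follows. The column name, the values, the output file name, and the use of `jsonlite` are all assumptions made for illustration; the PR's actual encoding code may differ.

```r
# Hypothetical categorical column with missing entries.
smoking_status <- c("never", "current", NA, "former", NA)

# Replace NAs with an explicit "Unknown" class.
filled <- ifelse(is.na(smoking_status), "Unknown", smoking_status)

# Build the encoding dataframe: one integer code per class.
classes <- sort(unique(filled))
encoding <- data.frame(
  class = classes,
  code = seq_along(classes),
  stringsAsFactors = FALSE
)

# Encode the column by looking each value up in the dataframe.
encoded <- encoding$code[match(filled, encoding$class)]

# Store the mapping as JSON when jsonlite is installed
# (the file name and location are hypothetical).
if (requireNamespace("jsonlite", quietly = TRUE)) {
  jsonlite::write_json(
    encoding,
    file.path(tempdir(), "smoking_status_encoding.json")
  )
}

# Downstream filtering: look up the code for "Unknown" and drop those rows.
unknown_code <- encoding$code[encoding$class == "Unknown"]
keep <- encoded != unknown_code
```

Because the code for Unknown is read from the mapping rather than hard-coded, a downstream pipeline only needs the stored file to filter these rows out.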
Test in Lifebit CloudOS:
https://cloudos.lifebit.ai/app/jobs/5f901f038f81710113120819