lifebit-ai / gwas

GWAS pipeline using SAIGE


Improves handling of missing data #63

Closed mcamarad closed 4 years ago

mcamarad commented 4 years ago

Summary:

Improve how NAs are handled to make data processing more robust and to avoid common pitfalls with missing values.

Improves documentation as well.

Tasks:

Review current strategy and agree next steps
Check that the test data contains a case that exercises this change; if not, add one
Adds na.rm=TRUE to all transformations and aggregations that require it
- [x] Check that aggregating numerical arrays that contain NAs does not produce NAs.
Adds dataframes for categorical encodings
Updates docs
Run tests in CloudOS
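
The `na.rm=TRUE` task and its check can be illustrated with a small sketch. It is written in Python with the standard library only, as an analogue of the R behaviour and not as the pipeline's actual code; the helper name `mean_na_rm` is hypothetical. It also shows the edge case the check is meant to catch: when every value is missing, dropping NAs leaves nothing to aggregate, and R's `mean(x, na.rm=TRUE)` returns `NaN` in that situation.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (a stand-in for R's NA)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mean_na_rm(values):
    """Mean of the non-missing entries, mimicking mean(x, na.rm=TRUE)."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        # Nothing left after dropping missing values: return NaN explicitly
        # so the caller can detect it instead of silently getting a number.
        return math.nan
    return sum(present) / len(present)


partial = mean_na_rm([1.0, None, 3.0])   # missing entry is ignored
all_missing = mean_na_rm([None, None])   # still not a usable number
```

Here `partial` is `2.0`, while `all_missing` is `NaN`, so an aggregation over a group with no observed values still needs an explicit check downstream.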

Comments:

After review, this is the strategy to be used:

For continuous variables, leave missing values as NA, so that the pipeline of interest can apply its own specific transformation when it encounters them: different software handles NAs in different ways.
For categorical variables, fill missing values with a class called Unknown and encode each class as a number; the encoding is stored in a .json file. If any filtering is needed, the pipeline of interest could use this file, provided it is plugged into the CB integration process.
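
A minimal sketch of the categorical part of this strategy, in Python with the standard library only. The function name `encode_categorical`, the example values, and the choice to number classes in sorted order are assumptions made for illustration; the PR itself does not specify how codes are assigned or what the .json layout looks like. Continuous columns are not touched by this step and would keep their missing values as they are.

```python
import json

UNKNOWN = "Unknown"  # class label named in the strategy above


def encode_categorical(values):
    """Fill missing entries with UNKNOWN and map each class to an integer.

    Returns the encoded values and the class-to-code mapping.
    """
    filled = [UNKNOWN if v is None else v for v in values]
    # Assumption: codes follow sorted class order so the mapping is
    # reproducible between runs.
    mapping = {label: code for code, label in enumerate(sorted(set(filled)))}
    codes = [mapping[v] for v in filled]
    return codes, mapping


codes, mapping = encode_categorical(["female", None, "male", "female"])

# Serialised form of the mapping, as it could be written to a .json file.
encoding_json = json.dumps(mapping, indent=2, sort_keys=True)

# A downstream pipeline could read the mapping back and filter out
# the Unknown class by its code.
unknown_code = json.loads(encoding_json)[UNKNOWN]
known_only = [c for c in codes if c != unknown_code]
```

Keeping the mapping next to the encoded data means a downstream tool does not need to know how the codes were produced; it only needs to look up the code for Unknown.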

Test in Lifebit CloudOS

https://cloudos.lifebit.ai/app/jobs/5f901f038f81710113120819

mcamarad commented 4 years ago

Thanks a lot! 🚀