Handling incompleteness in the GTP dataset

liamgd commented 1 year ago

The raw GTP dataset includes many NA values. After processing with gtp.py, these appear as empty fields (,,) in the M.csv and G.csv files in the data directory. For M.csv, about 1 in 481 values is NA, and about 23% of rows contain at least one NA value. The current regression implementations throw an error on these missing values. How should this be mitigated?

1. Modify the regression algorithm to include only the samples that have valid values. This could greatly reduce performance.
2. Assume the user only provides complete data.
3. Estimate NA values.
4. Drop rows with any NA values (about 23% of rows would be removed from M.csv and about 31% from G.csv).
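
For reference, the two kinds of figures quoted above (share of NA values and share of rows with at least one NA) can be measured along these lines. This is a minimal sketch with pandas on a made-up four-row table; the column and row names are hypothetical, and the real M.csv layout may differ.

```python
import io

import pandas as pd

# Toy stand-in for M.csv (hypothetical names and values).
# The empty fields (,,) mimic the NA values left after preprocessing.
csv_text = """id,s1,s2,s3
cg1,0.1,,0.3
cg2,0.4,0.5,0.6
cg3,0.7,0.8,
cg4,0.2,0.9,0.5
"""

# pandas parses empty CSV fields as NaN by default.
M = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Fraction of individual values that are NA.
na_value_fraction = M.isna().to_numpy().mean()

# Fraction of rows that contain at least one NA.
na_row_fraction = M.isna().any(axis=1).mean()

print(f"NA values: {na_value_fraction:.3f}, rows with NA: {na_row_fraction:.3f}")
```

On the real files, the same two expressions would give the numbers needed to judge how much data the fourth option discards.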

I suggest the fourth option, but I would like to know what is expected of this software. In a previous commit, I dropped NA values from G.csv but not from M.csv, presumably because that was not needed for the Pearson correlation coefficient algorithms.

liamgd commented 1 year ago

For now, I dropped the NA values from the G and M dataframes so the regression algorithms can be tested with the GTP dataset. Let me know if this should be the intended behavior or if the regression algorithms should deal with NA values.
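
A rough sketch of what dropping incomplete rows looks like with pandas follows. The helper name drop_incomplete_rows and the toy data are hypothetical and are not taken from the repository; the actual change may be organized differently.

```python
import io

import pandas as pd


def drop_incomplete_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without any row that has at least one NA.

    Hypothetical helper for illustration only.
    """
    return df.dropna(axis=0, how="any")


# Toy stand-in for G.csv (made-up names and values).
csv_text = """id,s1,s2,s3
g1,1.0,,3.0
g2,4.0,5.0,6.0
g3,,8.0,9.0
g4,2.0,2.5,3.5
"""
G = pd.read_csv(io.StringIO(csv_text), index_col=0)

G_complete = drop_incomplete_rows(G)
rows_dropped = len(G) - len(G_complete)

print(f"Dropped {rows_dropped} of {len(G)} rows")
```

Reporting the number of dropped rows makes the data loss visible to the user instead of silently shrinking the input.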

kordk commented 1 year ago

Dropping rows with NA is fine. Others have implemented imputation, but that can be a feature in a future version or be handled entirely by the user upstream.
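
As an illustration of the upstream route, a user could fill in NA values before handing the data to the tool. The sketch below uses simple row-mean imputation with pandas. The choice of row means, the helper name impute_row_mean, and the toy data are all assumptions made for this example, not something the project prescribes.

```python
import io

import pandas as pd


def impute_row_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Replace each NA with the mean of the non-NA values in its row.

    Hypothetical helper. A row that is entirely NA stays NA,
    because it has no observed values to average.
    """
    return df.apply(lambda row: row.fillna(row.mean()), axis=1)


# Toy stand-in for M.csv (made-up names and values).
csv_text = """id,s1,s2,s3
cg1,0.1,,0.3
cg2,0.4,0.5,0.6
cg3,0.7,0.8,
"""
M = pd.read_csv(io.StringIO(csv_text), index_col=0)

M_imputed = impute_row_mean(M)
print(M_imputed)
```

Mean imputation is only the simplest option. Whether it is appropriate depends on the downstream analysis, which is one reason to leave the choice to the user.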

liamgd commented 1 year ago

Ok, implemented in 8ff9355.

kordk / torch-ecpg

Handling incompleteness in the GTP dataset #21