tugraz-isds / systemds

An open source ML system for the end-to-end data science lifecycle
Apache License 2.0

MICE Nominal - imputation of numeric and categorical data #116

Closed: Shafaq-Siddiqi closed 4 years ago

Shafaq-Siddiqi commented 4 years ago

MICE Nominal for imputing categorical and numerical data. The Spark test is not included because of an error in the cbind operation on frames (debugging in progress).

Shafaq-Siddiqi commented 4 years ago

The Spark test is commented out because the frame append (cbind) operation needs fixes in the Spark context. The error can be reproduced by running the commented-out Spark test.

corepointer commented 4 years ago

I looked at the bug you mentioned. I didn't find where to fix it, but I'll leave some more information for whoever picks it up:

I reduced the DML that produces the bug to:

F = read($X, data_type="frame", format="csv")
A = cbind(F, as.frame(matrix(1, nrow(F), 1)))
print(toString(A))

To reproduce the error, uncomment the Spark test invocation in BuiltinMiceTest.java:49 and replace the content of src/test/scripts/functions/builtin/mice.dml with the three lines of DML above.

Upon running the test, the check for frame block dimensions in FrameBlock.java:1002 will now fail with org.tugraz.sysds.runtime.DMLRuntimeException: Incompatible number of rows for cbind: 98 (expected: 49)

So the frame block is split and the column to append is not, which results in the dimension mismatch. This is as far as I got; I didn't find where the split happens. I tried specifying dimensions explicitly in the read() function (that gave other errors, which I'll investigate another time) and in an MTD file, but neither helped. Furthermore, the problem seems to occur only with "real" frame data, not with matrices converted to frames with as.frame().
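To make the mismatch concrete, here is a minimal Java sketch of the kind of row-count check that fails. It is not the SystemDS code: the class CbindRowCheck and the method cbindRows are hypothetical, and the assumption that the 98-row frame is split into two equal 49-row blocks is only inferred from the numbers in the error message.

```java
public class CbindRowCheck {
    // Hypothetical stand-in for a cbind row-count check; not the actual
    // FrameBlock implementation. Returns the row count if both sides agree.
    public static int cbindRows(int leftRows, int rightRows) {
        if (leftRows != rightRows) {
            throw new IllegalStateException(
                "Incompatible number of rows for cbind: " + rightRows
                + " (expected: " + leftRows + ")");
        }
        return leftRows;
    }

    public static void main(String[] args) {
        int totalRows = 98;
        // Assumption: the frame read from CSV ends up split into two equal blocks.
        int partitions = 2;
        int leftBlockRows = totalRows / partitions; // 49 rows per block

        // Unsplit frame and unsplit column: the check passes.
        System.out.println(cbindRows(totalRows, totalRows));

        // Split frame block (49 rows) against the unsplit column (98 rows): the check fails.
        try {
            cbindRows(leftBlockRows, totalRows);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions, the second call reproduces the shape of the reported message: a 49-row block is checked against a 98-row column that was never split.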