szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
MIT License

mxnet sparse data format #30

Open szilard opened 8 years ago

szilard commented 8 years ago

Motivation: I can't run mxnet on the 10M-record airline set (https://github.com/szilard/benchm-ml/issues/29) because model.matrix runs out of RAM (on a g2.8xlarge with 60GB of RAM, the largest available for GPU instances).

Using Matrix::sparse.model.matrix to encode the categorical data would be great (uses <2GB RAM), but I get:

Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Strangely, on the 1M dataset I get a different error:

Error: io.cc:50: Seems X, y was passed in a Row major way, MXNetR adopts a column major convention.
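The memory figures above can be sanity-checked with some rough arithmetic. This is only a sketch: the thread does not say how many columns the one-hot encoded matrix has or how many nonzeros each row carries, so `N_COLS` and `NNZ_PER_ROW` below are hypothetical values chosen for illustration. It assumes 8-byte doubles and 4-byte integer indices in a compressed sparse column layout. The last check reflects one plausible reading of the Cholmod error, namely that something coerces the sparse matrix to a dense one whose element count no longer fits in a 32-bit index; that is an assumption, not something confirmed in the thread.

```python
# Back-of-the-envelope memory estimate for a one-hot encoded design matrix.
# N_COLS and NNZ_PER_ROW are hypothetical; the thread does not state them.

INT32_MAX = 2**31 - 1  # largest element count addressable with a 32-bit index


def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Dense matrix of doubles: every cell is stored, zero or not."""
    return n_rows * n_cols * bytes_per_value


def sparse_csc_bytes(n_rows, n_cols, nnz_per_row,
                     bytes_per_value=8, bytes_per_index=4):
    """Compressed sparse column layout: one value and one row index per
    nonzero, plus n_cols + 1 column pointers."""
    nnz = n_rows * nnz_per_row
    return nnz * (bytes_per_value + bytes_per_index) + (n_cols + 1) * bytes_per_index


def exceeds_int32(n_rows, n_cols):
    """True if a dense copy would hold more elements than a 32-bit index allows."""
    return n_rows * n_cols > INT32_MAX


N_ROWS = 10_000_000  # the 10M-record set mentioned in the thread
N_COLS = 700         # hypothetical number of columns after one-hot encoding
NNZ_PER_ROW = 8      # hypothetical: one nonzero per original feature

dense = dense_bytes(N_ROWS, N_COLS)
sparse = sparse_csc_bytes(N_ROWS, N_COLS, NNZ_PER_ROW)

print(f"dense:  {dense / 1e9:.1f} GB")
print(f"sparse: {sparse / 1e9:.2f} GB")
print(f"dense element count exceeds int32: {exceeds_int32(N_ROWS, N_COLS)}")
```

With these assumed sizes the dense matrix lands in the tens of gigabytes while the sparse one stays under 2GB, which is at least consistent with the behaviour reported here. Different column counts would shift the numbers but not the overall gap.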

szilard commented 8 years ago

@tqchen @hetong007 Is a sparse representation on the roadmap? See the thread above. (I know mxnet is very new, and I have to say it already looks pretty great.)

tqchen commented 8 years ago

Yes, this is something we should look into. Could you open an issue over at https://github.com/dmlc/mxnet/issues too? Thanks.

szilard commented 8 years ago

Cool, I'll do it soon.