szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).

5-spark.txt: spark-train-10m.csv #39

Closed xhudik closed 8 years ago

xhudik commented 8 years ago

The file 1-linear/5-spark.txt contains these lines:

    val d_train = load("spark-train-10m.csv").repartition(32).cache()
    val d_test = load("spark-test-10m.csv").repartition(32).cache()

However, those files are not created anywhere, including in 0-init. Shouldn't the code load train-10m.csv and test.csv instead? (Those files are created in 0-init.)

yinxusen commented 8 years ago

Find them here: https://github.com/szilard/benchm-ml/blob/master/z-other-tools/5xa-spark-1hot.txt#L17

szilard commented 8 years ago

@xhudik: Yeah, that was the original way of using Spark, doing the 1-hot encoding somewhere else (in R) and then reading that into Spark (spark-train-N... and spark-test-N...).

However, the Databricks guys helped and gave me https://github.com/szilard/benchm-ml/blob/master/z-other-tools/5xa-spark-1hot.txt (the code @yinxusen links to above).
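
For readers unfamiliar with the 1-hot encoding step discussed in this thread, here is a minimal sketch of the idea in plain Scala, with no Spark dependency. It is not the code from 5xa-spark-1hot.txt or from the R preprocessing. The names `OneHotSketch`, `levels` and `oneHot` are hypothetical, and the sample values are made up.

```scala
// Hypothetical illustration only: NOT the code from 5xa-spark-1hot.txt.
object OneHotSketch {
  // Collect the distinct values of a categorical column in a stable (sorted) order.
  def levels(column: Seq[String]): Vector[String] =
    column.distinct.sorted.toVector

  // Encode one value as a 0/1 indicator vector with one slot per known level.
  // A value that is not among the levels becomes an all-zero vector.
  def oneHot(value: String, lv: Vector[String]): Vector[Double] =
    lv.map(l => if (l == value) 1.0 else 0.0)

  def main(args: Array[String]): Unit = {
    val column = Seq("AA", "UA", "AA", "DL") // made-up sample values
    val lv = levels(column)
    // Print each value next to its indicator vector.
    column.foreach(v => println(s"$v -> ${oneHot(v, lv).mkString(",")}"))
  }
}
```

In a sketch like this, the levels would be computed once and reused for both the training and the test data, so that the encoded columns line up between the two files.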