Closed. xhudik closed this issue 8 years ago.
@xhudik: Yeah, that was the original way of using Spark: doing the 1-hot encoding elsewhere (in R) and then reading the result into Spark (the spark-train-N... and spark-test-N... files).
However, the Databricks guys helped and gave me https://github.com/szilard/benchm-ml/blob/master/z-other-tools/5xa-spark-1hot.txt (the code @yinxusen refs above).
The file 1-linear/5-spark.txt contains the lines:

val d_train = load("spark-train-10m.csv").repartition(32).cache()
val d_test = load("spark-test-10m.csv").repartition(32).cache()
However, those files are not created anywhere else (not in 0-init). I'm wondering: shouldn't it be train-10m.csv and test.csv instead? (Those files are created in 0-init.)
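For context, `load` in those scripts is a helper defined elsewhere in the repo, not a Spark built-in. A minimal sketch of what such a helper might look like on Spark 1.x with the spark-csv package (the name `load` and the exact options are assumptions, not the repo's actual definition):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical reimplementation of the load() helper used in 5-spark.txt.
// Assumes an existing SQLContext and the com.databricks:spark-csv package
// on the classpath (Spark 1.x era, matching the benchmark's time frame).
def load(sqlContext: SQLContext, path: String): DataFrame =
  sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")       // first line holds column names
    .option("inferSchema", "true")  // detect numeric vs string columns
    .load(path)

// Usage mirroring the quoted lines: read, repartition for parallelism,
// and cache so repeated passes (e.g. training iterations) reuse memory.
// val d_train = load(sqlContext, "spark-train-10m.csv").repartition(32).cache()
// val d_test  = load(sqlContext, "spark-test-10m.csv").repartition(32).cache()
```

Whatever the helper actually does, the question about the file names stands: the loader only reads whatever path it is given, so the paths must match files produced by an earlier step.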