sramirez / spark-MDLP-discretization

Spark implementation of Fayyad's discretizer based on Minimum Description Length Principle (MDLP)
Apache License 2.0

Include cache call in unit tests #18

Closed — barrybecker4 closed this issue 8 years ago

barrybecker4 commented 8 years ago

Currently there is a warning shown repeatedly when running the unit tests:

@sramirez wrote: The client is supposed to persist the data as he/she wants to. Most of the methods reviewed in MLlib use this message to warn about data persistence. I think it's good to follow this solution. Therefore, it'd be nice to include a .cache() call in the unit tests, because the algorithm is designed to be iterative in some parts.

Including a cache() call should eliminate the warnings.
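The client-side call being suggested might look like the following sketch. All names here (`sqlContext`, `rowRdd`, `schema`, `discretizer`) are illustrative stand-ins for whatever the test helper actually constructs, not code from this repository:

```scala
// Sketch only: persist the training data on the client side before
// fitting, as the warning suggests. Every name below is a hypothetical
// stand-in for what the test setup builds.
val df = sqlContext.createDataFrame(rowRdd, schema)
df.cache()                       // mark the input for reuse across iterative passes
val model = discretizer.fit(df)  // hypothetical fit call on the cached input
```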

barrybecker4 commented 8 years ago

I can't work out where to put the cache call so as to avoid the warning. I tried calling cache() on the RDD[Row] object passed to sqlContext.createDataFrame in TestHelper, calling cache() on the resulting DataFrame, and calling cache() on the processed DataFrame passed to fit(), but all attempts still give the warning.

I read [this post](http://stackoverflow.com/questions/35253990/how-to-pass-params-to-a-ml-pipeline-fit-method), which had this odd exchange at the very bottom:

> Thanks @zero323, i am also getting same warning even after i cached my Dataframe. How can i resolve this? – Kaushal Mar 28 at 14:26
> @Kaushal AFAIK you don't. Caching input DF should be enough even if there is some overhead. – zero323 Mar 28 at 14:37

barrybecker4 commented 8 years ago

Interestingly, the only way I have found to avoid the warning is to add a call to cache() on the input when it is generated inside the call to fit(dataset) in MDLPDiscretizer, like this:

val input = dataset.select($(labelCol), $(inputCol)).map {
  case Row(label: Double, features: Vector) =>
    LabeledPoint(label, features)
}.cache()

Is that the correct fix? When I do that, all the unit tests run without the warning. I initially thought they ran faster, but there does not actually appear to be much difference. Is there any other way to avoid the warning?

zero323 commented 8 years ago

@barrybecker4 Regarding the "odd exchange": as far as I remember, the issue was ML-specific. Internally the data is converted from a DataFrame to an RDD, so even if we cache the data before fitting, the MLlib code receives an RDD that is not cached and complains. I am not sure if it is applicable in your case.

sramirez commented 8 years ago

I agree with @zero323. It seems to be a problem specific to ML. Before ML, I usually cached RDDs in the test code and I didn't get any warning related to this problem. So we can adopt the solution proposed by @barrybecker4, which does not require modifying MDLPDiscretizer.
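The adopted approach, then, is the one shown in the earlier comment: cache the converted input inside fit(). The following sketch restates it with an optional unpersist() afterwards; the unpersist call and the `trainDiscretizer` name are illustrative additions, not part of the proposed patch:

```scala
// Sketch of the adopted fix: cache the RDD[LabeledPoint] that the
// discretizer's iterative passes reuse, then optionally release it.
// `trainDiscretizer` is a hypothetical stand-in for the training step;
// the unpersist() call is an optional extra, not part of the patch.
val input = dataset.select($(labelCol), $(inputCol)).map {
  case Row(label: Double, features: Vector) =>
    LabeledPoint(label, features)
}.cache()

val model = trainDiscretizer(input)  // iterative passes hit the cached data
input.unpersist()                    // free executor memory once training is done
```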