Closed barrybecker4 closed 8 years ago
I can't work out where to put the cache call so as to avoid the warning. I tried calling cache() the RDD[Row] object passed to sqlContext.createDataFrame in TestHelper, I tried calling cache() on the resulting dataframe, I tried calling cache on the processed dataframe passed to fit(), but all attempts still give the warning.
I read [this post|http://stackoverflow.com/questions/35253990/how-to-pass-params-to-a-ml-pipeline-fit-method] which had this odd exchange at the very bottom:
Thanks @zero323, i am also getting same warning even after i cached my Dataframe. How can i resolve this? – Kaushal Mar 28 at 14:26 @Kaushal AFAIK you don't. Caching input DF should be enough even if there is some overhead. – zero323 Mar 28 at 14:37
Interestingly, the only way I have found to avoid the warning is to add a call to cache() on the input when it is generated inside the call to fit(dataset) in MDLPDiscretizer like this:
val input = dataset.select($(labelCol), $(inputCol)).map {
case Row(label: Double, features: Vector) =>
LabeledPoint(label, features)
}.cache()
Is that the correct fix? When I do that, all the unit tests run without giving the warning. I initially thought they ran faster, but there actually does not appear to be much difference. Is there any other way to avoid the warning?
@barrybecker4 Regarding "odd exchange" as far as I remember the issue was ML specific. Internally data is converted from DataFrame to RDD, so even if we cache data before fitting, MLlib code receives that is not cached and complains. I am not sure if it is applicable in your case.
I agree with @zero323. It seems a problem specific from ML. Before ML, I usually cached RDD in the test code and I didn't get any warning related to this problem. So, we can adopt the solution proposed by @barrybecker4, which does not imply a modification of "MDLPdiscretizer".
Currently there is a warning shown repeatedly when running the unit tests:
@sramires wrote: The client is supposed to persist the data as he/she wants to. Most of methods have reviewed in MLlib use this message to warn about data persistence. I think it's good to follow this solution. Therefore, it'd be nice to include a .cache() call in the unit tests because the algorithm is designed to be iterative in some parts.
Including a cache() call should eliminate the warnings.