univalence / spark-tools

https://univalence.github.io/spark-tools/
Apache License 2.0

[spark-test] cache dataset when tests are done #11

Closed ahoy-jon closed 5 years ago

ahoy-jon commented 5 years ago

Source: https://slides.com/nastasiasaby/spark-conseils#/35 (@NastasiaSaby)

spark-test can be improved by limiting the number of Spark actions each test triggers.

Where possible, we can automatically cache datasets/dataframes/RDDs to speed up the tests, so that

val result: Dataset[T] = ???
result.assertContains(expected1, expected2, expected3, ... )
assert(result.count == 3)

does not recompute the dataset for each action.
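As a rough illustration of the idea, here is a minimal sketch in plain Spark, assuming a local session and a small `Dataset[Int]`. It does not use spark-test's own `assertContains` (whose exact behaviour is not shown in this thread); the containment check is written by hand with `collect()`, and the object name `CacheBeforeAssertions` is hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical sketch: cache once, then run several actions on the same dataset.
object CacheBeforeAssertions {
  def run(spark: SparkSession): (Set[Int], Long) = {
    import spark.implicits._

    // Stand-in for the dataset under test.
    val result: Dataset[Int] = Seq(1, 2, 3).toDS().map(_ * 2)

    // Mark the dataset as cached before the first action.
    result.cache()

    // First action: computes the dataset and fills the cache.
    val collected = result.collect().toSet

    // Second action: can be served from the cache instead of recomputing the lineage.
    val n = result.count()

    result.unpersist()
    (collected, n)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cache-before-assertions")
      .getOrCreate()
    try {
      val (collected, n) = run(spark)
      assert(Set(2, 4, 6).subsetOf(collected))
      assert(n == 3)
    } finally spark.stop()
  }
}
```

Without the `cache()` call, each of the two actions would evaluate the `map` over the source data again.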

ahoy-jon commented 5 years ago

The datasets and dataframes are now cached. However, we now get warnings.

We need to develop a procedure:

def cacheIfNotCached(dataset: Dataset[_]): Unit

to clear the warnings:

19/05/26 17:54:06 WARN CacheManager: Asked to cache already cached data.
19/05/26 17:54:07 WARN CacheManager: Asked to cache already cached data.
19/05/26 17:54:07 WARN CacheManager: Asked to cache already cached data.
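
One possible shape for such a procedure, as a sketch only and not necessarily what was integrated in spark-test: it relies on `Dataset.storageLevel` (available since Spark 2.1), which reports `StorageLevel.NONE` when the dataset has not been persisted. The wrapper object name `CacheIfNotCachedSketch` is hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch of the procedure named above.
object CacheIfNotCachedSketch {
  // Only call cache() when the dataset is not persisted yet,
  // so the cache is never asked to store already cached data.
  def cacheIfNotCached(dataset: Dataset[_]): Unit =
    if (dataset.storageLevel == StorageLevel.NONE) dataset.cache()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cache-if-not-cached")
      .getOrCreate()
    try {
      val ds = spark.range(3)
      cacheIfNotCached(ds) // persists the dataset
      cacheIfNotCached(ds) // no-op: the dataset is already persisted
      assert(ds.storageLevel != StorageLevel.NONE)
      assert(ds.count() == 3)
    } finally spark.stop()
  }
}
```

The check is on the persisted state of the dataset rather than on a flag kept by the test helper, so it also covers datasets that the code under test has already cached itself.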

ahoy-jon commented 5 years ago

The feature is now integrated.