Align all caching modules implemented in spark to rely on dataframes

openaire / iis

Information Inference Service of the OpenAIRE system

Apache License 2.0


Align all caching modules implemented in spark to rely on dataframes #1130

Open marekhorst opened 4 years ago

marekhorst commented 4 years ago

Some of the caching solutions currently implemented in Spark, namely CachedWebCrawlerJob and PatentMetadataRetrieverJob, still rely on RDDs. They could take advantage of the full potential of Spark 2 DataFrames, as was already done for TARA caching (CachedTaraReferenceExtractionJob).
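To illustrate the kind of change this implies, here is a minimal sketch of a DataFrame-based cache lookup. It is not taken from CachedTaraReferenceExtractionJob or any other IIS module: the class name, the single `id` column and the anti-join approach are all assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical sketch, not the actual IIS implementation.
public class CacheLookupSketch {

    // Assumed schema: a single "id" column identifying a cacheable record.
    static final StructType SCHEMA = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("id", DataTypes.StringType, false)
    });

    // Keeps only the input rows whose id is absent from the cache,
    // i.e. the records that would still have to be processed.
    static Dataset<Row> selectUncached(Dataset<Row> input, Dataset<Row> cache) {
        return input.join(cache, input.col("id").equalTo(cache.col("id")), "left_anti");
    }

    // Runs the lookup on a tiny in-memory example using a local Spark session.
    static List<String> uncachedIdsExample() {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("cache-lookup-sketch")
                .getOrCreate();
        try {
            Dataset<Row> input = spark.createDataFrame(Arrays.asList(
                    RowFactory.create("doc-1"),
                    RowFactory.create("doc-2"),
                    RowFactory.create("doc-3")), SCHEMA);
            Dataset<Row> cache = spark.createDataFrame(Arrays.asList(
                    RowFactory.create("doc-2")), SCHEMA);
            return selectUncached(input, cache)
                    .orderBy("id")
                    .as(Encoders.STRING())
                    .collectAsList();
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println(uncachedIdsExample());
    }
}
```

With an RDD-based job the same lookup would typically be expressed with key/value pairs and something like `subtractByKey`. The DataFrame variant lets the Catalyst optimizer plan the join, which is presumably where most of the expected gain would come from.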

marekhorst commented 4 years ago

It would be nice to run some benchmarks comparing the RDD-based solution with the DataFrame-based one.