The SparkContext is created with 8 threads on the local machine (I guess I had a CPU with 8 cores at the time). It therefore makes sense to split the dataset into 8 parts (or 16, 24, etc.) so they are processed in parallel, each in its own dedicated thread.
# Creating a local SparkContext with 8 threads and an SQLContext based on it
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext(master='local[8]')
sqlContext = SQLContext(sc)

# Function to load a comma-separated dataset and divide it into 8 partitions
def load_dataset(path):
    # Read the file with a minimum of 8 partitions, splitting each line on commas
    dataset_rdd = sc.textFile(path, 8).map(lambda line: line.split(','))
    return dataset_rdd
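A minimal usage sketch (the file path is a hypothetical example, not from the original post); calling getNumPartitions() on the result confirms the split:

# Hypothetical path, for illustration only
rdd = load_dataset('data/sample.csv')
print(rdd.getNumPartitions())  # at least 8, since 8 is the minimum partition count

Note that the second argument to textFile is a minimum, so Spark may create more than 8 partitions for a large file; each partition is then processed by one of the 8 local threads.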