thinline72 / nsl-kdd

PySpark solution to the NSL-KDD dataset: https://www.unb.ca/cic/datasets/nsl.html
Apache License 2.0

why divide the dataset into 8 parts? #9

Closed — fatenlouati closed this issue 2 years ago

fatenlouati commented 2 years ago

```python
# Function to load dataset and divide it into 8 partitions
def load_dataset(path):
    dataset_rdd = sc.textFile(path, 8).map(lambda line: line.split(','))
```

thinline72 commented 2 years ago

SparkContext is created with 8 threads on the local machine (I guess I had a CPU with 8 cores at that moment). Thus it makes sense to split the dataset into 8 parts (or 16, 24, etc.) so they are processed in parallel, each in its own dedicated thread.

```python
# Creating local SparkContext with 8 threads and SQLContext based on it
sc = pyspark.SparkContext(master='local[8]')
```
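The underlying idea — matching the number of data partitions to the number of worker threads so every thread stays busy — can be sketched without Spark using only the standard library. This is a hedged analogy, not the repo's code: `process` is a hypothetical stand-in for the per-partition work that `map(lambda line: line.split(','))` would do.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(80))  # stand-in for the lines of the dataset file
n_parts = 8             # mirrors the 8 in sc.textFile(path, 8)

# Split the data into 8 contiguous partitions
size = len(data) // n_parts
parts = [data[i * size:(i + 1) * size] for i in range(n_parts)]

def process(part):
    # Hypothetical per-partition work; in the repo this would be
    # parsing each CSV line via line.split(',')
    return sum(part)

# 8 worker threads, one per partition, mirroring local[8]
with ThreadPoolExecutor(max_workers=n_parts) as ex:
    results = list(ex.map(process, parts))

print(sum(results))  # combined result over all partitions
```

With fewer partitions than threads, some threads would sit idle; with a multiple of 8 (16, 24, ...), each thread simply processes several partitions in turn, which is why those counts also work well.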