thinline72 / nsl-kdd

PySpark solution to the NSL-KDD dataset: https://www.unb.ca/cic/datasets/nsl.html
Apache License 2.0

About Standardization in part 6 #5

Closed aiyanzidexiaotao closed 4 years ago

aiyanzidexiaotao commented 5 years ago

In chapter 6, you wrote: "Note that data is sparse, so it is reasonable to not subtract mean for avoiding violating sparsity." But the standardization code is "def standardizer(column): return ((col(column) - avg_dict[column])/std_dict[column]).alias(column)", which does subtract the mean. Should I subtract the mean or not?

thinline72 commented 5 years ago

The standardization formula assumes that you subtract the mean and divide by the standard deviation. But if you subtract the mean from sparse vectors, they are converted to dense vectors (the implicit zeros generally become nonzero values), which might require noticeably more memory.
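
To make the trade-off concrete, here is a minimal pure-Python sketch, not the notebook's code. It assumes a toy representation where each row is a dict of {index: value} and missing indices are implicit zeros; the data and the helper names (scale_only, center_and_scale, stored_entries) are hypothetical, and population standard deviation is used for simplicity.

```python
from statistics import mean, pstdev

# Hypothetical toy data: each row is a sparse vector stored as {index: value};
# any index that is absent is an implicit zero.
rows = [
    {0: 3.0},
    {1: 4.0},
    {0: 6.0, 2: 1.0},
    {},
]
n_features = 3


def column(rows, j):
    """Return feature j for every row, with implicit zeros filled in."""
    return [r.get(j, 0.0) for r in rows]


# Per-feature statistics (population std is an assumption of this sketch).
means = [mean(column(rows, j)) for j in range(n_features)]
stds = [pstdev(column(rows, j)) for j in range(n_features)]


def scale_only(row):
    """Divide by std without centering: 0 / std == 0, so sparsity is kept."""
    return {j: v / stds[j] for j, v in row.items()}


def center_and_scale(row):
    """Full standardization: every index must be stored, since 0 - mean != 0."""
    return {j: (row.get(j, 0.0) - means[j]) / stds[j] for j in range(n_features)}


def stored_entries(vectors):
    """Count how many values are explicitly stored across all vectors."""
    return sum(len(v) for v in vectors)


scaled = [scale_only(r) for r in rows]
centered = [center_and_scale(r) for r in rows]

print("original stored entries:", stored_entries(rows))
print("scale-only stored entries:", stored_entries(scaled))
print("center-and-scale stored entries:", stored_entries(centered))
```

With scaling only, the number of stored entries is unchanged, while centering forces every position of every row to be stored. In PySpark, the same choice is exposed by pyspark.ml.feature.StandardScaler through its withMean and withStd parameters, where withMean defaults to False.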