irvingc / dbscan-on-spark

An implementation of DBSCAN running on top of Apache Spark
Apache License 2.0
183 stars 58 forks

The result is different from sklearn DBSCAN? #19

Open Iamshg opened 3 years ago

Iamshg commented 3 years ago

The clustering result of dbscan-on-spark does not seem right. I used sklearn to generate 100 data points and clustered them with sklearn's DBSCAN; the result differs from that of dbscan-on-spark. My experimental data and results are below. Is there something wrong with what I did, or with the dbscan-on-spark code? Here are the 100 points: the first and second columns are x and y, and the third is the cluster id output by dbscan-on-spark with parameters maxPointsPerPartition = 12, eps = 0.3, minPoints = 10.

-1.0158802720216003,0.147342211484412,0
-1.0146038471276722,0.03688725664273079,0
-0.9891239738704828,0.022213809184072942,0
-0.9638043562674901,0.19370487762607208,0
-0.9528948845748842,0.2606655111008295,0
-0.9333963520783779,0.3087590886721549,2
-0.9075869251302896,0.41694606375014176,2
-0.8897644802049538,0.3652758661784908,2
-0.880180077035953,0.47937007252773894,2
-0.877911880574905,0.586132545648214,2
-0.7753025837331028,0.6042502634413288,2
-0.7707474677539745,0.6849960870555312,2
-0.7311006936368126,0.7227208343955539,2
-0.6983566813753926,0.720816991846437,2
-0.5938859100499629,0.8422336677663952,2
-0.5846965684242103,0.7441899166472538,2
-0.5343754904916749,0.8321019363347932,2
-0.47424038807062,0.8758860679090658,2
-0.3878784206919627,0.915309044770749,2
-0.3289511302371282,0.9573149852131584,2
-0.2959324774808664,0.9325603737985263,2
-0.21870067123075473,0.9970866099086676,2
-0.11691604427847285,0.9778576019025226,2
-0.10195211460944066,0.9709550110857252,2
-0.029517735653622493,0.507649328605902,0
-0.023708638395017974,1.0069445660821037,2
-0.005849731617721753,1.0054354462743158,2
0.004878608148313203,0.33471329223760526,0
0.013060570355586543,0.43901096175191556,0
0.013738246924096587,0.24394131643956118,0
0.015089734435828273,0.3515797069345012,0
0.047196112670520006,0.2206600035440831,0
0.05576443581177526,0.15151041876834784,4
0.07973418606854518,0.054405388341981394,4
0.11072653052995088,0.9518445839926504,2
0.14329032472059547,0.9604927564401988,2
0.15054141777418886,0.03451008884781809,4
0.16128620811774508,-0.09013998774134449,4
0.18846566254628738,-0.05019026114410958,4
0.20397463243581263,0.9737643345398193,2
0.22359870290598133,-0.15409491439129092,0
0.2751214191247985,0.9347761383211116,2
0.287593102624648,-0.24638341424476687,0
0.36088367672771,-0.2411234194792873,3
0.37624208564171424,-0.27216645323731653,3
0.38265901863925844,0.9187882235965241,2
0.3887195306260235,0.9033706015027824,2
0.40914260501491323,-0.3033592048800921,3
0.43346670763470607,0.8748254472162161,2
0.4756292725872716,-0.35619531114932845,3
0.496985556291109,0.8394683116286269,2
0.5240072940588782,-0.415381768593055,3
0.5482086350246445,0.7973898622196185,2
0.5790022726885077,-0.43888757836793596,3
0.6062194604607748,0.7624055296463416,2
0.6168393841819796,-0.4254723492171427,3
0.6776865344598341,0.7288515644281421,2
0.7167688763117702,0.6947432873340993,2
0.7476336573593888,-0.43956394127856613,3
0.7572920472149428,-0.436367630296194,3
0.7740957194369665,0.6125896453298072,2
0.8011753211137118,0.57376379359662,2
0.8417223438915941,0.5526331446967239,2
0.8532880875726063,-0.49795719019316187,3
0.863385424874792,0.4747468867223078,2
0.880275606232329,-0.4927970362051949,3
0.9032177435804262,0.40648142738558235,2
0.9394614201995631,0.37705763226924555,2
0.9446468417742363,0.18276454590281604,2
0.974681291427821,0.04097122039961243,0
0.9767100737539655,-0.02241667025607359,0
0.981331751323225,0.31018095161038134,2
0.9895718078881754,0.1459207265521259,2
0.9924132799217354,0.2407697250631794,2
1.0185336471622448,-0.5338521320792655,3
1.0284991718568057,-0.5059328635382044,3
1.0737327870386202,-0.5064573873771955,3
1.155279215360569,-0.49812019299551963,3
1.2290952646189792,-0.46969221135524847,3
1.2916449149878702,-0.4412009995446211,3
1.3291052136060821,-0.44642696791708286,3
1.4442640860583817,-0.4319727123957646,3
1.4632640963338075,-0.3945052322897577,3
1.476901232024859,-0.3357710068006996,3
1.5790424791153852,-0.2927273617730516,1
1.6839521844358782,-0.2766017069970544,1
1.7288137825879888,-0.24294452147350262,1
1.7460881272956796,-0.19290929215511726,1
1.7596749110664895,-0.1514486544693371,1
1.7891687192218997,-0.10216824559688067,1
1.8414791557100734,0.007705154849708373,1
1.87128705469536,-0.024324243892344288,1
1.8961568159672442,0.14023128237704588,1
1.9140044814018815,0.025879420371462798,1
1.9458748140234137,0.1348425674970267,1
1.9629414250510497,0.2391046347601252,1
1.9968692003241,0.525507288705618,1
2.0040410445424053,0.3872492455105901,1
2.012481427686388,0.4616919308479025,1
2.013488317583532,0.2822368891808566,1

Plot the dbscan-on-spark result:

import matplotlib.pyplot as plt
import numpy as np
c = 'C:\\Users\\iamshg\\IdeaProjects\\dbscan-on-spark-master\\src\\test\\resources\\res.txt\\res.csv'
x, y, cluster_id = np.loadtxt(c, delimiter=',', usecols=(0, 1,2),unpack=True)
plt.scatter(x, y, c=cluster_id)
plt.show()

image

Plot the sklearn DBSCAN result:

from sklearn.cluster import DBSCAN
# Reuses np, plt, x and y from the previous snippet.
X = np.column_stack((x, y))  # shape (100, 2): one row per point
y_pred = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)  # label -1 marks noise
plt.scatter(x, y, c=y_pred)
plt.show()

image

So there are some differences between the two images. Is there something wrong with what I did, or with the code?

ICDI0906 commented 2 years ago

Is there any randomness involved, such as a seed?

ChinaYiqun commented 2 years ago

The DBSCAN algorithm is deterministic, always generating the same clusters when given the same data in the same order. However, the results can differ when data is provided in a different order. First, even though the core samples will always be assigned to the same clusters, the labels of those clusters will depend on the order in which those samples are encountered in the data. Second and more importantly, the clusters to which non-core samples are assigned can differ depending on the data order. This would happen when a non-core sample has a distance lower than eps to two core samples in different clusters. By the triangular inequality, those two core samples must be more distant than eps from each other, or they would be in the same cluster. The non-core sample is assigned to whichever cluster is generated first in a pass through the data, and so the results will depend on the data ordering.
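
To make the ordering effect concrete, here is a minimal pure-Python sketch. It is not the sklearn or dbscan-on-spark implementation: `dbscan_1d`, `region_query`, and the nine 1-D sample points are hypothetical, chosen only so that one non-core point (2.0) lies within eps of core points in two separate clusters. The sketch assumes, as sklearn does, that a point counts itself toward `min_samples`.

```python
def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]


def dbscan_1d(points, eps, min_samples):
    """Simplified DBSCAN on 1-D data (illustrative only).

    Returns one label per point; -1 means noise.
    """
    neighbors = [region_query(points, i, eps) for i in range(len(points))]
    is_core = [len(n) >= min_samples for n in neighbors]
    labels = [-1] * len(points)
    next_label = 0
    for i in range(len(points)):
        # Only an unlabelled core point can start a new cluster.
        if labels[i] != -1 or not is_core[i]:
            continue
        stack = [i]
        while stack:
            j = stack.pop()
            if labels[j] != -1:
                continue  # already claimed by an earlier expansion
            labels[j] = next_label
            # Core points keep expanding; border points are labelled but
            # never expanded, so they stay with whichever cluster reached
            # them first.
            if is_core[j]:
                stack.extend(k for k in neighbors[j] if labels[k] == -1)
        next_label += 1
    return labels


# Hypothetical data: two dense groups plus one border point between them.
cluster_a = [0.2, 0.5, 0.8, 1.1]
border = 2.0  # within eps of 1.1 and of 2.9, but has only 3 neighbours
cluster_b = [2.9, 3.2, 3.5, 3.8]

forward = cluster_a + [border] + cluster_b
backward = list(reversed(forward))

# Map each coordinate to its label so the two runs can be compared by point.
labels_fwd = dict(zip(forward, dbscan_1d(forward, eps=1.0, min_samples=4)))
labels_bwd = dict(zip(backward, dbscan_1d(backward, eps=1.0, min_samples=4)))

print("forward :", labels_fwd)
print("backward:", labels_bwd)
```

In the forward order the border point ends up in the same cluster as 0.2; in the reversed order it ends up in the same cluster as 3.8. The core points are grouped identically in both runs, although the numeric label of each group also swaps. So when comparing two DBSCAN outputs, it helps to compare the grouping of core points rather than the raw label ids.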