rsehgal / TomoML

Application of ML in Muon Tomography

Analyzing the PoCA point cloud to find outliers and clusters in it #1

Open rsehgal opened 5 years ago

rsehgal commented 5 years ago

The goal here is to process the point cloud generated after simulation. The first obvious step is to visualize the points in 2D and 3D, then write an outlier detection algorithm to remove the outliers, and finally try to find clusters.
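As a rough sketch of that first pass, the snippet below builds a synthetic stand-in for a PoCA point cloud and removes outliers with a simple 3-MAD statistical cut. Both the data and the cut are illustrative assumptions, not the project's actual simulation output or method:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a PoCA point cloud: one dense blob plus scattered outliers.
cloud = rng.normal(loc=(150.0, -150.0, 0.0), scale=20.0, size=(500, 3))
outliers = rng.uniform(-300.0, 300.0, size=(20, 3))
points = np.vstack([cloud, outliers])

# Simple statistical cut: drop points farther than 3 scaled median absolute
# deviations (MAD) from the per-axis median. The 1.4826 factor makes the MAD
# comparable to a standard deviation for Gaussian data.
med = np.median(points, axis=0)
mad = np.median(np.abs(points - med), axis=0)
keep = np.all(np.abs(points - med) <= 3.0 * 1.4826 * mad, axis=1)
filtered = points[keep]

print(points.shape, filtered.shape)
```

The filtered cloud can then be scatter-plotted in 2D/3D (e.g. with matplotlib) and fed to a clustering algorithm.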

Ani1211999 commented 5 years ago

Actually, the comparisons are not valid because I was referring to the earlier K-Means run. We may need to reapply K-Means and visualize it again.

rsehgal commented 5 years ago

@Ani1211999 and @apoorvabh98: may I ask both of you to prepare a paragraph on the pros and cons of different clustering and outlier detection algorithms?

Ani1211999 commented 5 years ago

3D visualization of the K-Means clusters (image: kmeans_clusters3)

Ani1211999 commented 5 years ago

Five-point summary of the four clusters (values rounded to four decimal places; the Labels column is constant within each cluster and folded into the headings):

Cluster 0 (Labels = 0, count = 18297):

| | X | Y | Z | Scat_Angle | doca |
|---|---|---|---|---|---|
| mean | 143.9038 | -142.4284 | -15.0587 | 0.0021 | 1.8908 |
| std | 61.7135 | 61.6120 | 58.5405 | 0.0572 | 3.3102 |
| min | 35.0104 | -263.2150 | -179.9940 | -0.3797 | 1.03e-05 |
| 25% | 88.3824 | -195.6650 | -57.8579 | -0.0166 | 0.1891 |
| 50% | 141.0860 | -138.8000 | -16.9166 | 0.0006 | 0.7104 |
| 75% | 197.5290 | -86.4695 | 25.1624 | 0.0182 | 2.1151 |
| max | 271.1310 | -35.8705 | 179.7730 | 0.4906 | 59.5967 |

Cluster 1 (Labels = 1, count = 18923):

| | X | Y | Z | Scat_Angle | doca |
|---|---|---|---|---|---|
| mean | -143.7516 | -142.7256 | -18.5429 | 0.0005 | 0.5346 |
| std | 60.7667 | 60.8349 | 56.9434 | 0.0225 | 1.0495 |
| min | -249.9960 | -250.0000 | -179.5270 | -0.2599 | 5.65e-06 |
| 25% | -196.3060 | -195.0685 | -59.3103 | -0.0048 | 0.0532 |
| 50% | -140.2040 | -139.7860 | -20.2849 | 0.0003 | 0.1903 |
| 75% | -89.2695 | -87.3995 | 19.8928 | 0.0049 | 0.5555 |
| max | -30.3811 | -22.5399 | 134.4150 | 0.3098 | 24.9675 |

Cluster 2 (Labels = 2, count = 19059):

| | X | Y | Z | Scat_Angle | doca |
|---|---|---|---|---|---|
| mean | 144.9100 | 143.6314 | -16.9636 | 0.0008 | 1.1105 |
| std | 61.4046 | 61.5566 | 57.6969 | 0.0396 | 2.0036 |
| min | 35.7974 | 14.8476 | -179.6240 | -0.3618 | 6.04e-07 |
| 25% | 89.8473 | 87.7481 | -58.6145 | -0.0101 | 0.1098 |
| 50% | 142.0400 | 140.8790 | -19.3320 | 0.0005 | 0.4088 |
| 75% | 198.5135 | 196.4250 | 22.6682 | 0.0108 | 1.2108 |
| max | 249.9940 | 249.9990 | 134.9680 | 0.3518 | 34.8091 |

Cluster 3 (Labels = 3, count = 18558):

| | X | Y | Z | Scat_Angle | doca |
|---|---|---|---|---|---|
| mean | -144.5259 | 143.0198 | -15.7006 | 0.0008 | 1.8583 |
| std | 61.8368 | 61.4212 | 57.8154 | 0.0563 | 3.2969 |
| min | -249.9940 | 16.9950 | -179.9320 | -0.4323 | 1.54e-05 |
| 25% | -198.6650 | 87.4301 | -57.2152 | -0.0177 | 0.1940 |
| 50% | -142.1385 | 139.8795 | -17.9028 | 0.0005 | 0.7224 |
| 75% | -88.1772 | 195.8575 | 24.7249 | 0.0180 | 2.0662 |
| max | -34.6987 | 249.9990 | 134.9580 | 0.3927 | 55.4882 |
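Per-cluster summaries like these are presumably produced with pandas; a minimal sketch on synthetic stand-in data (the column names match the tables above, the values do not):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical frame mimicking the PoCA columns; values are synthetic.
df = pd.DataFrame({
    "X": rng.normal(150, 60, 400),
    "Y": rng.normal(-150, 60, 400),
    "Z": rng.normal(-15, 58, 400),
    "Scat_Angle": rng.normal(0.002, 0.05, 400),
    "doca": np.abs(rng.normal(1.9, 3.3, 400)),
    "Labels": rng.integers(0, 4, 400),  # stand-in for K-Means cluster labels
})

# One five-point-style summary table per cluster, as in the comment above.
for label, grp in df.groupby("Labels"):
    print(f"Cluster {label}")
    print(grp.describe())
```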

Ani1211999 commented 5 years ago

K-Means visualization over the new dataset at a precision of 4 decimal places (image: kmeanscluster-4)

Ani1211999 commented 5 years ago

Mean scattering angles of the different K-Means clusters: 0.000810, 0.002708, 0.000785, 0.000504

Ani1211999 commented 5 years ago

Nice evaluation of classification algorithms (image: CLassification_ALgorithm_SElection)

Ani1211999 commented 5 years ago

During clustering, we first applied the K-Means algorithm to the filtered data. After visualizing the data, we observed four clear clusters, so we ran K-Means with n_clusters=4 to identify them. The algorithm returned satisfactory results. The simulation placed blocks with centers at ±150 on the X-Y axes, and K-Means returned centers in the range ±143 to ±144. Further refinement was not possible due to the presence of outliers: since cluster means are pulled even by a small number of outliers, we did not achieve further improvement in the clusters.
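A minimal sketch of that K-Means step, with four synthetic blobs at ±150 standing in for the blocks (the real pipeline presumably runs on the filtered PoCA frame, not this toy data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Four synthetic blobs centred at (+-150, +-150), standing in for the blocks.
true_centers = np.array([[150, -150], [-150, -150], [150, 150], [-150, 150]])
pts = np.vstack([rng.normal(c, 20.0, size=(300, 2)) for c in true_centers])

# n_clusters=4 matches the four clusters visible in the visualization.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pts)
print(np.sort(km.cluster_centers_[:, 0]))  # roughly two centers near -150, two near +150
```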

So we shifted to K-Medians, which operates like K-Means but uses medians as centroids instead of means. Since outliers cannot pull the median as far as the mean, K-Medians returned better results than K-Means, as we expected. Since our dataset is relatively small, the overhead of the extra sorting does not prevent the results from being produced in reasonable time. Centers were observed in the range ±145 to ±147 on the X-Y axes. In the second simulation, K-Medians was able to cluster similar material blocks at a precision of 3 decimal places, whereas K-Means required a precision of 4. Here are some of the results (values rounded to four decimal places):

K-Medians:

| | X | Y |
|---|---|---|
| count | 3527 | 3527 |
| mean | -145.6763 | -146.5144 |
| std | 61.9560 | 59.7474 |
| min | -298.0220 | -249.9470 |
| 25% | -199.2275 | -198.0340 |
| 50% | -143.3330 | -145.1620 |
| 75% | -89.8905 | -93.0989 |
| max | -21.2756 | -29.6545 |

K-Means:

| | X | Y |
|---|---|---|
| count | 18923 | 18923 |
| mean | -143.7516 | -142.7256 |
| min | -249.9960 | -250.0000 |
| 25% | -196.3060 | -195.0685 |
| 50% | -140.2040 | -139.7860 |
| 75% | -89.2695 | -87.3995 |
| max | -30.3811 | -22.5399 |

It can be clearly seen that K-Medians returns better results than K-Means.
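Since sklearn has no built-in K-Medians, a minimal Lloyd-style sketch is below. The Manhattan distance, the explicit initial centers, and the synthetic data are all illustrative choices, not necessarily what was used here; the point is that the median center shrugs off the extreme outlier that would drag a mean:

```python
import numpy as np

def k_medians(points, init_centers, n_iter=50):
    """Lloyd-style K-Medians: assign each point to the nearest center,
    then update each center to the per-coordinate median of its points."""
    centers = np.asarray(init_centers, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Manhattan (L1) distance pairs naturally with the median update.
        d = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            np.median(points[labels == j], axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(0)
# Two blobs at +-150 plus one extreme outlier that would drag a mean.
pts = np.vstack([
    rng.normal((150.0, 150.0), 10.0, size=(200, 2)),
    rng.normal((-150.0, -150.0), 10.0, size=(200, 2)),
    [[5000.0, 5000.0]],
])
centers, labels = k_medians(pts, init_centers=[[100.0, 100.0], [-100.0, -100.0]])
print(np.sort(centers[:, 0]))  # medians stay near +-150 despite the outlier
```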

Ani1211999 commented 5 years ago

KNN gave an accuracy of 60 percent, sir.

Ani1211999 commented 5 years ago

I am not understanding the basics of a ROC curve.
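In brief, a ROC curve sweeps a decision threshold over the classifier's scores and plots the true-positive rate against the false-positive rate at each threshold; for multiclass problems one typically draws one curve per class, one-vs-rest. A sketch on synthetic data, with logistic regression as a stand-in model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# One-vs-rest ROC for "class 2": treat class 2 as positive, the rest as negative.
y_bin = (y_te == 2).astype(int)
fpr, tpr, thresholds = roc_curve(y_bin, proba[:, 2])
print("class-2 OvR AUC:", roc_auc_score(y_bin, proba[:, 2]))
```

Plotting `tpr` against `fpr` gives the curve; the area under it (AUC) summarizes it as one number, where 0.5 is chance and 1.0 is perfect separation.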

Ani1211999 commented 5 years ago

Sir, one more request: after you make a document out of this, will you send it to me?

Ani1211999 commented 5 years ago

Random Forests gave 65% accuracy; after feature selection, improvement was observed up to 67%.
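One common way to do that feature-selection step is SelectFromModel driven by the forest's own feature importances; the snippet below sketches it on synthetic data (the mean-importance threshold is sklearn's default, not necessarily what was used here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
base_acc = accuracy_score(y_te, rf.predict(X_te))

# Keep only features whose importance exceeds the mean importance, then refit.
selector = SelectFromModel(rf, prefit=True)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)
rf_sel = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr_sel, y_tr)
sel_acc = accuracy_score(y_te, rf_sel.predict(X_te_sel))

print("features kept:", X_tr_sel.shape[1], "accuracy:", base_acc, "->", sel_acc)
```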

Ani1211999 commented 5 years ago

KNN ROC. Sir, I did whatever I could make out of that code, but the result differed in the multiclass case. Here is the ROC for class 2; I am not sure how it is being made or for which model. The accuracy has also been shown.

Ani1211999 commented 5 years ago

And whatever you are trying to implement is not a binary search; it is basically a heap (more precisely, a min-heap). It will take O(log n) time on average.
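For reference, Python's standard library implements exactly such a binary min-heap in heapq, with O(log n) push and pop:

```python
import heapq

h = []
for v in [5, 1, 9, 3]:
    heapq.heappush(h, v)     # sift-up: O(log n) per push

smallest = heapq.heappop(h)  # pop the minimum: O(log n)
print(smallest, h[0])        # 1 3  (h[0] peeks at the new minimum in O(1))
```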

Ani1211999 commented 5 years ago

Decision tree gave an accuracy of 59 percent; after feature selection, improvement was observed up to 62 percent.

Ani1211999 commented 5 years ago

Neural networks returned an accuracy of 58%. As you expected, Random Forests returned the best output among all the algorithms.

Ani1211999 commented 5 years ago

Sklearn contains packages for bagging and boosting, which train weak but homogeneous learners; to combine the predictions of different models into a final model, we need stacking. It is available in the vecstack library.
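As an aside, newer sklearn versions (0.22+) also ship a native StackingClassifier, so vecstack is not the only option. A sketch on synthetic data, combining a random forest and a small neural network with a logistic-regression meta-learner (model choices here are illustrative, not the project's actual setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Heterogeneous base learners; out-of-fold predictions (cv=5) feed the
# logistic-regression meta-learner, which produces the final prediction.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
print("stacked accuracy:", stack.score(X_te, y_te))
```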

Ani1211999 commented 5 years ago

Maximum accuracy observed was 69%. The classifiers used were neural networks and random forests.