numbats / vaast

This package computes scagnostics (scatterplot diagnostics) for data with multiple numeric variables.
https://numbats.github.io/vaast/

which scags to implement? #3

Open sa-lee opened 3 years ago

sa-lee commented 3 years ago

Graph based

Outlier removal happens first based on edge lengths of the MST.
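To make the step above concrete, here is a minimal sketch of MST-based outlier flagging. It assumes the rule commonly described for graph-theoretic scagnostics, where a point is an outlier when every MST edge touching it is longer than q75 + 1.5 * IQR of the MST edge lengths. That rule, the single-pass behaviour, and the function names `mst_edges` and `mst_outliers` are assumptions for illustration, not a description of what vaast currently does.

```python
import math
import statistics


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j, length) edges."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n  # shortest known distance from the tree to each vertex
    best_from = [-1] * n        # tree vertex that realises that distance
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u, best_dist[u]))
        # Update the best known connection for the remaining vertices.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def mst_outliers(points):
    """Indices of points whose incident MST edges are all longer than q75 + 1.5 * IQR.

    Assumed rule (hypothetical helper, single pass, no re-fitting of the MST).
    """
    edges = mst_edges(points)
    if len(edges) < 2:
        return []
    lengths = [length for _, _, length in edges]
    q25, _, q75 = statistics.quantiles(lengths, n=4, method="inclusive")
    omega = q75 + 1.5 * (q75 - q25)
    incident = {i: [] for i in range(len(points))}
    for i, j, length in edges:
        incident[i].append(length)
        incident[j].append(length)
    return [i for i, lens in incident.items() if all(l > omega for l in lens)]


# A 3x3 unit grid plus one far-away point (index 9).
points = [(x, y) for x in range(3) for y in range(3)] + [(20, 20)]
print(mst_outliers(points))  # [9]
```

Note that the grid corner at (2, 2) is not flagged even though it carries the long edge, because it also has a short edge to a neighbour. The sketch is O(n^2) and only meant to show the idea.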

Non-graph based

@uschiLaa in #2 mentioned she has some implementations for splines

Small versions of dcor2d and splines2d (without the binned option) are in the tourr package; see https://github.com/ggobi/tourr/blob/master/R/interesting-indices.r. In my experience, binning is not needed for the splines-based index, which is fast even with large samples.

uschiLaa commented 3 years ago

Small comment: outlier removal (as done in the current implementation) is based on the MST and is done before calculating any of the measures, so there is always some dependence on that.

sa-lee commented 3 years ago

I'm not sure how well the outlier removal works, at least based on my understanding of it. Is it a good idea to remove outliers when you don't know the data-generating process? Maybe this is something we could get Harriet to test out, @dicook?