uwdata / arquero

Query processing and transformation of array-backed data tables.
https://idl.uw.edu/arquero
BSD 3-Clause "New" or "Revised" License
1.22k stars, 64 forks

Thematic mapping utilities #68

Open ericemc3 opened 3 years ago

ericemc3 commented 3 years ago

An extension to op.ntile() could prove useful for encoding numeric values into categories from manual breaks. Something similar to the R cut function: dens_code = cut(pop_density, breaks = c(0, 1000, 5000, 20000, 100000, Inf)...) or d3.scaleThreshold().

jenks() and kmeans() are also useful clustering methods. We can borrow them from the simple-statistics library, but of course it would be convenient if they were in Arquero.

jcmkk3 commented 3 years ago

santoku is a very featureful package for R that tries to improve on cut. I don't think that Arquero needs this many features, but there could be some API inspiration there, in addition to d3 and other patterns in the JavaScript world.

jheer commented 3 years ago

I'd be happy to consider a new cut / chop / etc. implementation for inclusion in Arquero. Similar to recode, it might be added as a new standard op function.

As for clustering algorithms, I think those might be more fitting as extensions defined in a separate package, as discussed in #67.

ericemc3 commented 3 years ago

Great, thanks!

ericemc3 commented 3 years ago

A simple implementation for an op.cut could just be the following. Consider, for instance, breaks = [t1, t2, t3], and recode x with:

x ∈ [min, t1[ => 0
x ∈ [t1, t2[ => 1
x ∈ [t2, t3[ => 2
x ∈ [t3, max] => 3
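To make the proposed recoding concrete, here is a minimal sketch in plain JavaScript. The function name cut, its signature, and its handling of missing values are assumptions for illustration only; nothing here is part of the Arquero API. It assumes breaks is sorted in ascending order and uses left-closed intervals as in the mapping above, so a value equal to a break falls into the upper bin.

```javascript
// Hypothetical helper, not part of Arquero: return the index of the
// interval that x falls in, given break points sorted in ascending order.
// Intervals are left-closed: [min, t1[ => 0, [t1, t2[ => 1, and so on.
function cut(x, breaks) {
  // Assumption: missing or NaN input has no bin, so return undefined.
  if (x == null || Number.isNaN(x)) return undefined;
  // Binary search for the number of breaks that are <= x.
  let lo = 0;
  let hi = breaks.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (breaks[mid] <= x) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Example with interior breaks taken from the population density case.
const breaks = [1000, 5000, 20000];
const densities = [250, 1000, 4999, 20000, 150000];
const codes = densities.map((d) => cut(d, breaks));
console.log(codes); // [0, 1, 1, 3, 3]
```

With n breaks the result is an integer from 0 to n, which could then be mapped to labels or colors. Note that R's cut uses right-closed intervals by default, so boundary values would land one bin lower there than in this sketch.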