tidymodels / tidyclust

A tidy unified interface to clustering models
https://tidyclust.tidymodels.org/

batch prediction #57

Open kbodwin opened 1 year ago

kbodwin commented 1 year ago

To think about for the future: Sometimes it might make sense to predict in batches. That is,

predict(c(x1, x2))

returns different results from

predict(x1), predict(x2)

This is relevant in hierarchical clustering. If we are doing single-linkage clustering, we can imagine a set of test data where two observations are closer to each other than they are to any of the training observations.

As it stands now, each observation is predicted separately, by finding the closest training observation and assigning the new observation to that cluster. It's possible that two close observations nonetheless end up in different predicted clusters.

For example, on a 1D number line, maybe we have:

---- Trained Cluster A ---------- Test Obs 1 -- Test Obs 2 ----- Trained Cluster B -------

and Test Obs 1 is put in Cluster A, Test Obs 2 in Cluster B.
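A minimal base R sketch of this situation, to make the numbers concrete. The data values and the `predict_nearest()` helper are hypothetical, chosen only to mirror the illustration; this is not tidyclust's actual prediction code.

```r
# Hypothetical 1D data mirroring the number line above.
train <- c(0, 1, 2, 20, 21, 22)
test  <- c(obs1 = 10, obs2 = 13)

# Single-linkage hierarchical clustering on the training data, cut into 2 clusters.
fit <- hclust(dist(train), method = "single")
train_cluster <- c("A", "B")[cutree(fit, k = 2)]

# Hypothetical helper: each new point takes the cluster of its nearest training point.
predict_nearest <- function(new_x, train_x, train_cl) {
  vapply(new_x, function(x) train_cl[which.min(abs(train_x - x))], character(1))
}

pred <- predict_nearest(test, train, train_cluster)
print(pred)
```

Here the two test observations are 3 units apart, but `obs1` is 8 units from Cluster A and `obs2` is 7 units from Cluster B, so they are split even though they are closer to each other than to any training point.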

However, we could imagine doing agglomeration on the test data first, letting close test observations join into clusters before they are attached, as a group, to a training cluster. In the above illustration, Test Obs 1 and Test Obs 2 would join together first, and the pair would then be added to Cluster B, because Test Obs 2 is closer to Cluster B than Test Obs 1 is to Cluster A.
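One way this could look, as a rough base R sketch under stated assumptions: the `batch_predict()` function, its arguments, and the data are all hypothetical, the data is 1D, and the trained clusters are treated as fixed groups that test observations can join but that never merge with each other. It is only one possible reading of "batch prediction", not an existing tidyclust feature.

```r
# Hypothetical 1D data mirroring the number line above.
train <- c(0, 1, 2, 20, 21, 22)
test  <- c(obs1 = 10, obs2 = 13)
fit <- hclust(dist(train), method = "single")
train_cluster <- c("A", "B")[cutree(fit, k = 2)]

# Hypothetical batch prediction: single-linkage agglomeration over the test
# points, with trained clusters acting as fixed groups.
batch_predict <- function(new_x, train_x, train_cl) {
  n <- length(new_x)
  group <- seq_len(n)                  # each test point starts in its own group
  label <- rep(NA_character_, n)       # NA = not yet attached to a trained cluster
  d_train <- abs(outer(new_x, train_x, "-"))  # test-to-train distances
  d_test  <- abs(outer(new_x, new_x, "-"))    # test-to-test distances
  while (anyNA(label)) {
    open <- which(is.na(label))
    d_tr <- d_train[open, , drop = FALSE]
    d_tt <- d_test[open, , drop = FALSE]
    d_tt[outer(group[open], group, "==")] <- Inf  # ignore pairs already grouped
    best_train <- min(d_tr)
    best_test  <- min(d_tt)
    if (best_train <= best_test) {
      # Closest link is to a training point: attach the whole test group.
      idx <- which(d_tr == best_train, arr.ind = TRUE)[1, ]
      i <- open[idx[1]]
      label[group == group[i]] <- train_cl[idx[2]]
    } else {
      # Closest link is to another test point.
      idx <- which(d_tt == best_test, arr.ind = TRUE)[1, ]
      i <- open[idx[1]]
      j <- idx[2]
      if (is.na(label[j])) {
        group[group == group[j]] <- group[i]   # merge two unattached groups
      } else {
        label[group == group[i]] <- label[j]   # join an already attached group
      }
    }
  }
  names(label) <- names(new_x)
  label
}

batch_pred  <- batch_predict(test, train, train_cluster)
single_pred <- batch_predict(test["obs1"], train, train_cluster)
print(batch_pred)
print(single_pred)
```

With these numbers both test observations land in Cluster B when predicted together, while `obs1` predicted on its own goes to Cluster A. In other words, under this sketch the prediction for one observation depends on which other observations are in the batch.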

There are two prediction principles in conflict here:

  1. Predicting on an observation from the training data should produce the original cluster assignment.
  2. Adding more observations to the test data should not change the prediction process for any individual observation in the test data.

I believe there are situations where it is impossible to respect both of these at once, and so it might make sense to offer a "batch prediction" option.

EmilHvitfeldt commented 1 year ago

I really don't want to violate the prediction principles listed above, so we might have to look into a different method for doing "batch prediction".