API design - Githubissues

Fil commented 3 years ago

With the current API, if one wants to project in d=3, one has to know the exact number n of optional arguments before specifying 3 as the n+1th argument. This feels a bit uneasy, and it means that we can't add a supplementary hyperparameter to any method without it being a breaking change.

It seems to be that it would be nice to rethink the API "à la D3", so that:

all the algorithms can be called interchangeably
we could separate the training and transform phases (#11)
we could specify hyperparameters individually
we could serialize the model (in and out : save and load)

I would imagine that this could be structured as:

new Druid([method or model]) — create a druid
druid.values([accessor]) — sets the values accessor if specified, and returns the druid; return the values accessor if not specified
druid.dimensions([number]) — sets or returns the dimensions (default: 2)
druid.class([accessor]) — sets or returns the class accessor (for LDA)
druid.method([name or class]) — sets the current method (UMAP, FASTMAP etc) if specified and returns the druid ; if not specified, return the method (as a Class or function).
druid.fit(data) — train the model on the data and returns the druid
druid.transform([data]) — transforms the data if specified; if data is not specified, returns the transformed train set
druid.model([model]) — returns the serialized model (JSON) if a model is not specified, loads the model if specified

And for each hyperparameter, for example UMAP/min_dist

druid.min_dist([min_dist]) — if specified, sets the min_dist hyperparameter and returns the druid, or read it if not specified

With this we could say for example:

const dr = new Druid("LDA"); // dr
dr.dimensions(2).class(d => d.species).values(d => [+d.sepal_length, +d.petal_length, …]).fit(data); // dr
dr.transform(); // transformed data
const model = dr.model(); // JSON {}
…
const dr = new Druid(model); // dr
dr.transform([new data]); // apply the model to new data…

I wonder what should be done for NaN, I suppose they should be automatically ignored if the values accessor returns any NaN.

Note also that some methods such as UMAP can accept a distance matrix instead of a data array.

PS: Sorry for spamming your project :) The potential is very exciting.

Fil commented 3 years ago

Update: changed train to fit in order to match sklearn API

saehm commented 3 years ago

You are right, with the current API you have to know the parameters. The idea was, that if you want to change the dimensionality, or the used metric you have to know what you are doing anyways. But, a DR object already has a function druid.parameter("parameter_name", [parameter_value]) - (with two aliases "para" and "p") where you can set a parameter (which is chainable, similar to d3's attr function). But it would be no problem to use getters and setter. Checking the parameters would be probably easier that way.

Druid has already some other things implemented, some clustering-, k-NN-, and linear algebra implementations --- some of them doesn't work that well yet ;). Therefore we could maybe change the DR constructor to take a String for the name of the DR method for example const dr = new druid.DR("LDA");. (For now it works if you use const dr = new druid["LDA"];)

I like the values function very much, we should add this :), also the function to set dimensionality and additionally one for changing the metric function.

As I mentioned in issue #11 a fit or train method will not work with most of the DR methods. Maybe we could add it for those DR methods where it works?

Fil commented 3 years ago

Ah I hadn't seen the chainable .parameter method, now I see it!

new Druid.UMAP(data).parameter("min_dist", 2).transform()

however it feels a bit strange to parametrize after adding the values. And it seems you can't use .parameter("d", 3) to change the dimensionality?

you have to know what you are doing anyways

I disagree :) I love to learn by testing things out, and it's frustrating if they break for no apparent reason. You can see that in the "hello" notebook: it needs quite a bit of code to inject the default values. And if you're trying to go 3D, they are not optional.

PS: I admit I haven't paid attention yet to the clustering methods (and others); my comments so far are meant only for the DR methods of the API. But I'm curious about them and waiting for some examples or documentation to appear :)

Fil commented 3 years ago

I would also very much like all DR methods to return a generator, even if it only yields one final result. Otherwise we have to do something like this in user space:

return typeof D.generator === "function" ? D.generator() : D.transform();

Ref. https://observablehq.com/@fil/druidjs-worker

EDIT: solved in 0.7.3

hydrosquall commented 3 years ago

I would find having a uniform way to set the dimensions parameter across all algorithms very helpful, as I came to this issues page specifically to find out whether that was possible, as I wanted to try projecting to 3D.

I didn't see a d parameter available in the ObservableHQ notebook example - is this a capability that's currently available but not documented, or something new to implement?

saehm / DruidJS

API design #13