john-waczak / SelfOrganizingMaps.jl

MIT License
7 stars 1 forks source link

Review MLJ interface #8

Closed ablaom closed 1 year ago

ablaom commented 2 years ago

@ablaom assigning self.

ablaom commented 2 years ago

A review of the MLJ interface

Great implementation. Thanks for considering this.

One minor issue is resolved by #9.

If you can, I suggest you drop DataFrames as a test dependency, and just use Julia native tables, with the help of Tables = MLJBase.Tables if you really need it.

Other points to consider:

I wonder about whether tables as input and output make sense in this particular case.

For output maybe a matrix makes more sense than a table with contrived labels (node1, etc ). Or, maybe a SimpleGraph from Graph.jl, which would open up a bunch of visualization possibilities and remove some ambiguity about how to interpret the output. Of course, you might consider that out-of-scope for a first release. In that case, be sure to document how the user is to interpret the output (which depends on the topology, right?). There isn't a scitype for graphs, but we could add one. You would just leave that unspecified for now, as the fallback Unknown will pass through any type checks.

For input, maybe accept table or matrix, which you can accommodate with

input_scitype = Union{AbstractMatrix{Continuous}, Table(Continuous)}

You will also need an MLJ-compliant document string for your model. There are examples you can follow.

You can find an example at EvoTrees.jl. You can avoid the clumsy comparison of old and new model there with the help of MLJModelInterace.is_same_except method. The idea is that update checks if the current model and a deep copy you pass to cache (algorithm "state") are the same, except for the number of iterations. If so, you just call n_new - n_old more iterations (if positive) on the existing fitresult. Otherwise, you call fit.

ablaom commented 2 years ago

Very, sorry, I seemed to have missed these latest developments. Are you ready for me to review again?

ablaom commented 2 years ago

Copying from slack discussion for my own reference:

I originally thought to return the index of the winning node as the class label, but figured it’s probably more useful to return the full array of distances.

@ablaom replies: On further reflection, I think the most natural thing to return would be the location, in the plane, of the winning node’s location in the input space of the map. So transform is just dimension reduction, like in PCA. In the case of the sphere, you’d give the polar coordinates, but I expect you want to reduce the polar angles mod 2pi for plotting pbburposes (currently not the case for the full grid of input points in som.coords.) One can always get the full list of pairwise distances presently returned by transform if you want them, by accessing the fitted parameters, or you could also add them to the report . By the way, currently fitted_params is only exposing the weights (“output points”) but I would add the matrix of all input points as well (som.coords, reduced mod 2pi) .

ablaom commented 2 years ago

@john-waczak Your changes look good. I've made a suggestion here.

If you get stuck with the update let me know.

Again thanks for your patience.

ablaom commented 1 year ago

@john-waczak If I recall correctly, this package was close to being a green light for MLJ integration. Be great to get it over the finish line. The update business could certainly wait for a later date.

ablaom commented 1 year ago

To do:

john-waczak commented 1 year ago

Realizing I forgot to commit the change to return spherical coordinates mod $2\pi$ per previous discussion. Also, the package should be registered now!

ablaom commented 1 year ago

@john-waczak Ping me after you merge #14 and tag a new release and I'l move on to the next step.

I was going to tag the SOM models as "dimension reduction" but can add other tags from this list. This effects the categories for the Model Browser listing. Let me know.

john-waczak commented 1 year ago

@ablaom I've gone ahead and tagged the release (haven't used TagBot before, so that was a good learning experience!). I think :dimension_reduction makes sense as well as :clustering since one can use the index of the best match unit for a given input as the unsupervised class label.

ablaom commented 1 year ago

makes sense as well as :clustering since one can use the index of the best match unit for a given input as the unsupervised class label.

Yes, you are right. Perhaps, at some later date we can implement predict to return (a categoricalized) version of those labels. That would bring it into line with other clustering algorithms.

I will still add the tag , as the categories are not meant as strict classifications based on interface.

john-waczak commented 1 year ago

Great! I added an issue for it. I'm planning on using this for a project shortly and will need the functionality so I'll add the predict method then.