Closed ablaom closed 1 year ago
Great implementation. Thanks for considering this.
One minor issue is resolved by #9.
If you can, I suggest you drop DataFrames as a test dependency, and just use Julia native tables, with the help of Tables = MLJBase.Tables if you really need it.
Other points to consider:
I wonder about whether tables as input and output make sense in this particular case.
For output maybe a matrix makes more sense than a table with contrived labels (node1, etc ). Or, maybe a SimpleGraph from Graph.jl, which would open up a bunch of visualization possibilities and remove some ambiguity about how to interpret the output. Of course, you might consider that out-of-scope for a first release. In that case, be sure to document how the user is to interpret the output (which depends on the topology, right?).
There isn't a scitype for graphs, but we could add one. You would just leave that unspecified for now, as the fallback Unknown
will pass through any type checks.
For input, maybe accept table or matrix, which you can accommodate with
input_scitype = Union{AbstractMatrix{Continuous}, Table(Continuous)}
You will also need an MLJ-compliant document string for your model. There are examples you can follow.
You can find an example at EvoTrees.jl. You can avoid the clumsy comparison of old and new model there with the help of MLJModelInterace.is_same_except
method. The idea is that update
checks if the current model
and a deep copy you pass to cache
(algorithm "state") are the same, except for the number of iterations. If so, you just call n_new - n_old
more iterations (if positive) on the existing fitresult
. Otherwise, you call fit
.
Very, sorry, I seemed to have missed these latest developments. Are you ready for me to review again?
Copying from slack discussion for my own reference:
I originally thought to return the index of the winning node as the class label, but figured it’s probably more useful to return the full array of distances.
@ablaom replies: On further reflection, I think the most natural thing to return would be the location, in the plane, of the winning node’s location in the input space of the map. So transform is just dimension reduction, like in PCA. In the case of the sphere, you’d give the polar coordinates, but I expect you want to reduce the polar angles mod 2pi for plotting pbburposes (currently not the case for the full grid of input points in som.coords.) One can always get the full list of pairwise distances presently returned by transform if you want them, by accessing the fitted parameters, or you could also add them to the report . By the way, currently fitted_params is only exposing the weights (“output points”) but I would add the matrix of all input points as well (som.coords, reduced mod 2pi) .
@john-waczak Your changes look good. I've made a suggestion here.
If you get stuck with the update
let me know.
Again thanks for your patience.
@john-waczak If I recall correctly, this package was close to being a green light for MLJ integration. Be great to get it over the finish line. The update
business could certainly wait for a later date.
To do:
Realizing I forgot to commit the change to return spherical coordinates mod $2\pi$ per previous discussion. Also, the package should be registered now!
@john-waczak Ping me after you merge #14 and tag a new release and I'l move on to the next step.
I was going to tag the SOM models as "dimension reduction" but can add other tags from this list. This effects the categories for the Model Browser listing. Let me know.
@ablaom I've gone ahead and tagged the release (haven't used TagBot before, so that was a good learning experience!). I think :dimension_reduction
makes sense as well as :clustering
since one can use the index of the best match unit for a given input as the unsupervised class label.
makes sense as well as :clustering since one can use the index of the best match unit for a given input as the unsupervised class label.
Yes, you are right. Perhaps, at some later date we can implement predict
to return (a categorical
ized) version of those labels. That would bring it into line with other clustering algorithms.
I will still add the tag , as the categories are not meant as strict classifications based on interface.
Great! I added an issue for it. I'm planning on using this for a project shortly and will need the functionality so I'll add the predict method then.
@ablaom assigning self.