lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Implementation of scikit-learn's get_feature_names_out() API is not correct #1082

Open buhrmann opened 11 months ago

buhrmann commented 11 months ago

I was happy to see the get_feature_names_out API is already supported, but noticed that the implementation is not correct.

The scikit-learn API for get_feature_names_out(...) accepts optional input feature names, and should generate valid output feature names.

The current implementation seems to assume that the optional argument carries output names, and simply forwards them if provided. This fails in most cases, because scikit-learn pipelines and similar tools pass in input names. Specifically, if provided, the argument is an array with as many names as there are features in the input. That is almost never the correct number for the output, since UMAP reduces the dimensionality. E.g. if I reduce a dataset with 100 original dimensions to 2 final dimensions, the current implementation will forward 100 feature names for an output with only two features.

The simplest fix would be to always return a list of names of the correct length, regardless of the optional argument.
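A rough sketch of that simple fix, written as a standalone function rather than a patch against the actual UMAP class. The function name, the `prefix` parameter, and the "umap" naming scheme are all hypothetical; scikit-learn's own transformers return an ndarray of strings, so a real implementation would probably do the same instead of returning a list.

```python
def reduced_feature_names(n_components, input_features=None, prefix="umap"):
    """Return one generated name per output dimension.

    `input_features` is accepted only for API compatibility and is ignored:
    its length matches the input, not the reduced output.
    (Hypothetical helper, not part of umap-learn.)
    """
    return [f"{prefix}{i}" for i in range(n_components)]


# 100 input names go in, but only 2 output names come back.
names_in = [f"feature_{i}" for i in range(100)]
names_out = reduced_feature_names(2, input_features=names_in)
print(names_out)
```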

A slightly more sklearn-ish implementation would be to use the ClassNamePrefixFeaturesOutMixin.