pedroilidio / bipartite_learn

BSD 3-Clause "New" or "Revised" License

Dealing with bipartite/monopartite datasets can be improved #7

Open pedroilidio opened 1 year ago

pedroilidio commented 1 year ago

Depending on the scenario, we handle the data either in bipartite or in monopartite format. Bipartite means X = [X[0], X[1]], i.e. the two feature matrices are kept separate. Monopartite means the bipartite dataset has been converted so that X is formed by concatenating every row of X[0] with every row of X[1], giving one row per pair.
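To make the two formats concrete, here is a minimal NumPy sketch of the conversion described above. The function name `to_monopartite` and the row-major pair ordering (pair (i, j) lands at row `i * n_cols + j`, matching `y.reshape(-1)`) are assumptions for illustration, not necessarily what bipartite_learn does internally.

```python
import numpy as np


def to_monopartite(X, y=None):
    """Hypothetical helper: convert bipartite X = [X[0], X[1]] to monopartite.

    Assumes row-major pair ordering, so that the pair (i, j) ends up at
    row i * n_cols + j, consistent with y.reshape(-1).
    """
    X_rows, X_cols = (np.asarray(Xi) for Xi in X)
    n_rows, n_cols = X_rows.shape[0], X_cols.shape[0]

    X_mono = np.hstack([
        np.repeat(X_rows, n_cols, axis=0),  # each X[0] row repeated n_cols times
        np.tile(X_cols, (n_rows, 1)),       # the whole X[1] block repeated n_rows times
    ])
    if y is None:
        return X_mono
    return X_mono, np.asarray(y).reshape(-1)
```

With X[0] of shape (3, 2) and X[1] of shape (4, 5), the monopartite X has shape (12, 7) and the flattened y has 12 entries.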

As mentioned in https://github.com/pedroilidio/bipartite_learn/issues/5#issuecomment-1509892869, distinguishing between these two formats deserves a more careful solution than what we currently do:

https://github.com/pedroilidio/bipartite_learn/blob/84998676b7847c5564cdacfc1d43d269e4eb6140/bipartite_learn/utils/__init__.py#L40-L42

This matters even more because some estimators accept both types of input in predict() (tree-based models in general), while others accept only the bipartite format (the matrix factorization ones, for instance). All of them, however, should yield flattened predictions for better integration with scikit-learn scoring utilities, which I reckon can be quite confusing.

  1. I suppose an estimator tag would be an appropriate way of signaling which input formats an estimator accepts.
  2. Maybe a whole Dataset class would facilitate maintenance in the long term.
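A rough sketch of how the two suggestions could fit together, just to make the discussion concrete. Everything here is hypothetical: `BipartiteDataset`, `input_for`, and the plain `accepts_monopartite` attribute (standing in for a real scikit-learn estimator tag) do not exist in bipartite_learn, and the row-major pair ordering is an assumption.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(eq=False)
class BipartiteDataset:
    """Hypothetical container that makes the data format explicit."""

    X_rows: np.ndarray  # shape (n_rows, n_row_features)
    X_cols: np.ndarray  # shape (n_cols, n_col_features)
    y: np.ndarray       # shape (n_rows, n_cols)

    def __post_init__(self):
        expected = (self.X_rows.shape[0], self.X_cols.shape[0])
        if self.y.shape != expected:
            raise ValueError(f"y has shape {self.y.shape}, expected {expected}")

    def as_bipartite(self):
        return [self.X_rows, self.X_cols], self.y

    def as_monopartite(self):
        # Assumed ordering: pair (i, j) goes to row i * n_cols + j.
        n_rows, n_cols = self.y.shape
        X = np.hstack([
            np.repeat(self.X_rows, n_cols, axis=0),
            np.tile(self.X_cols, (n_rows, 1)),
        ])
        return X, self.y.reshape(-1)

    def input_for(self, estimator, fmt="bipartite"):
        """Return (X, y) in the requested format, if the estimator allows it."""
        if fmt == "bipartite":
            return self.as_bipartite()
        if fmt != "monopartite":
            raise ValueError(f"Unknown format: {fmt!r}")
        # `accepts_monopartite` is a stand-in for an estimator tag.
        if not getattr(estimator, "accepts_monopartite", False):
            raise ValueError(
                f"{type(estimator).__name__} only accepts bipartite input"
            )
        return self.as_monopartite()
```

With something like this, the format no longer has to be guessed from the shape or type of X: the dataset knows how it is stored, and the estimator declares what it can consume.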