scikit-learn-contrib / metric-learn

Metric learning algorithms in Python
http://contrib.scikit-learn.org/metric-learn/
MIT License

Adds a semi-supervised (specifically, a combination of supervised and weakly-supervised data) version of the weakly supervised algorithms #268

Closed: RobinVogel closed this 5 months ago

RobinVogel commented 5 years ago

Closes #233

For now I have only implemented what I believe is expected for #233 for the RCA algorithm. It is a simple modification of the supervised version of RCA, and the test is very basic as well.

It is simply based on concatenating the weakly supervised information provided by the user with the weakly supervised information derived from the labeled data (i.e. the strongly supervised information transformed into chunks).

It is convenient, but it increases the volume of code and documentation. There is also a random_state parameter passed to the fit function in RCA; it is marked as deprecated and adds to the volume of tests needed for the semi-supervised algorithms. I will check whether a random_state is present in the other algorithms to understand its relevance.

I will implement the other algorithms and write better tests if we agree on this structure.
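
To make the described approach concrete, here is a hedged sketch of its key ingredient: converting the strongly supervised labels into RCA-style chunk assignments, which the semi-supervised fit can then concatenate with the user-provided chunks. The function name `chunks_from_labels` and its signature are purely illustrative, not the actual code of this PR.

```python
import numpy as np

def chunks_from_labels(y, chunk_size=2, random_state=None):
    """Convert class labels into RCA-style chunk assignments: each class is
    split into chunks of `chunk_size` points; leftover points get -1, i.e.
    they belong to no chunk."""
    y = np.asarray(y)
    rng = np.random.RandomState(random_state)
    chunks = np.full(len(y), -1, dtype=int)
    next_chunk = 0
    for label in np.unique(y):
        idx = rng.permutation(np.where(y == label)[0])
        for start in range(0, len(idx) - chunk_size + 1, chunk_size):
            chunks[idx[start:start + chunk_size]] = next_chunk
            next_chunk += 1
    return chunks
```

A semi-supervised fit along these lines would then offset the user-provided chunk indices so they do not collide with the label-derived ones, stack the two feature matrices, and call the usual supervised RCA fit on the result.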

bellet commented 4 years ago

Just a quick reminder: "solves" is not part of the keywords that GitHub recognizes to automatically close issues ;-)

bellet commented 4 years ago

I think this creates a major API problem: fit takes as input 4 arguments (X, y, X_u, chunks), where X and y do not generally have the same number of rows as X_u and chunks. This likely breaks compatibility with model selection routines from scikit-learn (e.g. cross_val_score and GridSearchCV), which expect a fit(X, y) signature with aligned arrays.

Furthermore, this combination of strong and weak supervision is not a major use-case in practice, so the overhead induced by introducing new classes (having to test and document them, etc.) is probably too large compared to the benefits.

I would favor a solution based on helper functions which combine pairs/quadruplets/chunks provided by the user with those generated from labeled data, so that users can then easily fit RCA with the output of such a helper. Essentially something similar to what you wrote for RCA, but without creating a new class. We can then add a short paragraph to the doc mentioning the existence of these helper functions and we're good.
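
For illustration, a hedged sketch of what such a helper could look like for the chunks case (the name `combine_chunks_with_labels` is hypothetical, and it reuses the `chunks_from_labels` sketch from the previous comment):

```python
import numpy as np
from metric_learn import RCA

def combine_chunks_with_labels(X, y, X_u, chunks_u, chunk_size=2, random_state=None):
    """Hypothetical helper: merge user-provided chunks with chunks generated
    from the labeled data, returning arrays that can be passed directly to
    the existing RCA.fit(X, chunks)."""
    chunks_l = chunks_from_labels(y, chunk_size, random_state)   # see previous sketch
    offset = chunks_l.max() + 1                                  # avoid chunk index collisions
    chunks_u = np.where(np.asarray(chunks_u) >= 0, chunks_u + offset, -1)
    return np.vstack([X, X_u]), np.concatenate([chunks_l, chunks_u])

# Users stay within the current API: combine first, then fit the existing class.
X_all, chunks_all = combine_chunks_with_labels(X, y, X_u, chunks_u)
rca = RCA().fit(X_all, chunks_all)
```

This keeps the 4-array signature out of fit, so the existing estimators remain compatible with scikit-learn's model selection routines.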

Note: as pointed out by @hansen7 on #233, semi-supervised is probably not the right term to describe this. It is more a combination of supervised and weakly supervised data.

bellet commented 4 years ago

Of course, I am happy to hear whether @terrytangyuan @perimosocordiae @wdevazelhes have a different opinion.

terrytangyuan commented 4 years ago

I agree. In this case API compatibility is more important, especially now that we are in scikit-learn-contrib. We can start with the helper function and, if it becomes popular with users, reconsider this later.