Closed RobinVogel closed 5 months ago
Just a quick reminder: "solves" is not part of the keywords that GitHub recognizes to automatically close issues ;-)
I think this creates a major API problem because `fit` takes 4 arguments as input, `X, y, X_u, chunks`, where `X` and `y` do not generally have the same number of rows as `X_u` and `chunks`. This likely breaks compatibility with model selection routines from sklearn.
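To illustrate the concern, here is a minimal sketch using `sklearn.utils.check_consistent_length`, a utility scikit-learn applies to check that input arrays share the same number of samples. The array sizes and the `lengths_are_consistent` wrapper are made up for the example; the point is only that arrays with different numbers of rows are rejected.

```python
import numpy as np
from sklearn.utils import check_consistent_length

rng = np.random.RandomState(0)

# Hypothetical data: 10 labeled points and 25 points described by chunks.
X = rng.randn(10, 3)
y = rng.randint(0, 2, size=10)
X_u = rng.randn(25, 3)
chunks = rng.randint(-1, 4, size=25)


def lengths_are_consistent(*arrays):
    """Return True if all arrays have the same number of samples."""
    try:
        check_consistent_length(*arrays)
        return True
    except ValueError:
        # Raised when the arrays have inconsistent numbers of samples.
        return False


labeled_ok = lengths_are_consistent(X, y)            # same number of rows
mixed_ok = lengths_are_consistent(X, y, X_u, chunks)  # 10 rows vs 25 rows
print(labeled_ok, mixed_ok)
```

Any routine that splits all inputs with one shared set of sample indices runs into the same mismatch.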
Furthermore, this strong supervision + weak supervision is not a major use-case in practice. So indeed the overhead induced by introducing new classes, having to test and document them etc, is probably too large compared to the benefits.
I would favor a solution based on helper functions which combine pairs/quadruplets/chunks provided by the user with those generated from labeled data, so that users can then easily fit `RCA` with the output of this helper function. So essentially something similar to what you wrote for RCA but without creating a new class. We can then add a short paragraph to mention the existence of such helper functions in the doc and we're good.
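A rough sketch of what such a helper could look like for chunks, to make the idea concrete. Everything here is an assumption for illustration: the name `combine_chunks`, its parameters, and the simple chunk-generation scheme (fixed-size groups of same-label points) are hypothetical and are not the existing metric-learn code. It follows the convention that `-1` marks a point belonging to no chunk.

```python
import numpy as np


def combine_chunks(X, y, X_u, chunks_u, chunk_size=2, random_state=None):
    """Hypothetical helper: merge chunks built from labels with user chunks.

    Returns the stacked data and a single chunk-id array aligned with it.
    """
    rng = np.random.RandomState(random_state)
    X, y = np.asarray(X), np.asarray(y)
    X_u, chunks_u = np.asarray(X_u), np.asarray(chunks_u)

    # Build chunks from the labeled data: all points in a chunk share a label.
    chunks_l = -np.ones(len(y), dtype=int)
    next_id = 0
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        # Groups of `chunk_size` points; an incomplete remainder stays at -1.
        for start in range(0, len(idx) - chunk_size + 1, chunk_size):
            chunks_l[idx[start:start + chunk_size]] = next_id
            next_id += 1

    # Shift user chunk ids so they do not collide with the generated ones,
    # leaving unassigned points (-1) untouched.
    chunks_u_shifted = np.where(chunks_u >= 0, chunks_u + next_id, -1)

    X_all = np.vstack([X, X_u])
    chunks_all = np.concatenate([chunks_l, chunks_u_shifted])
    return X_all, chunks_all


# Toy usage: 6 labeled points (2 classes) and 4 points with user chunks.
X = np.zeros((6, 2))
y = np.array([0, 0, 0, 1, 1, 1])
X_u = np.ones((4, 2))
chunks_u = np.array([0, 0, -1, 1])

X_all, chunks_all = combine_chunks(X, y, X_u, chunks_u, random_state=0)
# X_all and chunks_all now have the same number of rows, so they could be
# passed to a chunk-based fit such as RCA().fit(X_all, chunks_all) (not run here).
```

The output has one row per point and one chunk id per row, so the estimator keeps its usual two-argument `fit` and no new class is needed.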
Note: as pointed out by @hansen7 on #233, semi-supervised is probably not the right term to describe this. This is more a combination of supervised and weakly supervised.
Of course I am happy to hear whether @terrytangyuan @perimosocordiae @wdevazelhes have a different opinion
I agree. In this case API compatibility is more important, especially now that we are in scikit-learn-contrib. We can start with the helper function and if it becomes popular to users we can then re-consider this.
Closes #233
For now I only wrote what I believe to be expected for #233 for the RCA algorithm. It is a simple modification of the supervised version of the RCA. The test is very basic as well.
It simply concatenates the weakly supervised information provided by the user with the weakly supervised information derived from the labeled data (the strongly supervised information).
It is convenient but increases the volume of the code and documentation. There is a `random_state` parameter passed to the `fit` function in RCA; it is marked as deprecated and increases the volume of tests needed for the Semi Supervised algorithms. I will check whether a `random_state` is present in other algorithms, to understand its relevance. I will do the other algorithms and better tests if we agree on this structure.