Open AlexandreAbraham opened 4 years ago
Hi! Thanks for the PR! I'll try to remove it soon and get back to you.
Just a quick update, I am still working on reviewing it, not sure if I understand everything correctly, so I need to spend more time with it.
I have also noticed that in my previous comment, I wrote "I'll try to remove it soon" :D I actually meant to say I'll try to review it soon. Sorry for the potential confusion :)
Hey, No problem, let me know of I can add comments to make it more readable.
This is a proposition of optimization for the ranked batch method. In the ranked batch methods, the distance between all unlabeled samples is computed at each iteration which is computationally prohibitive on large datasets. This modification simply update the minimum distance vector with each selected sample. Note that I do not remove the labelled sample from the vector to avoid memory reallocation and unnecessary complexity. However, one sample will not be selected twice as the distance to itself is 0.
I observe a difference in the performance on the ranked_batch example though.
Here is the performance history given master:
[0.3333333333333333, 0.8933333333333333, 0.84, 0.9266666666666666, 0.9333333333333333, 0.9466666666666667, 0.9533333333333334]
Here is the performance history on my branch:
[0.3333333333333333, 0.8733333333333333, 0.8733333333333333, 0.9333333333333333, 0.9333333333333333, 0.9466666666666667, 0.9733333333333334]
I am still investigating this. Any insight is welcome.