modAL-python / modAL

A modular active learning framework for Python
https://modAL-python.github.io/
MIT License
2.24k stars 324 forks source link

Compute pairwise distances only once instead of each turn. #68

Open AlexandreAbraham opened 4 years ago

AlexandreAbraham commented 4 years ago

This is a proposition of optimization for the ranked batch method. In the ranked batch methods, the distance between all unlabeled samples is computed at each iteration which is computationally prohibitive on large datasets. This modification simply update the minimum distance vector with each selected sample. Note that I do not remove the labelled sample from the vector to avoid memory reallocation and unnecessary complexity. However, one sample will not be selected twice as the distance to itself is 0.

I observe a difference in the performance on the ranked_batch example though.

Here is the performance history given master:

[0.3333333333333333, 0.8933333333333333, 0.84, 0.9266666666666666, 0.9333333333333333, 0.9466666666666667, 0.9533333333333334]

Here is the performance history on my branch: [0.3333333333333333, 0.8733333333333333, 0.8733333333333333, 0.9333333333333333, 0.9333333333333333, 0.9466666666666667, 0.9733333333333334]

I am still investigating this. Any insight is welcome.

cosmic-cortex commented 4 years ago

Hi! Thanks for the PR! I'll try to remove it soon and get back to you.

cosmic-cortex commented 4 years ago

Just a quick update, I am still working on reviewing it, not sure if I understand everything correctly, so I need to spend more time with it.

I have also noticed that in my previous comment, I wrote "I'll try to remove it soon" :D I actually meant to say I'll try to review it soon. Sorry for the potential confusion :)

AlexandreAbraham commented 4 years ago

Hey, No problem, let me know of I can add comments to make it more readable.