glevyhas / pix-plot

A WebGL viewer for UMAP or TSNE-clustered images
MIT License

How to speed up the process for new files? #241

Open mostafa8026 opened 3 years ago

mostafa8026 commented 3 years ago

Is there a way to improve this technique? When I add a new file, I have to redo everything to find similarities. Is there a way to speed up the process of adding new files?

mostafa8026 commented 3 years ago

Any suggestions for implementing it myself would be appreciated. Thanks.

duhaime commented 3 years ago

@mostafa8026 Good question!

The data processing pipeline has a few steps, the first of which transforms each image into a vector. The image vectors are computed and cached (in outputs/data/image-vectors) and so can be read directly after the first run, which should greatly expedite processing.
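To illustrate the caching idea in general terms, here is a minimal sketch of a per-image vector cache. It is not pix-plot's actual code: the function name `cached_vectors`, the `vectorize` callback, and the one-JSON-file-per-image layout are all assumptions made for the example.

```python
import json
from pathlib import Path


def cached_vectors(image_paths, cache_dir, vectorize):
    """Return {image name: vector}, computing only vectors missing from cache_dir.

    `vectorize` is a hypothetical callback that maps an image path to a
    sequence of floats (for example, the output of a neural network).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    vectors = {}
    for image_path in map(Path, image_paths):
        # Assumed layout: one JSON file per image, keyed by file name.
        cache_file = cache_dir / (image_path.name + ".json")
        if cache_file.exists():
            # Cache hit: skip the expensive vectorization step.
            vectors[image_path.name] = json.loads(cache_file.read_text())
        else:
            # Cache miss: compute the vector once and store it for later runs.
            vector = [float(x) for x in vectorize(image_path)]
            cache_file.write_text(json.dumps(vector))
            vectors[image_path.name] = vector
    return vectors
```

With a cache like this, adding one new image to an existing collection costs a single call to `vectorize`. Note that this sketch keys the cache on the file name alone, so two images with the same name in different directories would collide.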

It's also worth noting that one can use a GPU to accelerate the creation of those image vectors. See the segments of the README on CUDA acceleration if that's an option for you.

From there, we need to project the vectors down to 2D for visualization. Right now we create a new UMAP model for this projection each time a user runs the pixplot command. But we could cache the model from the first run and then use it for subsequent runs. The tradeoff here is between model accuracy and performance: using a cached model will run faster, but because that model was never fit on the new images, the projection will be less expressive and could fail to display some patterns that are latent in the distribution. Creating a new model each run maximizes data expressivity but slows down processing.

If you're interested in the idea, check out the UMAP docs on projecting new data with an extant model. We have some code for saving models and loading saved models you could consult if you wanted to try using cached models when processing data. If that sounds interesting, please feel free to send a PR and we'll be happy to review and help it get accepted!
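As a starting point, the cached-model idea could look roughly like the sketch below. It assumes the umap-learn package is installed and uses `pickle` for persistence; the helper names `load_or_fit` and `project` are hypothetical and are not taken from pix-plot's code, which may save and load models differently.

```python
import pickle
from pathlib import Path


def load_or_fit(model_path, fit):
    """Load a pickled model from model_path, or call fit() and cache the result."""
    model_path = Path(model_path)
    if model_path.exists():
        # A model from an earlier run exists: reuse it instead of refitting.
        with model_path.open("rb") as f:
            return pickle.load(f)
    model = fit()
    model_path.parent.mkdir(parents=True, exist_ok=True)
    with model_path.open("wb") as f:
        pickle.dump(model, f)
    return model


def project(vectors, model_path):
    """Project vectors to 2D, fitting UMAP only if no cached model exists."""
    # Assumption: umap-learn is installed. It is imported lazily so that
    # load_or_fit stays usable without it.
    import umap

    model = load_or_fit(
        model_path, lambda: umap.UMAP(n_components=2).fit(vectors)
    )
    # transform() places vectors in the embedding space of the fitted model.
    return model.transform(vectors)
```

On the first run `project` fits and saves a model; on later runs it only calls `transform`, which is where the speedup comes from. Only load pickled models from files you created yourself, since unpickling untrusted data can execute arbitrary code.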