tensorflow / tensorboard

TensorFlow's Visualization Toolkit
Apache License 2.0
6.69k stars 1.66k forks

Embedding projector only loads first 100,000 vectors #773

Open vitalyli opened 6 years ago

vitalyli commented 6 years ago

The embedding projector only loads the first 100,000 vectors. In many real-world applications, embedding dictionaries are well over 1 million entries. We need some way to display vectors from larger sets, or at least a way to configure what the upper limit is.

vitalyli commented 6 years ago

It appears that this limit is hardcoded here: .//tensorboard/plugins/projector/vz_projector/data-provider-server.ts export const LIMIT_NUM_POINTS = 100000;

jart commented 6 years ago

Everything in the projector is done on the client side. There's a limit to how much the browser can handle. I'd be interested in hearing about whether or not things worked out if you changed the limit by hand.

vitalyli commented 6 years ago

I tried changing this limit, but the client still said it was showing the first 100k, which made me wonder whether the server dictates that limit, or whether it is cached somewhere in the browser. It would be good to be able to send that limit as a parameter to the server. Often the task is searching for a vector by label and looking for its closest vectors; if the projector simply takes the first 100k, what can be explored is limited given a 1-million-plus embedding file. Maybe the distance computation could be pushed to the server, removing the need for the client to do the filtering altogether.
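A server-side nearest-neighbor lookup along these lines could be quite small. The sketch below is hypothetical, not TensorBoard code: it assumes the embeddings are a NumPy array with a parallel label list, and returns the k most cosine-similar vectors to a queried label, so the client never needs the full matrix.

```python
import numpy as np

def nearest_by_label(embeddings, labels, query_label, k=10):
    """Hypothetical server-side helper: cosine-similarity search
    over the full embedding matrix, done where the data lives."""
    i = labels.index(query_label)
    # Normalize rows once; cosine similarity is then a plain dot product.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[i]
    order = np.argsort(-sims)
    order = order[order != i][:k]  # drop the query vector itself
    return [(labels[j], float(sims[j])) for j in order]
```

A real implementation would want an approximate index (e.g. an ANN library) for million-scale sets, but the brute-force version already shows why pushing this to the server removes the 100k client cap from the search path.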

vitalyli commented 6 years ago

If we can't make the client handle more than 100k, what would be really useful is telling the server to sample the data instead of returning the first 100k. Think of data sorted by popularity: always seeing the first 100k out of 1 million is biased towards more popular items. Ideally the server would return a stratified sample, say a random 10k from each block of 100k, giving a good representative sample of the full 1 million.
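The stratified scheme described above is easy to sketch. This is an illustrative helper, not TensorBoard code; it assumes row-major NumPy embeddings and draws a fixed number of rows from each contiguous block, so popularity-sorted data yields a representative subset rather than just the head of the file.

```python
import numpy as np

def stratified_sample(vectors, stratum_size=100_000, per_stratum=10_000, seed=0):
    """Sample per_stratum rows from each contiguous stratum_size block.
    Returns the sampled vectors and their original row indices."""
    rng = np.random.default_rng(seed)
    picked = []
    for start in range(0, len(vectors), stratum_size):
        block = np.arange(start, min(start + stratum_size, len(vectors)))
        k = min(per_stratum, len(block))
        picked.append(rng.choice(block, size=k, replace=False))
    idx = np.sort(np.concatenate(picked))
    return vectors[idx], idx
```

Keeping the original indices around matters: the metadata TSV has to be subsampled with exactly the same rows, or labels and points fall out of alignment.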

Seanspt commented 6 years ago

Upvote for sampling instead of returning the first 100k. Also, it would be great if a group of wanted IDs could be passed in.

nfelt commented 6 years ago

We'd welcome a contribution to implement server-side sampling if someone wants to take this on.

kapilkd13 commented 6 years ago

Hi @nfelt, I would like to take this on. Can you point me to the files corresponding to the embedding projector? Also, any suggestions/ideas?

rahulkrishnan98 commented 5 years ago

@vitalyli once the projector runs on 100,000+ vectors plus metadata, it can limit and sample the vectors, but loading the metadata fails even for the points that were loaded.

hvout commented 5 years ago

Hello. Sorry for bringing this up, but the folder .//tensorboard/plugins/projector/vz_projector/ does not exist in my installation (installed with pip inside a miniconda venv with Python 3.6, latest tensorboard version). Does anyone know where I can find that folder to increase the limit?

hvout commented 5 years ago

I was able to increase it in the projector_plugin.py file under tensorboard/plugins/projector, and it does work. But t-SNE and PCA keep sampling data for "faster results" - I believe those limits are set in data.ts, but when installed with pip the vz_projector folder does not exist.
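For anyone else hunting for the right file in a pip install: the vz_projector TypeScript sources are not shipped in the wheel (only compiled assets and the Python plugin are), so the place to look is the installed package directory. A small illustrative helper, not part of TensorBoard, for locating it:

```python
import importlib.util
from pathlib import Path

def package_subdir(package, *parts):
    """Return a subdirectory of an installed package's location,
    e.g. package_subdir("tensorboard", "plugins", "projector")."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(package)
    return Path(spec.origin).parent.joinpath(*parts)
```

Printing `package_subdir("tensorboard", "plugins", "projector")` gives the directory containing projector_plugin.py in your environment.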

alexdevmotion commented 4 years ago

I also have this issue, has anyone found an easy fix?

RSKothari commented 4 years ago

Hey guys, any luck on this topic? In my case, it only samples 120 data points. A tip I could offer to speed things up would be a "PCA + t-SNE" option. It could drastically reduce embedding sizes and the load on the browser's RAM.
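The "PCA + t-SNE" idea can be prototyped offline before logging embeddings at all. Below is a plain-NumPy PCA sketch (illustrative; in practice scikit-learn's PCA and TSNE classes do the same job) that shrinks vectors to a few dozen dimensions so a subsequent t-SNE run, or the browser, has far less data to chew on.

```python
import numpy as np

def pca_reduce(x, n_components=50):
    """Project the rows of x onto the top principal components.
    Feeding the result to t-SNE instead of the raw vectors cuts
    memory and compute substantially."""
    x = x - x.mean(axis=0)            # center each feature
    # SVD of the centered data; rows of vt are the principal axes,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:n_components].T
```

Reducing, say, 768-dimensional transformer embeddings to 50 dimensions before the projector ever sees them is exactly the kind of ahead-of-time preprocessing suggested above.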

nlp4whp commented 4 years ago

> I'm able to increase it in the projector_plugin.py file under tensorboard/plugins/projector and it does work. But T-SNE and PCA keep sampling data for "faster results" - I believe these limits are set in data.ts but when installed with pip the vz_projector folder does not exist

You are right... it looks like we have to modify something in data.ts for the PCA and t-SNE sampling.

Although the limit is defined in data-provider-server.ts as export const LIMIT_NUM_POINTS = 100000;, it is applied in the back end in projector_plugin.py, where the final tensor is returned (see _serve_metadata(self, request) or _serve_tensor(self, request)).
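If someone does wire this up properly, the server-side change could be as small as reading the cap from the request instead of a hardcoded constant. The handler shape and parameter name below are hypothetical, purely to illustrate the idea; they are not TensorBoard's actual API.

```python
DEFAULT_LIMIT = 100_000   # mirrors the hardcoded LIMIT_NUM_POINTS
MAX_LIMIT = 1_000_000     # hypothetical safety ceiling

def resolve_point_limit(query_params):
    """Parse an optional ?limit= query parameter, clamped to sane bounds,
    falling back to the current default on bad input."""
    raw = query_params.get("limit", str(DEFAULT_LIMIT))
    try:
        return max(1, min(int(raw), MAX_LIMIT))
    except ValueError:
        return DEFAULT_LIMIT
```

A hard ceiling still matters because, as noted earlier in the thread, everything downstream happens in the browser and there is a real limit to what it can handle.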

GeorgePearse commented 2 years ago

@RSKothari @nlp4whp It might even make a lot of sense to fork the embedding projector component and remove the in-browser interactive dimensionality reduction (replacing it with whatever dimensionality-reduction technique a data scientist wants to apply ahead of time). The embedding projector has a lot of value on its own as a high-performance 3D visualization tool with convenient access to metadata. Unless people know a better alternative for point clouds with metadata?

wizz92 commented 1 year ago

You just need to change qO=1e5 to qO=1e6 in /tensorboard/plugins/projector/tf_projector_plugin/projector_binary.js; that worked fine for me.
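Since projector_binary.js is minified, the variable name (qO here) can differ between releases, so it is worth grepping before patching. A hedged sketch of the edit, demonstrated on a scratch file; in practice FILE would point at projector_binary.js inside your installed tensorboard package:

```shell
# Demonstrated on a scratch file; substitute the real path to
# plugins/projector/tf_projector_plugin/projector_binary.js in site-packages.
FILE=$(mktemp)
printf 'var qO=1e5;' > "$FILE"
grep -c 'qO=1e5' "$FILE"           # confirm the pattern exists first
sed -i 's/qO=1e5/qO=1e6/' "$FILE"  # raise the cap from 100k to 1M
grep -c 'qO=1e6' "$FILE"
```

Note that `sed -i` takes a mandatory backup suffix on BSD/macOS sed (`sed -i ''`), and the edit must be reapplied after every pip upgrade.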

saikot-paul commented 12 months ago

Is it possible to change any of these parameters in a colab environment?

arcra commented 11 months ago

Given that this currently requires modifications to the source code, there is no way to change this behavior with the supported extension from Colab. There might be ways to use a custom version of TensorBoard with a "local runtime" in Colab, but I'm not knowledgeable enough about Colab to provide any guidance in that regard.

If a locally modified version of TensorBoard would be sufficient (i.e. just running a standalone TB, not in Colab), you can take a look at our DEVELOPMENT guide for some pointers on how to run a local instance.

With respect to better supporting this as a feature in a future release, I'm afraid it's unlikely we'll prioritize it: there hasn't been any active development in this area/plugin, no one left on the team is familiar with this part of the code, and it's probably not an easy thing to solve in a generic way (e.g. without affecting performance on some browsers/machines, and without UI support to let users configure these visualization parameters). If anybody is interested in contributing, please get in touch with us.