hediet / vscode-debug-visualizer

An extension for VS Code that visualizes data during debugging.
https://marketplace.visualstudio.com/items?itemName=hediet.debug-visualizer
GNU General Public License v3.0

Tensor visualizer #102

Open vadimcn opened 3 years ago

vadimcn commented 3 years ago

I often wish there were a decent visualizer for NumPy/PyTorch tensors. Debugging dimension indexing bugs with the way the Python extension displays tensors is a total drag :(

I know that Debug Visualizer has grid visualization, but that only works for 2D tensors with several hundred elements. I routinely need to deal with 4+ dimensional tensors with millions of items, so a simple grid doesn't cut it. There would need to be some sort of summarization along large dimensions and probably a UI for choosing which dimensions to display.

I know, this is quite a bit to ask, but if Debug Visualizer had that, I suspect it would be a lot more popular with data scientists!

hediet commented 3 years ago

Awesome idea! I'm not into Python, though. Can you describe exactly which data you want to visualize and what the visualization should look like?

For millions of data points, it might make sense to preprocess them in Python so that sending the data to the debug visualizer stays fast.

vadimcn commented 3 years ago

> Awesome idea! I'm not into Python, though. Can you describe exactly which data you want to visualize and what the visualization should look like?

I don't have a very concrete idea about how it should be rendered. Maybe something like Excel crossed with illustrations in this TensorFlow tutorial...

Another idea can be taken from the way NumPy abbreviates large tensors:

       [[94, 17, 95, ..., 86, 58,  6],
        [98, 31, 48, ..., 34, 82,  1],
        [38, 19, 56, ..., 44, 71, 65],
        ...,
        [40, 10, 41, ..., 97, 42, 49],
        [97, 41, 81, ..., 80, 41, 61],
        [94, 87, 93, ..., 12, 76, 17]],

In an interactive UI, this could be recast as freezing the first and last few rows and columns while allowing the rest to be scrolled.
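To make the frozen-edges idea concrete, here is a minimal sketch in plain Python. It assumes the tensor slice is a rectangular list of lists; the names `visible_indices` and `frozen_view`, and the `edge`, `offset`, and `window` parameters, are made up for illustration and are not part of Debug Visualizer or NumPy.

```python
def visible_indices(size, edge, offset, window):
    """Indices shown along one axis: frozen head, scrolled middle window, frozen tail."""
    if size <= 2 * edge:
        # Axis is small enough to show in full.
        return list(range(size))
    middle_len = size - 2 * edge
    # Clamp the scroll offset so the window stays inside the middle region.
    offset = max(0, min(offset, max(0, middle_len - window)))
    head = list(range(edge))
    middle = list(range(edge + offset, edge + min(offset + window, middle_len)))
    tail = list(range(size - edge, size))
    return head + middle + tail


def frozen_view(grid, edge, row_offset, col_offset, window):
    """Return the visible cells of a 2D list of lists (hypothetical helper)."""
    rows = visible_indices(len(grid), edge, row_offset, window)
    cols = visible_indices(len(grid[0]) if grid else 0, edge, col_offset, window)
    return [[grid[r][c] for c in cols] for r in rows]


# A 10x10 grid whose cell value encodes its position: row * 10 + column.
grid = [[r * 10 + c for c in range(10)] for r in range(10)]
view = frozen_view(grid, edge=2, row_offset=0, col_offset=3, window=2)
```

Scrolling only changes `row_offset` and `col_offset`; the first and last `edge` rows and columns are always included in the result.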

Also, since, realistically, you can show only 3-4 dimensions at a time, the displayed dimensions need to be selectable.
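As a rough sketch of what "selectable dimensions" could mean on the data side: pick two axes to lay out as rows and columns, and pin every other axis to a single index. This assumes a rectangular nested list and an explicitly passed shape; `get_item`, `select_2d`, and the `fixed` mapping are hypothetical names used only for this example.

```python
def get_item(data, index):
    """Walk a nested list using one index per dimension."""
    for i in index:
        data = data[i]
    return data


def select_2d(data, shape, row_axis, col_axis, fixed):
    """Build a 2D view: two axes vary, all others are pinned via `fixed` (axis -> index).

    Axes missing from `fixed` default to index 0 (an arbitrary choice for this sketch).
    """
    view = []
    for r in range(shape[row_axis]):
        row = []
        for c in range(shape[col_axis]):
            index = [fixed.get(axis, 0) for axis in range(len(shape))]
            index[row_axis] = r
            index[col_axis] = c
            row.append(get_item(data, index))
        view.append(row)
    return view


# A 2x3x4 tensor whose value encodes its position: 100*i + 10*j + k.
shape = (2, 3, 4)
data = [[[100 * i + 10 * j + k for k in range(4)] for j in range(3)] for i in range(2)]

# Show axis 2 as rows and axis 0 as columns, with axis 1 pinned to index 2.
view = select_2d(data, shape, row_axis=2, col_axis=0, fixed={1: 2})
```

A UI for choosing dimensions would then only need to send `row_axis`, `col_axis`, and `fixed` back to the data provider.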

> For millions of data points, it might make sense to preprocess them in Python so that sending the data to the debug visualizer stays fast.

Yeah, I think there will need to be some sort of request-response protocol between the data provider and the extension. For example, you could extend the schema returned by expressionTemplate with

       { "kind": "tensor", "dtype": "bool|int|float|complex|string|...", "shape": [<list of dimension sizes>], "handle": <any> }

and then have another expression template for requesting a subslice of the tensor (taking the tensor handle and a list of dimension ranges).
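On the Python side, the two halves of such a protocol might look roughly like the sketch below. Everything here is an assumption for illustration: the handle registry, the function names `describe_tensor` and `request_slice`, the `"tensor-slice"` response kind, and the use of nested lists in place of a real NumPy/PyTorch tensor. Only the `kind`/`dtype`/`shape`/`handle` fields come from the proposal above.

```python
import itertools

# Hypothetical registry mapping opaque handles to live tensors in the debuggee.
_tensors = {}
_next_handle = itertools.count(1)


def shape_of(data):
    """Shape of a rectangular nested list (assumes every sublist has equal length)."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        if not data:
            break
        data = data[0]
    return shape


def describe_tensor(data, dtype):
    """First response: metadata only, so no element data crosses the wire."""
    handle = next(_next_handle)
    _tensors[handle] = data
    return {"kind": "tensor", "dtype": dtype, "shape": shape_of(data), "handle": handle}


def slice_nested(data, ranges):
    """Apply one (start, stop) range per leading dimension."""
    if not ranges:
        return data
    (start, stop), rest = ranges[0], ranges[1:]
    return [slice_nested(item, rest) for item in data[start:stop]]


def request_slice(handle, ranges):
    """Follow-up request: return only the requested sub-block of the tensor."""
    data = slice_nested(_tensors[handle], ranges)
    return {"kind": "tensor-slice", "handle": handle, "ranges": ranges, "data": data}


# Usage: describe a 4x5 tensor, then fetch rows 1..2 and columns 0..1.
tensor = [[r * 10 + c for c in range(5)] for r in range(4)]
description = describe_tensor(tensor, "int")
response = request_slice(description["handle"], [[1, 3], [0, 2]])
```

With this split, the extension can decide from `shape` alone how much to request, and the amount of data transferred depends on the viewport rather than on the tensor size.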