posit-dev / ark

Ark, an R kernel
MIT License
159 stars 9 forks source link

Protocol for extending the variables pane #560

Open dfalbel opened 1 day ago

dfalbel commented 1 day ago

This PR is a proposal of a mechanism that allows package authors to implement custom behavior for objects in the variables pane.

A description of how the protocol works is available here: https://github.com/posit-dev/ark/blob/daf9348affc0f1b190928a2fb0cad5bfadaf263c/doc/variables-pane-extending.md

See also #561 for the custom inspect behavior.

Motivation

The main motivation is to allow package authors to improve the display of objects for which the internal R data structure does not show useful information for users. For example, reticulate R obejcts are environments containing a flag and an external pointer (x <- reticulate::py_eval("1", convert = FALSE)):

image

Or torch tensors (x = torch::torch_randn(10, 10)) :

image

Or xml2 documents (x <- xml2::read_xml("<foo> <bar> text <baz/> </bar> </foo>")):

image

How it's implemented

Package authors can implement a set of S3 like methods, each customizing a specific part of the UI. Methods are discovered by ark when packages are loaded and lazily stored in an environment. Package authors don't need to export these methods in their packages. See here for the description of methods.

When responding to events from the variables pane, we try to apply the custom methods and if the method works correctly we use it's value, otherwise we try to continue using the default implementation.

Methods are expllicitly implemented for Ark, so it's expected that package authors will take extra care to make they work correctly with Positron.

Comparison to RStudio

RStudio uses the utils::str S3 in a few situations to obtain the values description, see usages of ValueFromStr in eg: https://github.com/rstudio/rstudio/blob/2175496934a54e21979461a1c67eb13ad5048cf2/src/cpp/session/modules/SessionEnvironment.R#L122

There are many guardrails though, as str methods are commonly implemented and package authors might not be aware of such bugs when used from RStudio.

github-actions[bot] commented 1 day ago

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

t-kalinowski commented 1 day ago

I have read the CLA Document and I hereby sign the CLA

dfalbel commented 13 hours ago

I like the idea of a single ps_variable_proxy.

I think I'd still want to customize the display value though, eg for a torch_tensor, I think the display value should be something like Float [1:10, 1:10], I think the closest proxy would be a length 1 character vector with that text, which would still be displayed with quotes around it:"Float [1:10, 1:10]".

I'm fine with display type = class(x)[1], we could remove display_type.

So the proxy would replace has_children(), get_children() and get_child_at().

My main concern with a single proxy implementation is performance. The way the variables pane is currenty implemented, we don't keep any state after the representations are sent to the front-end and we only get back a vector of access keys from the front-end when we want to expand a node.

Considering we implement the second option where variable_proxy() returns a list-proxy containing classed objects implementing variable_proxy. So suppose that A is an object that implements the variable_proxy() and it's children are also objects that also implement variable_proxy() (For instance, A could be a reticulate object representing a Python dictionary and B a dictionary contained in A).

When we click on B, so we get it's children, we'll receive a request from the front-end containing [A, B]. To compute the children of B, we'll need to:

  1. ps_variable_proxy(A)
  2. find B there and then
  3. ps.variable_proxy(B)
Root
└── A
    ├── B
    │   ├── B1
    │   └── B2
    └── C

If A is large, building it's variable_proxy() can be time consuming and we'll need to rebuild it for every. neste child one expands.

The main reason for having two separate get_children() and get_child() is that there might be an efficient way of finding B in A without building the entire list of possible children.

IMO this will be a rare situation though, except for reticulate, most use cases I can think would only want to show a nice display value. So maybe it's fine to let package authors guarantee that variable_proxy is fast enough, even if they need to implement some kind of caching, etc. Or maybe, we want to keep some state in the variables pane so subsequent requests don't need to recompute the children.

Let me know what you think!

lionel- commented 13 hours ago

hmm if performance is an issue I think we should be able to store the proxies in our current_bindings? These would only be updated if the objects change across top-level calls.

t-kalinowski commented 13 hours ago

I think we can add support for a ps_variable_proxy, which could be in addition to the methods proposed in the PR, as an alternative API. However, I would not want to require that the proxy approach is the only supported API.

The 'base-atomic-proxy' approach creates difficulties for objects that cannot be easily materialized as a base atomic. This is particularly problematic for objects that would be much larger in memory when materialized as a base atomic type, and for objects that have no suitable base-atomic equivalent.

For example, large sparse arrays or arrays with smaller dtypes pose challenges. It's not uncommon to have a single-digit GB-sized NumPy array with a 'float32' dtype. Materializing that as a 'base-atomic-proxy' would mean allocating a double-digit GB double array every time the variables pane is updated. Similarly, if a user has a 10 GB array of int8s, we'd be materializing a 40 GB R integer array with each variables pane update.

(We encounter these same limitations of the vctrs-proxy approach in other places, like https://github.com/rstudio/reticulate/issues/1481)

Many objects also lack a suitable base atomic equivalent. This opens up tricky questions. For instance:

  1. Python AST nodes: import ast; n = ast.Name('x'). Would this need to be presented as a string or an R symbol?

  2. Python futures: fut = concurrent.futures.ThreadPoolExecutor().submit(fn). Would reticulate have to construct an R {coro} or {future} proxy?

  3. Lazy-loaded arrays like https://www.bioconductor.org/packages/release/bioc/html/DelayedArray.html: How would the correct "true" dimensions of such an array be communicated without materializing a full proxy?

lionel- commented 12 hours ago

In my mind the variable pane should never contain GBs of data and any materialised data should be truncated - to support large data we'd need a similar approach to the data viewer and then the proxy becomes a slice proxy.

I see your points regarding flexibility of data types in the context of interop. I think this aligns with Daniel's request to keep a display value extension point?

t-kalinowski commented 12 hours ago

I think this aligns with Daniel's request to keep a display value extension point?

I'm probably misunderstanding, but I don't think it does. A code example might help.

With the proxy approach, what method(s) would reticulate implement to display a NumPy array?

dfalbel commented 12 hours ago

I think for a numpy array, reticulate would only implement display_value and a proxy method that returns anything that's scalar atomic, so the arrow \/ to list children doesn't appear for it in the variables pane (equivalent to current has_children() = FALSE).

Storing proxies in current_bindings is probably possible, but the implementation will be quite tricky, because an object implementing variable_proxy() could be any level nested into an object that is pointed by one of the current_bindings. I can take a look at what that could look like.

lionel- commented 11 hours ago

@dfalbel Feel free to take a look but I think you've both convinced me that the proxy approach would not be a simplification.

dfalbel commented 8 hours ago

Ok, I didn't get to it, but happy to experiment if you think it's worth it. I set up an example package here: https://github.com/dfalbel/testVariablesPaneExtension/tree/main/R with an example that works for R6 classes and some examples of what I'd like to make it work for torch.