Closed · krassowski closed this 5 months ago
The `content_type` and `file_format` appear redundant to me. Notebooks will always have the `json` file format and the `notebook` content type. Can a file get a `json` file_format?
Jupyter server's contents manager needs this information to access a file, so I guess the redundancy comes from there. I've always been confused by that, and I don't even think the server should make a special case for notebooks.
I compared using cloning vs not using it for the use case from https://github.com/jupyterlab/jupyter-ai/pull/708:
test case | mean | median | stdev |
---|---|---|---|
5 cell notebook, no cloning | 0.59ms | 0.56ms | 0.18ms |
5 cell notebook, cloning | 0.79ms | 0.77ms | 0.06ms |
100 cell notebook, no cloning | 7.63ms | 7.36ms | 0.93ms |
100 cell notebook, cloning | 7.30ms | 7.16ms | 0.56ms |
It appears that making a copy of the shared model has negligible cost compared to the cost of iterating the notebook and extracting the code / concatenating it into a prefix and suffix. Still, for 100 cells I expected it to be much faster than 7ms.
Maybe the cost is in calling `to_py()` repeatedly? On the other hand, I do not want to convert everything to a JSON/Python dict straight away because outputs may have GBs of irrelevant data. Something to investigate further.
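For context on how numbers like these are typically gathered, a minimal harness for this kind of microbenchmark could look like the sketch below. The workload is a stand-in (plain dicts rather than a real shared notebook model), not the actual jupyter-ai code:

```python
import statistics
import time


def benchmark(fn, *, iterations=1000):
    """Time ``fn`` repeatedly and report mean/median/stdev in milliseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),
    }


# Stand-in workload: concatenating the sources of a 100-"cell" notebook,
# analogous to building the prefix/suffix strings discussed above.
cells = [{"source": f"print({i})"} for i in range(100)]
stats = benchmark(lambda: "\n".join(cell["source"] for cell in cells))
```

For stable numbers the iteration count matters; with too few iterations the stdev column is dominated by scheduler noise.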
Edit: here are the results - using `YNotebook.get()` is much more expensive even if the notebook has no outputs at all (because each cell has metadata).
test case | mean | median | stdev |
---|---|---|---|
100 cell notebook, cloning, `.get()` | 29.68ms | 28.80ms | 2.78ms |
100 cell notebook, cloning | 8.46ms | 7.87ms | 1.48ms |
When using 10 000 iterations in a microbenchmark I sometimes see:
```
Traceback (most recent call last):
  File "/lib/python3.10/site-packages/distributed/profile.py", line 367, in _watch
    del frame
RuntimeError: pycrdt::map::Map is unsendable, but is being dropped on another thread
```
Does it mean that I need to respect locks even if only reading the content (in an async function)?
pycrdt 0.8.17
pycrdt-websocket 0.12.7
This only means that `pycrdt` objects should be created, used, and deleted in the same thread. This is stricter than just not being thread-safe. In your case I see you're using Dask, which probably uses threads, and the object is garbage-collected in a different thread than the one in which it was created. I also see this error in pytest sometimes. This issue gives some options to fix it.
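One generic way to avoid dropping such objects on the wrong thread is to confine their whole lifecycle (creation, use, and deletion) to a single dedicated worker thread. A plain-Python sketch of that pattern, with no pycrdt involved; `ThreadConfined` is a hypothetical helper illustrating one possible fix, not an existing API:

```python
import queue
import threading


class ThreadConfined:
    """Run all operations on one worker thread so the wrapped object is
    created, used, and dropped on the same thread."""

    def __init__(self, factory):
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run, args=(factory,), daemon=True)
        self._thread.start()

    def _run(self, factory):
        obj = factory()  # created on the worker thread
        while True:
            task, reply = self._tasks.get()
            if task is None:
                del obj  # dropped on the same thread it was created on
                reply.put(None)
                return
            reply.put(task(obj))

    def call(self, task):
        """Run ``task(obj)`` on the worker thread and return its result."""
        reply = queue.Queue()
        self._tasks.put((task, reply))
        return reply.get()

    def close(self):
        reply = queue.Queue()
        self._tasks.put((None, reply))
        reply.get()
        self._thread.join()
```

A real fix would also need to handle exceptions raised inside `task`; this sketch only shows the confinement idea.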
I'm thinking that a way to have read-only access to a shared document could be by using a transport layer that doesn't let a client send their changes. We could for instance create a document server/provider that uses Server-Sent Events.
## Problem
Other server extensions would like to use the shared document model to access the content of the notebook without the overhead of sending it over from the frontend, and without the risk of using an outdated (or even inaccessible) copy stored on the disk.
Currently, while this is possible as demonstrated in https://github.com/jupyterlab/jupyter-ai/pull/708, it requires:

- accessing the private `_document` attribute of the `DocumentRoom`
- using `encode_file_path` and some other black magic (which has no guarantee of working if the ID format changes in a future release of `jupyter-collaboration`)

## Proposed Solution
Expose a new `get_document()` method on the `YDocExtension`. One question raised by @3coins on the server call was whether such an API should return a read-only view of the document.
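A minimal sketch of what such a method might look like. This is hypothetical: the room-lookup details and the `copy` flag are assumptions for illustration, not the actual jupyter-collaboration implementation:

```python
from typing import Any, Optional


class YDocExtensionSketch:
    """Hypothetical sketch of the proposed API; the real YDocExtension in
    jupyter-collaboration manages rooms through its websocket server."""

    def __init__(self, rooms: dict):
        # maps a document path to its live shared document
        self._rooms = rooms

    def get_document(self, path: str, *, copy: bool = True) -> Optional[Any]:
        """Return the shared document for ``path``, or None if no room is open.

        With ``copy=True`` the caller receives a snapshot (the cloning cost
        measured in the benchmarks above); with ``copy=False`` the live
        document is returned and the caller must respect locking rules.
        """
        document = self._rooms.get(path)
        if document is None:
            return None
        return document.copy() if copy else document
```

Defaulting to `copy=True` would make the safe behaviour the easy one, given that the benchmarks above suggest the cloning overhead is negligible.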
For the `YNotebook` subclass, this could mean an encapsulation hiding methods such as `append_cell` and `set_cell`. This would require defining read-only variants/wrappers for each of the data structures defined in jupyter-ydoc.

## Task list
## Additional context
This would follow the same pattern as used by `jupyter-server-fileid`, which exposes a small (two-method) public API as documented here.