As I have explained, we have a data exchange model that exchanges variables between live kernels with the help of a SoS kernel. This allows the creation of multi-language "workflows" in Jupyter. For example, in a notebook, one can process some data in bash, gather the results and analyze them in Python, then transfer the processed data to R and plot them. The "Run all cells" action essentially runs the "workflow".
Then we have the SoS workflow engine for batch data processing, which is designed to be as close to the interactive "run all cells" workflow as possible so that users can migrate from a notebook workflow to a formal command-line workflow as easily as possible, as described in my JupyterCon talk starting at 22:45.
However, the inter-language data exchange method in SoS Notebook no longer works for the SoS workflow system because there are no live kernels during the execution of SoS workflows. We had some intense discussions about this and thought that a "sos-dataexchange" component could solve the problem, but then we realized that this is a huge topic that requires substantial effort to implement, and that it can be applied beyond the SoS workflow system.
The result of the discussion was that we should formalize our ideas and apply for support to implement them, via foundation grants and/or Google Summer of Code.
BTW, I was happy to see the DataBus proposal because it has a lot in common with what I had in mind for the sos-dataexchange
project. Because of the complexity and decentralized nature of the project, I believe it would be much better for the JupyterLab team to take the lead: they have already secured some funding for the project, have a lot more brain power than my group, and have a lot more influence in the community to attract contributions and push the project forward, and the end product could be used almost as-is by SoS.
Here is a start for your Wiki or your Readme:
Many thanks to @10Dev for including me in the discussion. The proposed DataBus is at the JupyterLab level; it is mostly designed for extensions that consume dataframe-like data, but I suppose language kernels could make use of DataBus later if an API is provided, and it could be expanded to support more data types. In that case any kernel could use some magics to read from and write to the bus and exchange data with the frontend and other kernels. This is brilliant!
Anyway, before DataBus becomes available, I would like to write a bit about how SoS does a similar thing to exchange data between multiple kernels in the same notebook. Basically, SoS is a super kernel that allows the use of multiple kernels in one notebook and the exchange of variables among them. Using a %get magic in the format of
`%get var_name --from kernel_B` (executed in kernel_A), SoS creates an independent variable of the same name in kernel_A with a type similar to that of the variable in kernel_B. This currently works for kernels of 11 languages and for most native data types, and requires no modification to Jupyter or to the supported kernels.
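For example (a hypothetical session, assuming an R subkernel in the same notebook already holds R's built-in `mtcars` data frame), a Python 3 cell could do:

```python
%get mtcars --from R     # pull the R data.frame into this Python kernel
type(mtcars)             # typically a pandas.DataFrame after the transfer
mtcars.head()
```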
Under the hood, SoS defines language modules for each language (e.g. sos-r, sos-python) that "understand" the data types of the language and assist the transfer of variables, either directly or by way of the SoS (python3) kernel. More specifically, when
`%get mtcars --from R` is executed from a kernel, SoS runs a piece of code (hidden from users) in the R kernel to save mtcars to a temporary feather file (based on Apache Arrow), and runs another piece of code in the destination kernel to load it. Simpler data types can be transferred directly via memory.
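As a rough Python sketch (not the actual sos-r code; the helper names below are made up for illustration), such a transfer boils down to two generated snippets, one run in the source kernel and one in the destination kernel:

```python
# Sketch of the two hidden snippets a language module might generate
# for "%get mtcars --from R"; function names are hypothetical.
import tempfile

def r_sending_code(name: str, path: str) -> str:
    # Executed in the source R kernel: dump the data frame to a feather file.
    return f'feather::write_feather({name}, "{path}")'

def python_receiving_code(name: str, path: str) -> str:
    # Executed in the destination Python kernel: load the feather file back.
    return f'import pandas as pd\n{name} = pd.read_feather("{path}")'

path = tempfile.mkstemp(suffix=".feather")[1]
print(r_sending_code("mtcars", path))
print(python_receiving_code("mtcars", path))
```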
This design is non-centric and incremental in the sense that:

- There is no central data bus, because kernel_A can transfer data directly to kernel_B.
- There is no guarantee of lossless data transfer: for example, Julia does not yet support row labels for data frames, so row labels will be lost if it gets mtcars from R.
- It can in theory support the transfer of any data type in any language by expanding the language modules (e.g. types such as Series, slice, Range). This also means a language module can be added to support only a few major data types and expand as needs arise.

I can imagine that SoS could make use of DataBus to expand the data exchange capacity to frontends and to assist the data exchange among kernels, so I will be happy to assist/participate in the development of DataBus. Actually, we ourselves have tried to conceptualize a similar project for data exchange between languages (sos-dataexchange) outside of Jupyter, which could benefit from the DataBus project.
I presented the data exchange feature of SoS in my JupyterCon talk in August. You can check out the YouTube video (starting from the 7-minute mark) if you are interested.
Allow me to propose another idea we had during the brainstorming of the sos-dataexchange project.
How about implementing DataBus as a separate project?
Here is how it might work:
- Implement DataBus as a data-warehouse sort of project that is independent of JupyterLab.
- DataBus can "consume" data or "interface" data. In the former case DataBus accepts and holds the content of the data; in the latter case DataBus knows how to access the data from the passed meta information. In the extreme case a DataBus can connect to other (public, remote, etc.) DataBuses.
- When a DataBus daemon is started, it exposes a few (zmq) communication channels. A protocol is defined for talking to the daemon to send and receive "data", in whole or in pieces, in certain ways (a rough sketch of such a protocol is given after the list of advantages below).
- Individual languages, Jupyter kernels, and JupyterLab extensions would implement their own libraries to talk to the DataBus.
- On the JupyterLab side, JupyterLab can start a DataBus instance or connect to an existing one, and let the rest of the components talk to the DataBus by themselves.

The advantages?
- It can be used without JupyterLab. Think of a scenario in which users start a databus instance and run a workflow that consists of steps in different scripting languages, using the databus to exchange data. This is basically the motivation for the sos-dataexchange project.
- It decentralizes the implementation, because each language and each data source (e.g. hdf5) can define its own library to work with DataBus. The core of DataBus would be the protocol, which can be implemented in different ways.
- I would also imagine more interest from the community if it has a broader scope.
- The possibility of chaining DataBuses, or connecting a DataBus to multiple DataBuses, could revolutionize the way we work with distributed datasets.
- In the case of JupyterLab, an extension can handle/visualize data from arbitrary databuses, not necessarily the one provided by JupyterLab.
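To make the daemon/protocol idea a bit more concrete, here is a minimal sketch in Python with pyzmq. The endpoint, framing, and command names are all assumptions for illustration, not a proposed standard:

```python
# Minimal sketch of a stand-alone "DataBus" daemon speaking a tiny put/get
# protocol over zmq.  Everything about the wire format here is hypothetical.
import zmq

def serve(endpoint="tcp://127.0.0.1:5555"):
    store = {}                          # data the bus "consumes" and holds
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REP)
    sock.bind(endpoint)
    while True:
        frames = sock.recv_multipart()  # [command, key, optional payload]
        cmd, key = frames[0], frames[1]
        payload = frames[2] if len(frames) > 2 else b""
        if cmd == b"PUT":
            store[key] = payload
            sock.send_multipart([b"OK", key])
        elif cmd == b"GET":
            sock.send_multipart([b"OK", key, store.get(key, b"")])
        else:
            sock.send_multipart([b"ERR", key, b"unknown command"])

# A client in any language only needs a zmq binding and this framing, e.g.
#   sock.send_multipart([b"PUT", b"mtcars", feather_bytes])
#   sock.send_multipart([b"GET", b"mtcars"])
if __name__ == "__main__":
    serve()
```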