BoPeng opened 6 years ago
Regarding how data should be communicated, I have some concerns about committing to `to=lan` too early. Unlike the case of SoS Notebook, where the flow of logic is linear (we usually create variables in language A and retrieve them in B, so `to=lan` makes sense), things are different in the context of workflows:
```
[A_1, B_1]
some scripts in language A
sos_export(to=?)

[B_2]
some scripts in language B

[A_2]
some scripts in language A
```
Then you see that for workflows A and B we have trouble determining, in `*_1`, which format to export to, and the only way out is to use the general format regardless of whether or not `lan` is supported.
So this is another reason why destination-aware format export should come last, and it should share an interface with the more general approach so that switching between formats involves minimal changes to code.
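A minimal sketch of what such a shared interface could look like (the function and registry names here are hypothetical, not the actual SoS API): the destination-aware path and the general fallback have the same signature, so switching between formats changes no calling code.

```python
import json

# Hypothetical registry of destination-aware exporters,
# e.g. EXPORTERS["R"] = write_feather_for_R
EXPORTERS = {}

def sos_export(variable, filename, to=None):
    """Export `variable` to `filename`: use a destination-aware
    format when `to` is a supported language, otherwise fall back
    to the general JSON-based format."""
    if to in EXPORTERS:
        EXPORTERS[to](variable, filename)
    else:
        with open(filename, "w") as f:
            json.dump(variable, f)

# The call site is identical either way:
# sos_export(data, "shared.dat")          # general fallback
# sos_export(data, "shared.dat", to="R")  # destination-aware
```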
The `to` parameter is by design optional, so users can always do `export(data)` to save `data` in a portable manner (which in the case of SoS Notebook means exporting to the `sos` kernel), with limited datatypes and/or more information loss, poorer performance, etc.
In theory we can create a really powerful language-independent format, but creating bindings for all supported languages can be a real pain. So in my imagination, we should start from a small "core" set of datatypes (e.g. JSON) that can be easily implemented by all supported languages. Then we can work on point-to-point exchangers to allow more types. Once we have a set of point-to-point exchangers that works for all supported languages, we can move those exchangers to the language-independent part so that users can exchange these datatypes without the `to` parameter.
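The point-to-point idea could be sketched as a registry keyed by source language, destination language, and datatype (all names here are illustrative, not an actual SoS design): once an exchanger exists for every language pair, that datatype is a candidate for promotion into the language-independent core.

```python
# Hypothetical point-to-point exchanger registry, keyed by
# (source_lang, dest_lang, datatype name).
EXCHANGERS = {}

def register_exchanger(src, dest, datatype, fn):
    EXCHANGERS[(src, dest, datatype)] = fn

def exchange(value, src, dest):
    """Convert `value` for the (src, dest) language pair, or fail
    loudly if no exchanger has been registered for its type."""
    key = (src, dest, type(value).__name__)
    if key in EXCHANGERS:
        return EXCHANGERS[key](value)
    raise TypeError(f"no exchanger for {key}")

# Example: ship a Python complex number to a language that only
# understands a {"re": ..., "im": ...} mapping.
register_exchanger("python", "R", "complex",
                   lambda z: {"re": z.real, "im": z.imag})
```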
I mean, it is better to start from a small number of datatypes and as many languages as possible than from a large number of datatypes and a small number of languages. Actually, if you google, you can find a ton of formats of the latter kind.
BTW, `export(data)` should support as many datatypes from the source language as possible, regardless of information loss and performance. For example, named tuples and time series can be exported as lists, and dates can be exported as strings or float numbers, because in my opinion having the data transferred is more important than having it transferred in a lossless manner: we have done at least part of the job, and users can always post-process the transferred data (e.g. convert strings to dates).
This however means problems with backward compatibility (e.g. data that was transferred as a string is now a date), which is, as I have said before, a larger problem for dataexchanger than for SoS Notebook.
So along the lines of small core datatypes, JSON is already good for recursively storing atomic types across languages. That only leaves out one (or two) basic types: matrix (or additionally data frames). If as a first pass we support JSON + matrix/dataframes, it is going to satisfy a lot of applications. But matrices can be large, and the only efficient common interface is HDF5. ... So are we back to some HDF5 wrapper?
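For matrices of modest size, a pure-JSON envelope with shape and dtype metadata may already be enough, deferring the HDF5 question to genuinely large data. A sketch under that assumption (the envelope fields are made up for illustration):

```python
import json

def matrix_to_json(rows, filename):
    """Hypothetical JSON envelope for a matrix: store shape and
    element type alongside row-major data so any language can
    reconstruct it without a binary dependency like HDF5."""
    nrow, ncol = len(rows), (len(rows[0]) if rows else 0)
    flat = [x for row in rows for x in row]
    with open(filename, "w") as f:
        json.dump({"type": "matrix", "shape": [nrow, ncol],
                   "dtype": "float", "data": flat}, f)

def matrix_from_json(filename):
    """Rebuild the nested-list matrix from the envelope."""
    with open(filename) as f:
        obj = json.load(f)
    nrow, ncol = obj["shape"]
    it = iter(obj["data"])
    return [[next(it) for _ in range(ncol)] for _ in range(nrow)]
```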
For those we will need some binary format such as Apache arrow / feather. The problem is that they do not have a lot of language bindings yet.
Okay, but Apache Arrow is not recursive (hierarchical), right? Are you saying we can use JSON on top of it?
We do not have to support a recursive format. I mean, we can start from primitive types and arrays and dictionaries of primitive types. Matrices, named arrays, and collections of named arrays (dataframes) of primitive types can be added later. The problem with recursive types is that they might contain unsupported types, and then the entire dataset is not transferable.
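The proposed non-recursive core can be characterized by a simple up-front check (a sketch; the type set is my reading of the proposal, not a settled spec): primitives, plus flat lists and string-keyed dicts of primitives. Anything else is rejected as a whole variable rather than failing partway through a nested structure.

```python
# Hypothetical "core" type set: primitives, plus flat containers.
PRIMITIVES = (bool, int, float, str, type(None))

def in_core(value):
    """Return True if `value` fits the proposed non-recursive core."""
    if isinstance(value, PRIMITIVES):
        return True
    if isinstance(value, list):
        return all(isinstance(v, PRIMITIVES) for v in value)
    if isinstance(value, dict):
        return (all(isinstance(k, str) for k in value)
                and all(isinstance(v, PRIMITIVES)
                        for v in value.values()))
    return False
```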
Oh okay, but a recursive format could be important in some applications. For example, in methods development we sometimes do not know exactly what we should look at, so we dump everything out, typically recursively, in preparation for future needs. Examples include various quantities from every iteration of an EM algorithm. That is why R users, at least, have the tendency to always use recursive structures and `saveRDS` to serialize them altogether. I think we might have to think about what use cases we target.
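The "dump everything" use case naturally produces nested structures. A sketch of what such a recursive per-iteration dump could look like in JSON (the fields and update rules are invented stand-ins, analogous to what an R user might pass to `saveRDS` as one object):

```python
import json

# Hypothetical recursive dump of per-iteration EM quantities.
trace = {"algorithm": "EM", "iterations": []}
theta, loglik = 0.5, -100.0
for it in range(3):
    theta, loglik = theta * 0.9, loglik + 10.0   # stand-in updates
    trace["iterations"].append(
        {"iter": it, "theta": theta, "loglik": loglik})

# One blob holding the whole nested history, like a saveRDS object.
blob = json.dumps(trace)
```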
I think feather is a good model for implementation, although unfortunately feather has not been developed for a while, with many pending requests (including my ticket). Although a common interface around JSON for many languages can be a good start, we could aim higher and base our format on Arrow/feather. I believe they should have decent support for primitive types, even dictionaries, and we just need to add those to feather.
As you can imagine, implementing something in C/C++ and writing bindings for all supported languages can be a lot of work. Without major support, JSON+CSV is a much better starting point.
I understand. From my perspective, if there is recursive support for JSON+CSV, then I can prototype and use it for benchmarking, to help make a compelling case that we need, and can do, better.
Then maybe MessagePack can be a drop-in replacement?
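Because MessagePack's data model is close to JSON's, it could indeed drop in behind the same serialization interface without touching call sites. A sketch, assuming the third-party `msgpack` package is optionally available and falling back to the stdlib `json` module otherwise:

```python
import json

# Hypothetical pluggable serializer: MessagePack drops in behind the
# same dumps/loads interface; nothing upstream needs to change.
try:
    import msgpack                       # third-party, optional

    def dumps(obj):
        return msgpack.packb(obj)

    def loads(buf):
        return msgpack.unpackb(buf)
except ImportError:                      # fall back to JSON bytes
    def dumps(obj):
        return json.dumps(obj).encode()

    def loads(buf):
        return json.loads(buf.decode())
```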
The goal of this project is to provide a simple and uniform data exchange mechanism among scripting languages. Ideally the interface could be `sos_export` and `sos_import` for all languages.
Following the basic design of SoS Notebook's data exchange mechanism, function `sos_export` would:

- if `lan` is supported, write `variable` to `filename` in a format suitable to be imported by `lan`;
- if `lan` is not supported, write `variable` to `filename` in a general format that could be an extension of JSON.

Function `sos_import` would:

- if `filename` is designed for `lan`, read data in a more efficient manner;
- if `filename` is not designed for `lan`, read data from the `JSON`-like format.

So for all supported languages we should define a module or library that
The step-by-step design of the project lies in the fact that