vatlab / SoS-DataBus

A universal data exchanger among scripting languages

Initial interface #2

Open BoPeng opened 6 years ago

BoPeng commented 6 years ago

The goal of this project is to provide a simple and uniform data exchange mechanism among scripting languages. Ideally the interface could be

sos_export(variable, filename, to=lan)

and

sos_import(filename)

for all languages.

Following the basic design of SoS Notebook's data exchange mechanism, the function sos_export would:

  1. If lan is supported, write variable to filename in a format suitable for import by lan.
  2. If lan is not supported, write variable to filename in a general format, which could be an extension of JSON.

The function sos_import would:

  1. If filename was written for lan, read the data in a more efficient manner.
  2. If filename was not written for lan, read the data from the JSON-like format (both functions are sketched below).
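
In Python, a minimal sketch of this dispatch could look like the following; the EXPORTERS/IMPORTERS registries and the extension check are placeholders invented here for illustration, not an existing SoS API:

import json

# Hypothetical registries of language-specific writers and readers; both
# start empty and would be filled in as exchangers are implemented.
EXPORTERS = {}   # destination language -> function(variable, filename)
IMPORTERS = {}   # file extension -> function(filename)

def sos_export(variable, filename, to=None):
    # 1. if lan is supported, use its dedicated writer
    if to in EXPORTERS:
        EXPORTERS[to](variable, filename)
    else:
        # 2. otherwise fall back to the general JSON-like format
        with open(filename, 'w') as f:
            json.dump(variable, f)

def sos_import(filename):
    ext = filename.rsplit('.', 1)[-1]
    # 1. if filename was written for this language, read it efficiently
    if ext in IMPORTERS:
        return IMPORTERS[ext](filename)
    # 2. otherwise read the general JSON-like format
    with open(filename) as f:
        return json.load(f)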

So for each supported language we should define a module or library that can:

  1. Import data from the JSON-like format.
  2. Import data from some sort of native format, which can be the language's own serialization format (e.g. pickle), something defined by SoS, or some datatype-dependent format (such as Feather).
  3. Export data to the JSON-like format.
  4. Export data in a destination-aware, datatype-dependent format.
  5. As a special case, export/import of the language's own native format (e.g. pickle) should be supported (see the sketch after this list).
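
A rough Python sketch of the shape of such a per-language module; the class and method names are illustrative only, not an agreed interface:

import json
import pickle

class DataBusModule:
    """Illustrative per-language module covering the five capabilities."""

    def import_general(self, filename):            # capability 1
        with open(filename) as f:
            return json.load(f)

    def import_native(self, filename):             # capabilities 2 and 5
        with open(filename, 'rb') as f:
            return pickle.load(f)

    def export_general(self, variable, filename):  # capability 3
        with open(filename, 'w') as f:
            json.dump(variable, f)

    def export_for(self, variable, filename, to):  # capability 4
        raise NotImplementedError('destination-aware export comes later')

    def export_native(self, variable, filename):   # capability 5 (pickle here)
        with open(filename, 'wb') as f:
            pickle.dump(variable, f)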

The project can be designed step by step because:

  1. The modules can start with items 1 and 3, and will then work with all other languages, with limited capacity.
  2. Item 5 can be used to provide a uniform interface for within-language serialization.
  3. Items 2 and 4 need collaboration between the sending and receiving languages, so the two sides have to be implemented together.

gaow commented 6 years ago

Regarding how data should be communicated, I have some concerns about committing to to=lan too early. Unlike the case of SoS Notebook, where the flow of logic is linear (we usually create variables in language A and obtain them in B, so to=lan makes sense), consider the context of workflows:

[A_1,B_1]
# some scripts in language A
sos_export(to=?)

[B_2]
# some scripts in language B

[A_2]
# some scripts in language A

Then you see that for workflows A and B we have trouble determining, in the *_1 steps, which format to export to, and the only way out is to use the general format regardless of whether lan is supported.

So this is another reason why destination-aware format export should come last, and why it should share an interface with the more general approach, so that switching between formats requires minimal changes to code.

BoPeng commented 6 years ago

The to parameter is optional by design, so users can always do export(data) to save data in a portable manner (which in the case of SoS Notebook means exporting to the SoS kernel), at the cost of limited datatypes, more information loss, poorer performance, etc.

In theory we could create a really powerful language-independent format, but creating bindings for all supported languages can be a real pain. So in my imagination, we should start from a small "core" set of datatypes (e.g. JSON) that can be easily implemented by all supported languages. Then we can work on point-to-point exchangers to allow more types. Once we have a set of point-to-point exchangers that works for all supported languages, we can move those exchangers to the language-independent part, so that users can exchange these datatypes without the to parameter.
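
A sketch of that growth path; CORE_TYPES, EXCHANGERS, and can_export are hypothetical names for illustration:

# Hypothetical "core" set: datatypes every language binding must handle
# through the JSON-like format, without a to parameter.
CORE_TYPES = (type(None), bool, int, float, str, list, dict)

# Point-to-point exchangers, keyed by (type name, destination language).
# Once a datatype has an exchanger for every supported destination, it can
# be promoted into the core and no longer needs the to parameter.
EXCHANGERS = {}   # e.g. {('DataFrame', 'R'): some_feather_writer}

def can_export(variable, to=None):
    if isinstance(variable, CORE_TYPES):
        return True
    return (type(variable).__name__, to) in EXCHANGERS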

I mean, it is better to start from a small number of datatypes and as many languages as possible than from a large number of datatypes and a small number of languages. Actually, if you google, you can find a ton of formats of the latter kind.

BoPeng commented 6 years ago

BTW, export(data) should support as many datatypes from the source language as possible, regardless of information loss and performance. For example, named tuples and time series can be exported as lists, and dates can be exported as strings or floating-point numbers, because in my opinion having the data transferred is more important than having it transferred losslessly: we will have done at least part of the job, and users can always post-process the transferred data (e.g. convert a string to a date).
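
A sketch of such best-effort coercion in Python; the particular rules shown are examples, not a fixed specification:

import datetime

def coerce(value):
    """Best-effort, possibly lossy conversion to JSON-compatible types."""
    if isinstance(value, datetime.date):
        return value.isoformat()           # date -> string; lossy but usable
    if isinstance(value, tuple):           # named tuples are tuples, too
        return [coerce(v) for v in value]  # tuple -> plain list
    if isinstance(value, dict):
        return {str(k): coerce(v) for k, v in value.items()}
    if isinstance(value, list):
        return [coerce(v) for v in value]
    return value   # assume an already JSON-compatible primitive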

This however means problems with backward compatibility (e.g. data that was transferred as a string is later transferred as a date), which is, as I have said before, a larger problem for the data exchanger than for SoS Notebook.

gaow commented 6 years ago

So along the lines of a small core of datatypes, JSON is already good for recursively storing atomic types across languages. That leaves out only one (or two) basic types: matrices (and additionally data frames). If as a first pass we support JSON + matrices/data frames, it is going to satisfy a lot of applications. But matrices can be large, and the only efficient common interface is HDF5. ... So are we back to some HDF5 wrapper?
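
For illustration, a JSON + HDF5 split might look like this sketch, which uses h5py and a file-naming convention invented here:

import json
import h5py
import numpy as np

def export_mixed(data, basename):
    """Write matrices to <basename>.h5 and everything else to <basename>.json."""
    scalars, matrices = {}, {}
    for key, value in data.items():
        (matrices if isinstance(value, np.ndarray) else scalars)[key] = value
    with open(basename + '.json', 'w') as f:
        json.dump(scalars, f)
    with h5py.File(basename + '.h5', 'w') as f:
        for key, value in matrices.items():
            f.create_dataset(key, data=value)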

BoPeng commented 6 years ago

For those we will need some binary format such as Apache Arrow / Feather. The problem is that they do not have a lot of language bindings yet.

gaow commented 6 years ago

Okay, but Apache Arrow is not recursive (hierarchical), right? Are you saying we can use JSON on top of it?

BoPeng commented 6 years ago

We do not have to support recursive formats. I mean, we can start from primitive types and arrays and dictionaries of primitive types. Matrices, named arrays, and collections of named arrays (data frames) of primitive types can be added later. The problem with recursive types is that they might contain unsupported types, and then the entire dataset becomes non-transferable.
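
A sketch of that restriction: accept a value only if it is a primitive, or a one-level array or dictionary of primitives (the function name is made up here):

PRIMITIVES = (type(None), bool, int, float, str)

def is_transferable(value):
    """Accept primitives and flat (non-recursive) containers of primitives."""
    if isinstance(value, PRIMITIVES):
        return True
    if isinstance(value, list):
        return all(isinstance(v, PRIMITIVES) for v in value)
    if isinstance(value, dict):
        return all(isinstance(k, str) and isinstance(v, PRIMITIVES)
                   for k, v in value.items())
    # nested or unsupported type: refuse up front rather than fail half-way
    return False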

gaow commented 6 years ago

Oh okay, but recursive formats could be important in some applications. For example, in methods development we sometimes do not know exactly what we should look at, so we dump everything out, typically recursively, in preparation for future needs; examples include, for an EM algorithm, various quantities from every iteration. That is why R users, at least, tend to always use recursive structures and saveRDS to serialize them altogether. I think we may have to think about which use cases we target.

BoPeng commented 6 years ago

I think feather is a good model for implementation, although unfortunately feather has not been developed for a while, with many pending requests (including my ticket). Although a common interface around JSON for many languages would be a good start, we could aim higher and base our format on Arrow/Feather. I believe they should have decent support for primitive types, even dictionaries, and we would just need to add those to feather.

BoPeng commented 6 years ago

As you can imagine, implementing something in C/C++ and writing bindings for all supported languages can be a lot of work. Without major support, JSON+CSV is a much better starting point.
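
A minimal sketch of such a JSON+CSV combination; the manifest layout and function name are improvised here:

import csv
import json

def export_json_csv(data, basename):
    """Scalars go into JSON; tabular values (lists of dicts) go into CSV."""
    scalars, tables = {}, {}
    for key, value in data.items():
        if isinstance(value, list) and value and isinstance(value[0], dict):
            tables[key] = value
        else:
            scalars[key] = value
    for key, rows in tables.items():
        path = '%s_%s.csv' % (basename, key)
        with open(path, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
        scalars['__table_' + key] = path   # record the CSV sidecar in the JSON
    with open(basename + '.json', 'w') as f:
        json.dump(scalars, f)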

gaow commented 6 years ago

I understand. From my perspective, if there is recursive support for JSON+CSV, then I can prototype and use it for benchmarking, to help make a compelling case that we need, and can do, better.

Then maybe MessagePack can be a drop-in replacement?
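
If MessagePack fits, the swap from JSON could indeed be nearly mechanical; a sketch using the msgpack-python package (assuming its packb/unpackb interface):

import msgpack

def sos_export(variable, filename):
    with open(filename, 'wb') as f:
        f.write(msgpack.packb(variable))   # plays the role json.dump plays above

def sos_import(filename):
    with open(filename, 'rb') as f:
        return msgpack.unpackb(f.read(), raw=False)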