Proposal: Move to JSON as the primary data interchange format

Aklakan commented 6 years ago

Short story: I think in the future we should move away from the command bytes and messages to defined JSON messages. For now, the proposal is:

Reserve byte code 123 (the open brace symbol '{' in ASCII) for JSON messages.
Update components, documentation and examples to interface with JSON messages
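A minimal sketch of how a receiver could tell the two kinds of message apart once 123 is reserved. It assumes the command byte is the first byte of the payload being inspected; the class and method names are hypothetical and not part of the platform code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical dispatcher; names are illustrative only.
final class CommandDispatchSketch {
    // 123 is the ASCII code of '{', the byte the proposal reserves for JSON.
    static final byte JSON_MARKER = (byte) '{';

    /** Returns a short description of how the message would be routed. */
    static String dispatch(byte[] message) {
        if (message == null || message.length == 0) {
            return "empty";
        }
        if (message[0] == JSON_MARKER) {
            // The marker byte is itself the first character of the JSON text,
            // so the whole array can be decoded and handed to a JSON parser.
            String json = new String(message, StandardCharsets.UTF_8);
            return "json:" + json;
        }
        // Any other leading byte keeps its existing meaning as a command ID.
        return "command:" + (message[0] & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(dispatch("{\"taskId\":\"t1\"}".getBytes(StandardCharsets.UTF_8)));
        System.out.println(dispatch(new byte[] {17, 0, 1}));
    }
}
```

Because the marker doubles as the opening brace, no extra framing byte is needed and existing command IDs are left untouched.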

Long story: The nice thing about AMQP is that any data can be sent to any component. As the core code is Java, we will of course implement all the utility functions needed to put our messages onto the channel and read them back again.

However, if we actually wanted to take advantage of AMQP and connect, e.g., a Python or JavaScript component, it would be a real pain to reimplement all the binary encoding again. Also, even small changes to the protocols will most likely break everything.

For example, this is what the AbstractSystemAdapter writes out:

    protected void sendResultToEvalStorage(String taskIdString, byte[] data) throws IOException {
        byte[] taskIdBytes = taskIdString.getBytes(Charsets.UTF_8);
        int capacity = 8 + taskIdBytes.length + data.length;
        ByteBuffer buffer = ByteBuffer.allocate(capacity);
        buffer.putInt(taskIdBytes.length);
        buffer.put(taskIdBytes);
        buffer.putInt(data.length);
        buffer.put(data);
        this.sender2EvalStore.sendData(buffer.array());
    }

It's already a pain not being able to do GSON.parse(...) on that, but now think about a JavaScript developer wanting to interface with it. As everything is JSON in this world, an attempt to use JSON.parse(msg) will fail miserably!
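To make the reimplementation cost concrete, here is a sketch of the reader that every non-Java client would have to port by hand. It mirrors the layout written by sendResultToEvalStorage above (length-prefixed task ID, then length-prefixed data). The class name and the encode/decode helpers are hypothetical. Note that ByteBuffer writes integers in big-endian order by default, which a client in another language would also have to match.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical holder for a decoded result message.
final class ResultMessageSketch {
    final String taskId;
    final byte[] data;

    ResultMessageSketch(String taskId, byte[] data) {
        this.taskId = taskId;
        this.data = data;
    }

    // Same layout as the method above: [int length][taskId][int length][data].
    static byte[] encode(String taskId, byte[] data) {
        byte[] taskIdBytes = taskId.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocate(8 + taskIdBytes.length + data.length);
        buffer.putInt(taskIdBytes.length);
        buffer.put(taskIdBytes);
        buffer.putInt(data.length);
        buffer.put(data);
        return buffer.array();
    }

    // The inverse operation; field order and integer width must match exactly.
    static ResultMessageSketch decode(byte[] message) {
        ByteBuffer buffer = ByteBuffer.wrap(message);
        byte[] taskIdBytes = new byte[buffer.getInt()];
        buffer.get(taskIdBytes);
        byte[] data = new byte[buffer.getInt()];
        buffer.get(data);
        return new ResultMessageSketch(new String(taskIdBytes, StandardCharsets.UTF_8), data);
    }

    public static void main(String[] args) {
        ResultMessageSketch decoded = decode(encode("task-1", new byte[] {1, 2, 3}));
        System.out.println(decoded.taskId + " carries " + decoded.data.length + " bytes");
    }
}
```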

Instead, if we recommended JSON-based protocols, everything would get more readable, understandable, and overall more developer- and human-friendly.

    {
        "taskId": "faceted-benchmark-task-1-3",
        "timestamp": "2017-09-21T15:10:34Z",
        "whatever": { "data": [ "one", "wishes", "to", "exchange" ] }
    }

An open issue is how to handle streams of data where the JSON overhead might be prohibitive, e.g. the transfer of a ZIP archive generated by the data generator. For example, the existing BSBM benchmark creates a set of files that is required as input to its test driver, which corresponds to our task generator + system adapter.
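A rough sketch of the size overhead, assuming binary data such as a ZIP archive were embedded in a JSON message as a Base64 string. That is one common approach, not something this proposal prescribes. The class and method names are hypothetical.

```java
import java.util.Base64;

// Hypothetical helper that estimates the cost of carrying binary data in JSON.
final class Base64OverheadSketch {
    // Padded Base64 emits 4 characters for every started group of 3 input bytes.
    static int encodedLength(int rawLength) {
        return 4 * ((rawLength + 2) / 3);
    }

    public static void main(String[] args) {
        byte[] payload = new byte[3_000_000]; // stand-in for a generated archive
        String encoded = Base64.getEncoder().encodeToString(payload);
        System.out.println(payload.length + " raw bytes -> " + encoded.length() + " Base64 characters");
    }
}
```

Under that assumption the payload grows by roughly a third, before counting the time spent encoding and decoding it.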

Aklakan commented 6 years ago

Another reason to use JSON: I found this gem in AbstractEvaluationStorage. The binary message formats for expected and actual results look almost the same:

Expected results:

    public void handleData(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data);
        String taskId = RabbitMQUtils.readString(buffer);
        byte[] taskData = RabbitMQUtils.readByteArray(buffer);
        long timestamp = buffer.getLong();
        receiveExpectedResponseData(taskId, timestamp, taskData);
    }

Actual results:

    boolean temp = false;
    final boolean receiveTimeStamp = temp;

    public void handleData(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data);
        ...
        long timestamp = receiveTimeStamp ? buffer.getLong() : System.currentTimeMillis();
    }

Now, if you write with the format for actual results and read with the code for expected results, you will get a BufferUnderflowException because the timestamp is not there. The problem would be much easier to pinpoint if the code were myJsonObject.get("timestamp").getAsLong().
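The mismatch can be reproduced with a small stand-alone sketch. It does not use RabbitMQUtils; the write and readTimestamp helpers are hypothetical and only assume the length-prefixed layout shown earlier, with an optional trailing timestamp.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical reproduction of the writer/reader mismatch.
final class TimestampMismatchSketch {
    // Writes [int length][taskId][int length][data] and, if given, [long timestamp].
    static byte[] write(String taskId, byte[] data, Long timestamp) {
        byte[] id = taskId.getBytes(StandardCharsets.UTF_8);
        int size = 8 + id.length + data.length + (timestamp == null ? 0 : Long.BYTES);
        ByteBuffer buffer = ByteBuffer.allocate(size);
        buffer.putInt(id.length);
        buffer.put(id);
        buffer.putInt(data.length);
        buffer.put(data);
        if (timestamp != null) {
            buffer.putLong(timestamp);
        }
        return buffer.array();
    }

    // Reads with the "expected results" layout, which always expects a timestamp.
    static long readTimestamp(byte[] message) {
        ByteBuffer buffer = ByteBuffer.wrap(message);
        int idLength = buffer.getInt();
        buffer.position(buffer.position() + idLength);   // skip the task ID
        int dataLength = buffer.getInt();
        buffer.position(buffer.position() + dataLength); // skip the payload
        return buffer.getLong(); // throws if fewer than 8 bytes remain
    }

    public static void main(String[] args) {
        byte[] withoutTimestamp = write("task-1", new byte[] {1, 2, 3}, null);
        try {
            readTimestamp(withoutTimestamp);
        } catch (BufferUnderflowException e) {
            // The exception carries no hint about which field is missing.
            System.out.println("Reader failed: " + e);
        }
    }
}
```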

MichaelRoeder commented 6 years ago

[HOBBIT] Github issues

MichaelRoeder commented 6 years ago

In general, it is fine to use JSON for the internal communication of components. Benchmark developers are free to use whatever communication they want. However, we have already put some effort into our current benchmark implementations and shouldn't throw away everything we have achieved :wink:.

Advantages of introducing JSON:

ease of use
extensibility (although most of our queues already support this)

However, please note that using JSON might require additional effort. Imagine a benchmark that uses another serialization that is not compatible with JSON but uses several of the same symbols. The JSON library might have to analyze all the data and escape the symbols that are reserved for JSON. This additional effort is not necessary in our current byte-array-based implementation.

For a further discussion, we have to distinguish the different queues.

Command Queue

It has to be taken into account that the command queue broadcasts all messages. A major disadvantage of introducing JSON is its conflict with the command queue addressing scheme (which was introduced to support #114). When using JSON and sending large amounts of data, the complete message would have to be parsed (since JSON does not define an order of elements, AFAIK) before the component can decide whether it should process the data or ignore the message.
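The advantage of a fixed binary prefix can be sketched as follows. The layout used here (a length-prefixed session ID, then a command byte, then the payload) is an assumption made for illustration and may differ from the actual addressing scheme; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical filter that inspects only the address prefix of a broadcast message.
final class AddressPrefixSketch {
    // Assumed layout: [int length][session ID bytes][command byte][payload...]
    static byte[] build(String sessionId, byte command, byte[] payload) {
        byte[] id = sessionId.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES + id.length + 1 + payload.length);
        buffer.putInt(id.length);
        buffer.put(id);
        buffer.put(command);
        buffer.put(payload);
        return buffer.array();
    }

    // Decides whether to process the message without touching the payload.
    static boolean isAddressedTo(byte[] message, String sessionId) {
        byte[] expected = sessionId.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.wrap(message);
        if (buffer.remaining() < Integer.BYTES) {
            return false;
        }
        int length = buffer.getInt();
        if (length != expected.length || buffer.remaining() < length) {
            return false;
        }
        byte[] actual = new byte[length];
        buffer.get(actual);
        return Arrays.equals(actual, expected);
    }

    public static void main(String[] args) {
        byte[] message = build("session-a", (byte) 17, new byte[1_000_000]);
        System.out.println(isAddressedTo(message, "session-a"));
        System.out.println(isAddressedTo(message, "session-b"));
    }
}
```

The check reads only the first few bytes regardless of payload size, whereas a JSON object gives no guarantee that an address field appears before the bulk of the data.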

However, reserving the code 123 for future extensions is a very good idea. I updated the wiki page at https://github.com/hobbit-project/platform/wiki/Command-Queue#predefined-command-ids

Task related queues

The queues TG-SA, TG-ES and SA-ES have a predefined structure that could have been defined using JSON as well.

However, the comment regarding the ES ignores how the receiveTimeStamp flag is set. It is set by the benchmark controller when creating the ES. That is necessary because otherwise a system could cheat very easily by sending its own timestamp to the ES.

Evaluation queues

The EM-ES and ES-EM queues could have been implemented using JSON as well.

Other queues

For other queues, there is no predefined structure and JSON or any other data can be sent.

@Aklakan @yamalight @denkv comments?

yamalight commented 6 years ago

@MichaelRoeder non-breaking way (i.e. reserving 123) seems like a pretty great idea. if doing so really doesn't break any old code/models - don't see any downsides 👍
