openethereum / parity-ethereum

The fast, light, and robust client for Ethereum-like networks.

JSON-RPC over IPC should envelope messages. #4647

Closed MicahZoltu closed 7 years ago

MicahZoltu commented 7 years ago

Cross-posted at https://github.com/ethereum/go-ethereum/issues/3702, since this is a systemic ecosystem issue.

When receiving JSON-RPC requests over WS or HTTP, the underlying protocol ensures that messages are well separated and that the connection can gracefully recover from bad bytes put onto the wire. While domain sockets/named pipes deal with ordering and guaranteed arrival, they do nothing to separate messages. This leads to a cascade of problems. The first is that a developer wanting to use the IPC interface needs to write code that can identify the end of one message and the beginning of another. The only way to deal with this correctly is to write or use a streaming JSON parser that can stop when it reaches the end of an object; unfortunately, most off-the-shelf parsers will report malformed JSON if you try to parse a byte stream that has multiple separate JSON payloads concatenated together (such as the JSON-RPC over IPC protocol).

Even if the user writes a fully JSON-compliant streaming parser, it also needs to deal with the fact that JSON is UTF-8, which means a multi-byte character may cross buffer boundaries. So not only may messages cross buffer boundaries, individual characters may as well. This is a solvable problem, but it introduces a lot of complexity for the developer to deal with.
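As an illustration only (a minimal Node.js sketch, not code from any client discussed here): a 3-byte UTF-8 character split across two reads is mangled by naive per-chunk decoding, while Node's built-in StringDecoder buffers the partial sequence until the remaining bytes arrive.

var StringDecoder = require("string_decoder").StringDecoder;

var payload = Buffer.from('{"result":"☃"}', "utf8");
var first = payload.subarray(0, 12);  // ends one byte into the 3-byte "☃"
var second = payload.subarray(12);

// Naive per-chunk decoding produces replacement characters:
console.log(first.toString("utf8") + second.toString("utf8"));

// A stateful decoder holds back the incomplete code point until it is finished:
var decoder = new StringDecoder("utf8");
console.log(decoder.write(first) + decoder.write(second));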

Most critically, I don't believe there is a recoverable way to write an end-to-end JSON parser without some kind of enveloping. Imagine someone sent this over HTTP, WS, and IPC:

{ jsonrpc: "2.0", id: 5, method: "truncat { jsonrpc: "2.0", id: 6, method: "complete", params: "}{" } With HTTP and WS, the channel would remain open and the receiver would be able to respond with an appropriate error message indicating one of the messages was malformed. For IPC however, there is no correct way to deal with this. As far as the receiver is concerned, it is in the middle of parsing a JSON string until it receives the next quote, at which point it will see some invalid JSON following it. Unfortunately, the parser doesn't know where to read up to in order to "skip to the next message" and start over. It could try looking for something like }{, but as you can see in this example the }{ would be in the middle of another message (inside a string) but the parser doesn't know that, so it would fail again. Hypothetically you could keep repeating this process (look for some sort of boundary character sequence) until you find a valid message, but that is incredibly complex and I'm not entirely certain there isn't an edge case that would

Proposal

I propose that a version 2 (or are we up to 3?) of JSON-RPC over IPC be built that envelopes messages with a sentinel header and a length prefix. The length prefix would allow parsers to simply wait until they have the whole message before trying to process any JSON, and if the JSON or UTF-8 inside the envelope is invalid they can throw away the whole message without any trouble. The sentinel would be a special byte sequence that would need to be escaped if found anywhere in the payload (it could be as little as one byte, but that would require more frequent escaping) and would allow the channel to recover cleanly from malformed input by simply reading until the next sentinel.
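To make the length-prefix idea concrete, here is a hypothetical sketch of the reader side (this is not an existing protocol; it assumes a plain 4-byte big-endian length prefix and leaves the sentinel/escaping details out):

// Each frame: [4-byte big-endian length][UTF-8 JSON payload]. The reader never
// has to parse JSON to find message boundaries, and a bad payload can be
// discarded without losing framing.
function readEnvelopes(socket, onMessage, onError) {
  var pending = Buffer.alloc(0);
  socket.on("data", function (chunk) {
    pending = Buffer.concat([pending, chunk]);
    while (pending.length >= 4) {
      var length = pending.readUInt32BE(0);
      if (pending.length < 4 + length) break;     // wait for the full frame
      var body = pending.subarray(4, 4 + length);
      pending = pending.subarray(4 + length);
      try {
        onMessage(JSON.parse(body.toString("utf8")));
      } catch (e) {
        onError(e);                               // drop only this one frame
      }
    }
  });
}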

Re: sentinel, is a 0 byte valid anywhere in a JSON string? If not, then the null byte could be used as the sentinel without need for escaping.

Notes

There are several Ethereum JSON-RPC over IPC clients that I know of at this point (Geth, Parity, Web3, Ethrpc) and so far I haven't seen anyone fully solve this problem. I'm not entirely certain what Geth is doing since it appears to be relying on something from go-lang, but I would be mildly surprised if it correctly deals with all of the problems associated with un-enveloped IPC payloads.

MicahZoltu commented 7 years ago

It should be noted that this problem makes it much easier for attackers to write injection attacks, since writing a good parser is so difficult. The current web3.js implementation, for example, is open to an injection attack if the attacker can get a }{ into the payload of a message (which I suspect is not hard).
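To illustrate the failure mode (this is a deliberately naive stand-in, not web3.js's actual code):

// Naive dechunker that splits concatenated JSON on "}{":
function naiveDechunk(data) {
  return data.replace(/\}\{/g, "}|--|{").split("|--|");
}

// A single valid message whose string payload contains "}{" gets split in two:
var message = JSON.stringify({ jsonrpc: "2.0", id: 1, result: "injected }{ text" });
console.log(naiveDechunk(message));
// [ '{"jsonrpc":"2.0","id":1,"result":"injected }', '{ text"}' ]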

maciejhirsz commented 7 years ago

Re: sentinel, is a 0 byte valid anywhere in a JSON string?

No, only specific whitespace characters are allowed between tokens (space, horizontal tab, line feed and carriage return), while in strings it has to be escaped as \u0000.

I like the idea of using 0x00 as a sentinel followed by a length prefix; the only thing missing is the format of the length prefix. I reckon a Protocol Buffers varint limited to 64-bit (32-bit for JS?) unsigned integers at most would make a good candidate, with implementations readily available in any language.
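For reference, a minimal sketch of the kind of varint being suggested (unsigned, 7 bits per byte, high bit meaning "more bytes follow"; the framing around it is left out):

function encodeVarint(value) {
  var bytes = [];
  do {
    var low = value % 128;                 // low 7 bits
    value = Math.floor(value / 128);       // avoids 32-bit bitwise truncation
    bytes.push(value > 0 ? low | 0x80 : low);
  } while (value > 0);
  return Buffer.from(bytes);
}

function decodeVarint(buffer, offset) {
  var result = 0, factor = 1, i = offset;
  while (true) {
    var byte = buffer[i++];
    result += (byte & 0x7f) * factor;
    if ((byte & 0x80) === 0) return { value: result, next: i };
    factor *= 128;
  }
}

console.log(encodeVarint(300));                           // <Buffer ac 02>
console.log(decodeVarint(Buffer.from([0xac, 0x02]), 0));  // { value: 300, next: 2 }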

A few details to take care of in the parser:

karalabe commented 7 years ago

Cross posting from the Go issue:

As I mentioned in the chat channel, I don't see how message corruption can occur. HTTP, WS are TCP based, IPC is backed by unix sockets or windows pipes; all take care of data chunking, ordering, retransmits, etc. Client and server side you should have a single thread reading a data stream and similarly either a single thread writing it or multiple ones properly synchronized. There is literally no way to corrupt a data stream.

Regarding the proposal, the data transport we are currently using is a standard JSON stream. It is a well understood and commonly used protocol. I'm certain any decent programming language is able to process it either out of the box or there are a number of libraries to do it. Converting it into a custom stream would only worsen the scenario, with everyone needing to implement our "new standard".

MicahZoltu commented 7 years ago

Continuing the conversation here rather than at Go-Ethereum since that issue was closed. Some of the quotes here are from that, but I want to consolidate the conversation. cc @bas-vk so all participants are here

In the Ethereum world this is not an issue because data is transferred hex encoded. As long as the client uses ascii for the id, jsonrpc and method fields you are fine with a naive implementation that assumes 1 char equals 1 byte.

All current documentation I have found says that communication with nodes is standard JSON-RPC, which is well defined and specced to be UTF-8. If this is not true, it should be well documented everywhere that communication with an Ethereum node is a subset of JSON-RPC, and it should be documented what exactly the constraints are. It is a fatal flaw to assume that those who come after you will understand the constraints you were operating under; someone may decide not to hex-encode a payload, which is reasonable and sound (hex encoding is terrible for bandwidth). Also, there are already at least a couple of messages that don't comply with this assertion, including the net_version response. While this isn't a particular attack vector, it does mean that the hex-encoding logic cannot be written at the protocol/transport layer of your code and the developer needs to ensure that every future message independently obeys this rule.

Most clients that use IPC will probably use something like Mist that takes care of these situations (Mist has a custom "dechunker").

As I brought up with them, Mist's dechunker has bugs, as does the web3.js client. Their custom dechunker may work against Geth, but it doesn't work against all standards-compliant JSON serializers. They can fix these bugs one at a time as they are found/noticed and new clients come out, but without a standard/spec to follow, authors of clients are going to end up coding to "something that works with Geth/Parity" rather than "something that will work with future Ethereum nodes". Also, it is my understanding that they already ran into such a bug where they needed to handle newlines between messages but weren't. This is the problem when you are writing your own tokenizer/parser: it is so easy to get it wrong.

There is literally no way to corrupt a data stream.

Sure, once the bytes are on the wire there are guarantees by the transport layer that the bytes will arrive on the other side in order. However, a good API should be able to handle clients that have bugs and aren't written perfectly. It should be able to return useful error messages to the client when it does something wrong and it should be able to recover from errors cleanly.

applications that have access to a good streaming decoder.

the data transport we are currently using is a standard JSON stream. It is a well understood and commonly used protocol. I'm certain any decent programming language is able to process it either out of the box or there are a number of libraries to do it.

Clients that want to use this will need to write their own encoders/decoders instead of using some default library.

This is my biggest contention with the arguments against this proposal. Just because something has a line item reference on a Wikipedia page doesn't mean it is a standard that one should follow. In my research, go-lang is the only thing that supports this "streaming JSON standard" you are talking about. I couldn't find anything for Node and it appears Parity (Rust) had to roll their own as well. I have worked with a lot of JSON over the years in a number of languages and I have never seen "streaming JSON" as a thing that sane people do before. I would be very surprised if something this hard to code against made it through a good RFC process. It is possible I missed something, but these results suggest that everyone except Go developers have to write a custom parser and because of all of the problems I mentioned above, writing such a thing is non-trivial. Parity has effectively had to write its own JSON tokenizer which will get its channel into a bad state if a malformed message shows up on the wire and web3.js and Mist both have bugs in their implementation because they are doing a simple regex to try to guess message boundaries.

Here are some first-page Google results when searching for streaming JSON:
http://stackoverflow.com/questions/36589663/how-to-parse-a-string-of-concatenated-json-in-browser
http://stackoverflow.com/questions/38833020/how-can-i-split-a-string-containing-n-concatenated-json-string-in-javascript-nod/38833262
Notice that none of the answers actually solve the problem well and all of them have bugs/problems.


My hope is that the connections to ethereum nodes can be based on well defined standards, rather than "do exactly what geth does, and assume they won't change, and go read go-lang source code to figure out exactly what the constraints are, and also understand some other high-level constraints that aren't documented anywhere."

MicahZoltu commented 7 years ago

For reference:

Parity de-concatenator: https://github.com/ethcore/json-ipc-server/blob/master/src/validator.rs#L21-L64
web3.js de-concatenator: https://github.com/ethereum/web3.js/blob/develop/lib/web3/ipcprovider.js#L84-L89
Mist de-concatenator: https://github.com/ethereum/mist/blob/master/modules/ipc/dechunker.js#L24-L32
ethrpc de-concatenator (doesn't actually have one; trying to build one is what led to this issue): https://github.com/AugurProject/ethrpc/blob/master/index.js#L367-L384

karalabe commented 7 years ago

The spec was defined by the JavaScript web3 and agreed to by all three origin implementations. It was not led by Geth. So the fact that Geth can do it properly but you, God forbid, need to write a parser shouldn't be blamed on us.

The hex encoding protocol was documented and standardized on our wiki page before any implementation existed. Don't say someone coming later doesn't have the info, because everything's there. The hex was again something that JavaScript needed, not enforced by Geth. We just play by the standard defined.

There are a number of streaming JavaScript parsers, look them up on npm or Google. The fact that someone decided to roll their own as opposed to using an existing library is ludicrous.


karalabe commented 7 years ago

And saying that you couldn't find a streaming json npm module is a flat out lie https://www.npmjs.com/search?q=streaming+json


karalabe commented 7 years ago

https://github.com/rust-lang-nursery/rustc-serialize


MicahZoltu commented 7 years ago

Going over the results of the NPM query you linked, one at a time for the first full page:

https://www.npmjs.com/package/streaming-json-stringify: doesn't handle concatenated JSON. It is for parsing a single valid JSON object/array as it is received rather than all at once, similar to a SAX parser for XML.
https://www.npmjs.com/package/jsonparse: appears to be the same thing, a SAX-like parser.
https://www.npmjs.com/package/ldjson-stream: a line-delimited parser, not a concatenated-JSON parser. This style of streaming JSON breaks if the JSON has any newlines in the payload (e.g., valid formatted JSON).
https://www.npmjs.com/package/ndjson: also a line-delimited parser.
https://www.npmjs.com/package/pino: a logger, not sure why it is in the search results.
https://www.npmjs.com/package/level: a LevelDB wrapper, not sure why it is in the results.
https://www.npmjs.com/package/collect-json: not a streaming parser.
https://www.npmjs.com/package/papaparse: a CSV parser.
https://www.npmjs.com/package/json-stream: a line-delimited parser.
https://www.npmjs.com/package/bfj: unclear what exactly this is; no concrete examples in the docs or any tests that were immediately obvious.


https://github.com/rust-lang-nursery/rustc-serialize appears to just be a JSON deserializer; it doesn't appear to do anything with regard to newline-delimited or concatenated JSON streams. Perhaps I am overlooking something in the documentation?

tinybike commented 7 years ago

FWIW, I encountered these issues when I tried to write ethrpc's IPC implementation originally (about a year ago now) -- I was also unable to find any good JavaScript library solutions. I agree that IPC messages should be enveloped. Furthermore, I think that the appropriate time to make a change like this is now, while there are only a couple widely-used clients (geth and parity).

And saying that you couldn't find a streaming json npm module is a flat out lie

@karalabe Which module, specifically, deals with concatenated streaming JSON? I browsed those links and did not find one. Also, can we please avoid accusations of lies etc.? Presumably, the only interest anyone here has is in future-proofing Ethereum's IPC / making it work as well as possible.

MicahZoltu commented 7 years ago

@karalabe I'm assuming you are referring to this documentation? https://github.com/ethereum/wiki/wiki/JSON-RPC#hex-value-encoding

Unfortunately it starts with "at present", implying that in the future other serialized data types may be sent. This is the kind of problem I'm warning against: future developers may decide that hex value encoding is sub-optimal (it largely is for string data) and decide to add a third datatype that is a UTF-8 encoded string. This would be a very reasonable course of action without knowledge of the injection attack vector. IMO, robust code should make it hard for future developers to accidentally create injection attacks, not rely on future developers being knowledgeable and aware of all attack vectors. If you are intentionally adding code or a segment of a spec to protect against attack, then you should be explicit about why the code/spec exists. At the least, the documentation should state that values are all hex encoded to protect against injection attacks, similar to how URLs are URL-encoded.

Also, if hex encoding is required of all values passed, then it should be done universally, and special cases like net_version, web3_clientVersion and shh_version should be deprecated. This allows the security code (encoding/decoding) to live in one place in the code base near the wire rather than having to be re-implemented for every method producer/consumer. This reduces the chance that future developers will break the pattern or accidentally do it wrong. Having exceptions to the rule already in the codebase further increases the likelihood that future developers will not follow it. They will see net_version and think, "apparently I can safely ignore the encoding section, it must be out of date".

Finally, right at the top of that page it clearly says:

It uses JSON (RFC 4627) as a data format

A security-conscious developer looking to write robust code that interops with Geth/Parity will read this and attempt to implement an RFC 4627-compliant parser. This will result in failure for all of the reasons above. If Ethereum wants to keep the constraints you have laid out, please at least document them at the top of the wiki and make it clear that it isn't an RFC 4627-compliant data format but rather a subset format. If you like, I can try to make these changes to the documentation, should this proposal fail for Geth and Parity (the two primary clients).

By the way, is the whole "concatenated messages when sent over IPC" behavior documented anywhere? If not, I think it should be; if so, can someone please point me at these docs? I figured out how to talk over IPC via trial and error and by reading source code, because I couldn't find any documentation.


In summary: coding a client to-spec (rather than to-Geth or to-Parity) at the moment is incredibly hard. I recommend either fixing the spec to reflect reality or fixing reality to match the spec.

karalabe commented 7 years ago

Please read https://en.m.wikipedia.org/wiki/Recursive_descent_parser if you are unfamiliar with how parsers work or how one can be implemented to correctly detect the end of a JSON message from an infinite data stream.

UTF-8 data streams are similarly not a problem; you just have to work with runes, not bytes.


tinybike commented 7 years ago

@karalabe AFAIK everyone who has attempted to implement a client for geth's IPC agrees that some form of enveloping is needed. (@frozeman / Parity crew, please chime in if this isn't accurate.) In other words: you are building a product for developers, all of whom are telling you that a feature in your product is broken. Your response cannot seriously be to say, "it's working, you all just don't know how parsers work".

@obscuren: does @karalabe represent the geth development team's view on this? If so, I guess this is a lost cause; we'll drop support for IPC and just use websockets.

MicahZoltu commented 7 years ago

Please read https://en.m.wikipedia.org/wiki/Recursive_descent_parser if you are unfamiliar with how parsers work or how one can be implemented to correctly detect the end of a json message from an infinite data stream.

Sorry if I misrepresented my concern here; the problem isn't that it is impossible to write a JSON parser that works for valid input sequences. My issues are:

  1. It is not easy (to do right).
  2. Concatenated JSON is not common (libraries are hard to find and/or buggy).
  3. Even if you do it right, it does not self-recover from invalid data on the stream (buggy clients/servers).

The UTF-8 problem is similarly solvable, but it greatly complicates the matter, since you can't just take the byte stream and treat it like a string; your bytes-to-UTF-8 conversion needs to happen inside the streaming processor.

Again, solvable problems, but hard to get right (as seen by all current implementations except Geth, which is doing whatever Go does under the hood).

bas-vk commented 7 years ago

To get a better understanding of the problem I tried to create a JavaScript client and came up with this snippet. I'm not a JS developer, but I have a hard time understanding what the problem actually is. So far, all the languages I have worked with have had some kind of stream encoder/decoder support, either through their standard library or through some external library.

I can see the benefit of delimiters, a message size, or something similar in case you don't have such tools available. But going in that direction means that all platforms that do have support for JSON stream decoders won't be able to use the IPC socket directly but will need to do something special. I'm pretty sure that will make other developers unhappy.

In other words: you are building a product for developers, all of whom are telling you that a feature in your product is broken.

I also build DApps that use the IPC socket and I have never thought of this feature as broken.

frozeman commented 7 years ago

I do agree with @MicahZoltu here. The current IPC streaming is error-prone and not clearly defined. That's also the reason why I had to come up with this quirk: https://github.com/ethereum/web3.js/blob/develop/lib/web3/ipcprovider.js#L84-L89

Together with @bas-vk I implemented the IPC, and I do remember bringing up the possibility of properly sending END characters, though I then came up with that workaround and we never looked at it again.

@bas-vk The issue is that when you receive a JSON object which is larger than the buffer of the socket, the JSON comes in chunks. Then receiving and parsing becomes problematic. That's the reason why I had to come up with that dechunker. You probably never encountered that with your snippet, as you don't receive large JSON results. That's easy to test by polling for 100 created filters.

So I do agree that this can become an issue and we should solve it. As you can see, all JSON streaming libs use \n as a separator, which is a valid way to separate the incoming JSON.

So my proposal is that we end JSON objects with \n\r, which Mist has already supported for 8 versions and which I think is already the case in geth. BUT we need to make sure that any \n inside a result object now or in the future (in method results AND notifications) is properly escaped as \\n or \\r.

This would give us backwards compatibility in Mist, is a small change, and would prevent future hacks from breaking the stream. At the same time it would reduce the parsing Mist and others have to do with the current try-to-parse implementation.

karalabe commented 7 years ago

@frozeman Did you try the snippet @bas-vk sent you? Or are you assuming the library will choke just because you didn't manage to implement it properly?

karalabe commented 7 years ago

@frozeman / Parity crew, please chime in if this isn't accurate.

Are you trying to solve your problem here, or creating a loud enough echo chamber to silence legitimate solution attempts?

@obscuren: does @karalabe represent the geth development team's view on this? If so, I guess this is a lost cause; we'll drop support for IPC and just use websockets.

Power games and threats won't help your case, on the contrary, it will just piss people off from even trying to help you and explain what is wrong with your approach.


This Go server-side program gradually sends larger and larger JSON messages in a single stream, starting from about 32KB and going up by a bit more than 32KB in each iteration. It stops when a single JSON message is larger than 32MB. That is clearly above any possible protocol chunking that may occur.

package main

import (
    "encoding/json"
    "fmt"
    "net"
)

func main() {
    listener, err := net.Listen("unix", "socket.ipc")
    if err != nil {
        return
    }
    defer listener.Close()

    for {
        conn, err := listener.Accept()
        if err != nil {
            fmt.Println("Listen failed", err)
            return
        }
        defer conn.Close()

        out := json.NewEncoder(conn)

        // Send out JSON messages gradually growing from a few bytes to over 32 megabytes
        msg := struct{ Items []string }{}
        for batch := 1; batch < 1000; batch++ {
            for i := 0; i < 1000; i++ {
                msg.Items = append(msg.Items, "0123456789abcdef0123456789abcdef")
            }
            if err := out.Encode(msg); err != nil {
                fmt.Println("Encode failed", err)
                break
            }
        }
    }
}

This program, adapted from @bas-vk's code, simply reads input JSONs one after the other, emitting the number of items it sees.

var oboe = require('oboe')
var net = require('net');

var client = net.createConnection("socket.ipc");

client.on("connect", function() {
  oboe(client).done(function(msg) {
    console.log(msg.Items.length);
  })
});

The output?

$ nodejs client.js
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
...
696000
697000
698000
699000
700000
// killed because I got bored of it

@frozeman @MicahZoltu @tinybike Pray tell me which part of the above JavaScript is so horribly convoluted, that you decide to roll your own parser, and then so loudly argue that it's too complicated? The entire parsing is literally one single line of code.

frozeman commented 7 years ago

OK, I had a call with the parties involved here on the Go team, and we came to the conclusion that the issue is a non-issue.

Because they JSON-marshal everything properly, injections which split the JSON aren't possible. See: https://play.golang.org/p/DHsCSo9ZgN

This hopefully is also the case for parity.

On the contrary, @bas-vk found this great JSON parser library, which parses character by character and completes objects that way, which is way better than my dechunker.

I wholeheartedly recommend using one of these libraries: https://oboejs.com or the SAX-style parser https://github.com/dscape/clarinet

holiman commented 7 years ago

This style of streaming JSON breaks if the JSON has any newlines in the payload (e.g., valid formatted JSON).

Just to clarify here: there can be no newlines in a JSON payload. See the specification: they are encoded.

If anyone still has issues with json parsing, I'll gladly help out with any questions. Hit me up over Gitter!

If there are any non-compliant JSON encoders (as in, if Geth or Parity or cpp-ethereum or whatever spits out incorrect JSON), please let me know; I'd gladly pursue that, since it's a de facto flaw and possibly a vulnerability.

maciejhirsz commented 7 years ago

@holiman To clarify the clarification: they have to be escaped in strings, but if you follow the link to the ECMA-404 PDF you will find the exact definition of whitespace:

The whitespace characters are: character tabulation (U+0009), line feed (U+000A), carriage return (U+000D), and space (U+0020)

So yes, newlines can in fact occur in a valid JSON payload, though that only happens if you do a pretty-print of some kind.
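For example, pretty-printed output is still valid JSON and contains raw line feeds between tokens:

console.log(JSON.stringify({ jsonrpc: "2.0", id: 1, result: "0x1" }, null, 2));
// {
//   "jsonrpc": "2.0",
//   "id": 1,
//   "result": "0x1"
// }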

holiman commented 7 years ago

So yes, newlines can in fact occur in a valid JSON payload, though that only happens if you do a pretty-print of some kind.

Yes, I was not clear enough. By "JSON payload" I meant within a string element in a JSON payload.

tinybike commented 7 years ago

Pray tell me which part of the above JavaScript is so horribly convoluted, that you decide to roll your own parser, and then so loudly argue that it's too complicated? The entire parsing is literally one single line of code.

This is a good starting point. Thank you for the code example. I'm not sure giant arrays containing the same 16 bytes over and over again makes for a particularly representative or challenging test case though. Let's try the same simple test again, just with more data; if it works, then we can try adding some actual variation to the data as well. Here's a real RPC request to geth, which we make thousands of times every day as part of Augur's initial markets loading:

curl -X POST --data '{"id":19,"jsonrpc":"2.0","method":"eth_call","params":[{"from":"0x05ae1d0ca6206c6168b42efcd1fbe0ed144e821b","to":"0xcece47d6c0a6a1c90521f38ec5bf7550df983804","data":"0x43fd920300000000000000000000000000000000000000000000000000000000000f69b50000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000","gas":"0x2fd618","timeout":"0x3a980"},"latest"]}' https://eth9000.augur.net

Here's the modified Go program in a Gist, with the output of the above RPC request pasted in (fuckedString): https://gist.github.com/tinybike/602213d256f82b69b86fafef6f5e22d0

Here's the output from the JavaScript code snippet you posted:

100
200
300
400
500
600
700
800
900
1000
1100
/home/jack/src/ipc-test/node_modules/oboe/dist/oboe-node.js:610
      maxActual = Math.max(maxActual, textNode.length);
                                              ^

TypeError: Cannot read property 'length' of undefined
    at checkBufferLength (/home/jack/src/ipc-test/node_modules/oboe/dist/oboe-node.js:610:47)
    at handleData (/home/jack/src/ipc-test/node_modules/oboe/dist/oboe-node.js:1025:7)
    at applyEach (/home/jack/src/ipc-test/node_modules/oboe/dist/oboe-node.js:496:20)
    at Object.emit (/home/jack/src/ipc-test/node_modules/oboe/dist/oboe-node.js:1929:10)
    at Socket.<anonymous> (/home/jack/src/ipc-test/node_modules/oboe/dist/oboe-node.js:1128:34)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at readableAddChunk (_stream_readable.js:176:18)
    at Socket.Readable.push (_stream_readable.js:134:10)
    at Pipe.onread (net.js:551:20)

Output from the Go program:

Encode failed write unix socket.ipc->@: write: broken pipe

Stopping here because the program crashed. The next step would be to add some real variation to the data (both its contents and its length).

(I should add, I'm not a Go developer, so please let me know if I'm messing something up / if this is an unfair example.)

bas-vk commented 7 years ago

That isn't a fair example. You append fuckedString (length: 427584) in each iteration of the loop, serialize it to JSON, and then send it. That data stream is enormous and I'm pretty sure you hit an OS limit (there is an OS-dependent limit on what unix sockets can buffer before writing more to them fails), and adding a delimiter, length indication or whatever doesn't make any difference.

karalabe commented 7 years ago

I'm not sure it's a socket limitation; I'd expect it to block. But having a single 500MB JSON message may nonetheless not be the fairest test case for RPC responses. It would be nice to explore whether there are any buffer limitations in place inside the JS library, but I'd bet on hitting that. I'd also check system memory to see if you are hitting some nodejs limit, preventing further memory from being allocated in your test case.


MicahZoltu commented 7 years ago

I feel like this conversation is getting off track. My goal with this issue wasn't to get help writing a concatenated JSON parser in a particular language; that is something I am capable of. The purpose of this issue was to try to improve the client developer experience by encouraging Geth/Parity to offer an easier-to-implement API. A quick side note about oboe: it doesn't handle multi-byte UTF-8 sequences split on buffer boundaries; this is one of the many subtle bugs that make implementing concatenated JSON parsers difficult. However, I am not looking to use this issue to troubleshoot this particular problem, nor any other problem I have brought up.

I believe the point of disconnect between the two "camps" here is "how common is concatenated JSON". Based on some of the comments I have seen throughout this thread, there seems to be a misconception that "Streaming JSON" is synonymous with "Concatenated JSON". I don't believe this to be the case, and Google and NPM searches seem to agree with me. Streaming JSON, in most cases, refers to a single well-defined JSON object that is received and processed as a stream of bytes/characters rather than as a single collection of bytes/characters, with the JSON object processed piecemeal as it comes in rather than all at once. This type of "stream" processing is very important when you are trying to receive/parse a 4GB streamed JSON array and you want to parse, handle, and discard elements as they come in, or if you are trying to write a message stream processor that receives large JSON objects and you want to avoid old-gen memory pressure in a garbage-collected language.

The thing that many of you appear to be talking about is Concatenated JSON. To the best of my knowledge this is fairly uncommon because the simpler (and more performant) solution to dealing with this sort of thing is enveloping or at least a sentinel sequence. I actually did go looking for a JavaScript concatenated JSON parser before opening up this issue. I didn't try oboe because the documentation describes it as a stream JSON processor (common use of the word) and nowhere do they claim to support concatenated JSON. There is even an open issue requesting support for concatenated JSON: https://github.com/jimhigson/oboe.js/issues/44#issuecomment-57899373 (oboe author calls it Multi-JSON). The fact that it works at all for concatenated JSON (ignoring the bug I found) is surprising.

Are there other arguments against enveloping besides the belief that Concatenated JSON is common/easy? If so, I would like to hear them. If not, then we should focus any further discussion on that. Regarding the ease of concatenated JSON parsing, my arguments are generally around a spec/RFC compliant concatenated JSON parser which means it must support UTF-8, whitespace (outside of tokens), etc. If Geth/Parity are asserting that this is not necessary, then I request that the documentation be updated to not only refer to the JSON-RPC spec but also include the additional constraints that the Ethereum JSON-RPC spec holds (e.g., ASCII-7 only, no whitespace, etc.). All of this started because I tried to write a spec compliant parser and ran into trouble after trouble and couldn't find a library that would do it for me (correctly).

TL;DR: My argument is that a concatenated JSON parser is hard to implement correctly, hard to find libraries for, and not commonly used, and should be traded in for an easier-to-implement protocol. The primary counter-argument is that it is common and finding libraries is easy. I believe this is because of a misconception about the definition of "Streaming JSON".


oboe bug (for reference only; please don't try to troubleshoot this particular problem in this thread):

var net = require("net");
var oboe = require("oboe");

var server = net.createServer(function (serverSocket) {
  var response = JSON.stringify({ jsonrpc: "2.0", id: 1, result: "☃" }) + JSON.stringify({ jsonrpc: "2.0", id: 1, result: "☃" });
  var buffer = Buffer.from(response, "utf8");
  // split the buffer part-way through the ☃ code-point
  var firstHalfOfBuffer = buffer.subarray(0, buffer.byteLength / 2 - 3);
  var secondHalfOfBuffer = buffer.subarray(buffer.byteLength / 2 - 3);
  // submit the first part of the buffer, it will end half way through the ☃
  serverSocket.write(Buffer.from(firstHalfOfBuffer));
  // submit the rest of the buffer in a second, to ensure it reaches the client separately
  setTimeout(function () {
    serverSocket.write(Buffer.from(secondHalfOfBuffer));
  }, 1000);
});
server.listen("\\\\.\\pipe\\socket.ipc") // Windows named pipe path

var clientSocket = net.createConnection("\\\\.\\pipe\\socket.ipc", function () {
  oboe(clientSocket).done(function (jso) {
    console.log("Client 1 Received: " + JSON.stringify(jso));
  });
});

output:

Client 1 Received: {"jsonrpc":"2.0","id":1,"result":"���"}
Client 1 Received: {"jsonrpc":"2.0","id":1,"result":"☃"}
tinybike commented 7 years ago

The following RPC methods return non-hex-encoded strings, which (if they are not strictly ASCII-7) are in principle vulnerable to the splitting issue @MicahZoltu raised:

parity_accountsInfo is the only one of these methods that will return arbitrary user-specified strings. net_version and eth_protocolVersion seem harmless, as they only return stringified base-10 numbers. web3_clientVersion returns a (non-hex-encoded) string like Geth/v1.5.9-stable-a07539fb/linux/go1.7.3 -- is there something forcing this version string to be ASCII-7? (What if a Chinese developer writes an Ethereum client and names it something in Chinese?)

One workaround that doesn't involve changing IPC itself is to convert to hexadecimal character codes prior to transmission, then re-encode as UTF-8 upon receipt. Augur supports full UTF-8 for things like market descriptions this way.
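For illustration (a minimal sketch, not Augur's actual implementation), such a hex round trip in Node could look like this:

function utf8ToHex(str) {
  return "0x" + Buffer.from(str, "utf8").toString("hex");
}

function hexToUtf8(hex) {
  return Buffer.from(hex.replace(/^0x/, ""), "hex").toString("utf8");
}

var hex = utf8ToHex("Will it snow ☃ in March?");
console.log(hex);             // "0x57696c6c..."
console.log(hexToUtf8(hex));  // "Will it snow ☃ in March?"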

Anyway, all this said, AFAIK this issue is not directly relevant to Augur, so I'm not going to keep pushing. My goal here is just to make sure the geth and parity teams are aware of this issue; it's up to you guys to decide what (if anything) to do about it.

maciejhirsz commented 7 years ago

Regarding the UTF-8 concerns: all non-ASCII UTF-8 code points begin with a header byte that has at least the first two bits set, followed by 1 to 3 bytes whose high bits are exactly 10. In other words, all non-ASCII code points are represented by bytes that have the high bit set, so they can never contain ASCII characters, which is very much by design.

When parsing JSON, if all you are concerned with are things like depth-changing tokens ({, }, [, ]) and string-encoding tokens (", \), then whether the entire stream is ASCII or UTF-8 makes absolutely no difference to the parser. Once you de-concatenate a single JSON payload from the stream, it would be wise (if not required) to verify that it's valid UTF-8, but this can be done in a separate step and would have to be done for enveloped messages as well.
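A compact sketch of that kind of byte-level de-concatenation (assumptions: top-level messages are objects or arrays, and UTF-8 validation of each extracted message happens in a separate step):

// Only ", \, {, }, [ and ] matter for finding boundaries; bytes belonging to
// multi-byte UTF-8 sequences all have the high bit set and can never be
// confused with them.
function splitConcatenatedJson(buffer) {
  var messages = [], depth = 0, inString = false, escaped = false, start = 0;
  for (var i = 0; i < buffer.length; i++) {
    var b = buffer[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (b === 0x5c) escaped = true;       // backslash
      else if (b === 0x22) inString = false;     // closing quote
    } else if (b === 0x22) {                     // opening quote
      inString = true;
    } else if (b === 0x7b || b === 0x5b) {       // { or [
      depth++;
    } else if (b === 0x7d || b === 0x5d) {       // } or ]
      if (--depth === 0) {
        messages.push(buffer.subarray(start, i + 1));
        start = i + 1;
      }
    }
  }
  return { messages: messages, rest: buffer.subarray(start) };
}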

NikVolf commented 7 years ago

After reading all the comments, we are happy to add a \n separator to the server response and guarantee that there will always be valid JSON between these separators. @karalabe and the other guys have valid concerns, to my mind, against further enveloping the messages with a payload length.
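For clients, a minimal reader for that newline-separated framing could look like this (a sketch only; it assumes, as stated above, that the server never emits a raw \n inside a serialized message, and the socket path is purely illustrative):

var net = require("net");

function connect(path, onMessage) {
  var socket = net.createConnection(path);
  var pending = "";
  socket.setEncoding("utf8");  // Node buffers partial multi-byte UTF-8 sequences
  socket.on("data", function (chunk) {
    pending += chunk;
    var lines = pending.split("\n");
    pending = lines.pop();     // keep the unfinished tail for the next chunk
    lines.filter(Boolean).forEach(function (line) {
      onMessage(JSON.parse(line));
    });
  });
  return socket;
}

// connect("./jsonrpc.ipc", console.log);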