juanjoDiaz / streamparser-json

Streaming JSON parser in Javascript for Node.js and the browser
MIT License
133 stars 11 forks source link

Message parser should be able to support arbitrary whitespace such as '\n', '\t', '\r', and ' ' within and between messages #35

Open addievo opened 1 year ago

addievo commented 1 year ago

How do I configure the stream parser to be able to discard whitespace between JSON messages?

I am facing an issue during implementation of a RPC system, which requires usage of JSONParser to parse binary to JSON for transmitting thru RPC. The issue I am facing is that input streams could be separated by a variety of different whitespaces, and the current implementation of seperator in the lib only supports a single separator.

Assuming as such, we have to keep out input streams in the following manner : {...message}{...message}.

However, to improve readability, we wish to be able to add whitespaces in between messages, in a similar manner as demonstrated below.

// Normal separation
{...message}{...message}{...message}

// Spaces
{...message} {...message}    {...message}

// New lines
{...message}
{...message}
{...message}

// Any combination
{...message}                                  {...message}
                       {...message}
{...message}
juanjoDiaz commented 1 year ago

Hi @addievo,

There is a separator option that you can use to set whatever you want to be the separation between the JSON objects. You could set the separator to /n, /n/n or whatever you want.

Would that work for you?

addievo commented 1 year ago

Hi @juanjoDiaz,

Thank you for your reply. While the separator option does exist, it appears limited to recognizing only one type of separator between JSON messages. In our use-case, we need the parser to accommodate a range of whitespace characters such as '\n', '\t', '\r', and ' ' as valid separators between messages. Is it possible to extend the functionality to support multiple types of separators?

CMCDragonkai commented 1 year ago

@juanjoDiaz is it possible to make separator option into a regex string? That way we could easily capture any kind of characters we expect to be whitespace?

juanjoDiaz commented 1 year ago

I think that regex is too flexible. It will have a big performance impact for such a concrete use case.

I'm thinking of allowing an array of separators. Since you can chain as many separators as you want, it will solve your case.

juanjoDiaz commented 1 year ago

I implemented this but the performance impact is too big for me to accept it.

So I need to think of a way to keep the current performance for people that is fine working with a single separator while allowing multiple separators for those that can accept the performance impact.

CMCDragonkai commented 1 year ago

Wouldn't the performance only be impacted if there were actual whitespace between the json strings? And also if I give an array of 1 character wouldn't that be the same as just specifying a single character?

And when you say we can chain whitespace, do you mean that if there is more than 1 whitespace character in the stream, it will auto-drop them all?

addievo commented 1 year ago

Hey @juanjoDiaz,

What about starting off with a quick check to see how many types separators are in the array? If it's just one, we could stick to the fast route we already have. If there's more, then we switch to the slower, but more flexible, multi-separator logic. This way, we get the best of both worlds without a big hit on performance. What do you think?

CMCDragonkai commented 1 year ago

Just to clarify the input stream may have 0 or more whitespace characters. The idea is for the stream parser to completely remove them all regardless of how many there is. So we cannot fix the number of expected whitespace separators like '\n\n'.

juanjoDiaz commented 1 year ago

Wouldn't the performance only be impacted if there were actual whitespace between the json strings? And also if I give an array of 1 character wouldn't that be the same as just specifying a single character?

During tokenization, we need to check for every character to see if it is a separator o no. Checking a single character is very fast. Checking a sequence of characters is a bit slower but still fast. Checking against a list of possible characters or character sequences is also slower.

Considering that you might be streaming a million JSON object separated by a simple character, the overhead of checkign against the list piles up and becomes very noticeable.

And when you say we can chain whitespace, do you mean that if there is more than 1 whitespace character in the stream, it will auto-drop them all?

Yes. If your separator is "\n", any number of new lines will be skipped. (e.g.: \n\n\n\n). There is a unit test specific for this.

The problem is with matching "\n" or "\s".

What about starting off with a quick check to see how many types separators are in the array? If it's just one, we could stick to the fast route we already have. If there's more, then we switch to the slower, but more flexible, multi-separator logic. This way, we get the best of both worlds without a big hit on performance. What do you think?

Yes. This is what I want to try next.

addievo commented 1 year ago

Hey @juanjoDiaz,

Thanks a bunch for your reply and assistance. Looking forward to having the array based separator.

KWLandry-acoustic commented 5 months ago

Hi @juanjoDiaz , I think I might be running into something similar, What about a 'Transform' callback where I could parse the chunk coming through in the read stream and 'transform' any whitespace or other characters as in chunk.replace(x,y) (or whatever would be appropriate for what's actually made available) This places the performance hit on my code (with plenty of caveats in the documentation for this capability). Of course, I would want access to the whole chunk or buffer about to be parsed not each character at a time.

juanjoDiaz commented 5 months ago

Hi @KWLandry-acoustic ,

You issue is not related.

I just realized that I'm an idiot and this feature was supported since the very beginning 🤦‍♂️

You just need to set the separator to the empty string '' and all whitespaces will be allowed as separator.

KWLandry-acoustic commented 5 months ago

@juanjoDiaz Excellent, thanks! I do remember reading this somewhere, in the docs perhaps, but it did not translate, or actually I remember I ran into a problem with it and had to specify a separator, Regardless, in any case, I'll set and test that, Thanks, KWL