juanjoDiaz / streamparser-json

Streaming JSON parser in JavaScript for Node.js and the browser
MIT License

Tokenizer token offset is incorrect #24

Closed tegefaulkes closed 1 year ago

tegefaulkes commented 1 year ago

The Tokenizer outputs the wrong offset for tokens that follow a string token containing special characters. The offset discrepancy is consistent with the number of such special characters in the input string.

Some examples

This is the expected behaviour

const streamParser = require('@streamparser/json'); // import needed to run these examples

test('testing string 1', async () => {
  const json = JSON.stringify({"abcd": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[7])  
})  

// raw string length:  15
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: 'abcd', offset: 1 }  
// { token: 4, value: ':', offset: 7 }  // Using this token as the reference
// { token: 9, value: 'abcd', offset: 8 }  
// { token: 1, value: '}', offset: 14 }  
// :  // We print the expected character

Using a single \t special character

test('testing string 2', async () => {  
  const json = JSON.stringify({"ab\t": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[6])  
})  

// raw string length:  15 // Same length as above
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: 'ab\t', offset: 1 }  
// { token: 4, value: ':', offset: 6 } // Off by 1 now
// { token: 9, value: 'abcd', offset: 7 }  
// { token: 1, value: '}', offset: 13 }  
// " // This isn't the character we expected

The difference in expected output is consistent with the number of special characters

test('testing string 3', async () => {  
  const json = JSON.stringify({"\t\n": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[5])  
})  

// raw string length:  15  // Same length
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: '\t\n', offset: 1 }  
// { token: 4, value: ':', offset: 5 }  // Off by 2 now
// { token: 9, value: 'abcd', offset: 6 }  
// { token: 1, value: '}', offset: 12 }  
// n

My expectation is that the offset should be relative to the raw input. I understand that this is a niche use case, but is this something you can fix?

juanjoDiaz commented 1 year ago

Hi @tegefaulkes ,

streamparser-json moves the offset based on the parsed string. This means that \t or \n count as 1 character.
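For example, the escape sequence \t occupies two characters in the raw JSON text but parses to a single tab character:

// The raw JSON text "ab\t" has 4 characters between the quotes ('a', 'b', '\', 't'),
// but the parsed string is only 3 characters ('a', 'b', TAB):
console.log(JSON.parse('"ab\\t"').length); // 3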

Now you are making me wonder if that is the correct way of counting or not.... 🤔

Can you elaborate a bit on the use case that made you realize this? How is this affecting you?

tegefaulkes commented 1 year ago

Sure thing.

What I was attempting to do is take a stream of JSON objects that had been stringified and concatenated together in the form {jsonObject}{jsonObject}, and separate them by taking the subarray of the first object. For that I needed to find the offset of the first closing bracket '}'. But since I needed to take the subarray of the raw input string, the offset provided by the tokenizer would be wrong if the object contained any special characters. This meant I couldn't take the correct subarray without accounting for the special characters, which was very inefficient for my use case.

After some digging I worked out that I could extract these top-level objects with the JSONParser if I provided an empty string '' for the separator option.
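For reference, a minimal sketch of that approach (the input string here is hypothetical, and the onValue callback signature may vary between library versions):

const { JSONParser } = require('@streamparser/json');

// separator: '' tells the parser to accept concatenated JSON documents.
const parser = new JSONParser({ separator: '' });
parser.onValue = (value, key, parent, stack) => {
  // Top-level values have an empty ancestor stack.
  if (stack.length === 0) console.log('top-level value:', value);
};
parser.write('{"a":1}{"b":2}');
// top-level value: { a: 1 }
// top-level value: { b: 2 }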

Is the offset relative to the parsed JSON? If it's already parsed at that stage, then the string offset wouldn't be useful, right? I figured the offset value should be correct for the raw input, or at least an offset value relative to the raw input should be available.

juanjoDiaz commented 1 year ago

Oh, I see.

You are definitely better off trusting the library to do the split for you. There can be nested objects with their own closing brackets, closing brackets within strings, etc. The library and the separator option take care of all that for you.
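To illustrate why a naive search for '}' breaks (hypothetical input):

const raw = '{"a":"}"}';
// The first '}' found is inside the string value, not the end of the object.
console.log(raw.indexOf('}')); // 6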

I also noticed that you are only interested in the top-level object. You should set the paths option to ['$'] and the parser will only emit the top-level object and no sub-objects, so you don't need to check whether the parent in the stack is null.
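A sketch combining both options (hypothetical input; see the library docs for the exact callback shape in your version):

const { JSONParser } = require('@streamparser/json');

// paths: ['$'] restricts emission to the root value, so no stack check is needed;
// separator: '' still handles the concatenated documents.
const parser = new JSONParser({ separator: '', paths: ['$'] });
parser.onValue = (value) => console.log(value);
parser.write('{"a":{"b":"}"}}{"c":3}');
// { a: { b: '}' } }
// { c: 3 }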

Also, as someone suggested in your PR, you should look into using the WHATWG wrapper instead of the raw library.

Regarding the offset, I think that you are right about it being wrong and I will look into fixing it.

tegefaulkes commented 1 year ago

Thanks for the suggestion.

juanjoDiaz commented 1 year ago

Following up on this.

You made me doubt it, but the current logic is correct. The offset property counts the bytes in the stream, not the characters in the string that those bytes represent.

So in your example

test('testing string 2', async () => {  
  const json = JSON.stringify({"ab\t": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[6])  
})  

// raw string length:  15 // Same length as above
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: 'ab\t', offset: 1 }  
// { token: 4, value: ':', offset: 6 } // This is correct because \t is a single byte character. (See https://www.compart.com/en/unicode/U+0009) 
// { token: 9, value: 'abcd', offset: 7 }  
// { token: 1, value: '}', offset: 13 }  
// " // This isn't the character we expected

For example:

  const json = JSON.stringify({"vд": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[7])  

// raw string length: 13
// {token: 0, value: "{", offset: 0}
// {token: 9, value: "vд", offset: 1}
// {token: 4, value: ":", offset: 6} // this is correct because д is a single 2-byte character (See https://www.compart.com/en/unicode/U+0434)
// {token: 9, value: "abcd", offset: 7}
// {token: 1, value: "}", offset: 13}
// "a"

So, offset works just as intended. Having a char offset instead of a byte offset is something that could be added to the library if needed.
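The distinction is easy to see in plain JavaScript (TextEncoder is standard in both Node.js and browsers):

const s = 'д';
console.log(s.length);                           // 1 (one UTF-16 code unit / character)
console.log(new TextEncoder().encode(s).length); // 2 (two UTF-8 bytes)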

What is clear is that documentation could be a lot better 😅

juanjoDiaz commented 1 year ago

Closing as this was clarified. Feel free to reopen if you think that there are still open questions.